How to Run LLMs Locally: Complete Guide for 2026


Last updated: February 2026

Running AI models on your own machine means no API costs, no rate limits, no data leaving your computer, and no company deciding what you can or can’t ask. The tradeoff: you need decent hardware and some technical comfort.

Good news — it’s gotten dramatically easier. Here’s everything you need to know.

Why Run Locally?

Privacy. Your prompts never leave your machine. No training on your data. No logs on someone else’s server. For lawyers, doctors, journalists, and anyone handling sensitive information, this matters.

Cost. After the hardware investment, every query is free. If you’re making hundreds of API calls per day, local inference pays for itself in months.

No limits. No rate limiting, no usage caps, and no provider-imposed content policies. The model itself still carries whatever guardrails it was trained with, but there's no service layer sitting between you and it.

Offline access. Works on a plane, in a bunker, during an internet outage. The model is on your disk.

Speed. For small-to-medium models on good hardware, local inference can be faster than cloud APIs (no network latency).

What Hardware Do You Need?

The key bottleneck is RAM (for CPU inference) or VRAM (for GPU inference). LLMs are big. Here’s what different setups can handle:

GPU Inference (Faster)

GPU               | VRAM | Max Model Size | Performance
RTX 3060          | 12GB | 7-13B params   | Good for small models
RTX 4060 Ti       | 16GB | 13B params     | Solid mid-range
RTX 4070 Ti Super | 16GB | 13B params     | Fast mid-range
RTX 4090          | 24GB | 30B params     | Excellent
RTX 5090          | 32GB | 40B+ params    | Best consumer GPU
2x RTX 4090       | 48GB | 70B params     | Enthusiast setup

Apple Silicon (Unified Memory)

Mac                | Memory | Max Model Size | Performance
M1/M2 8GB          | 8GB    | 7B params      | Usable but slow
M1/M2 Pro 16GB     | 16GB   | 13B params     | Good
M2/M3 Pro 32GB     | 32GB   | 30B params     | Very good
M3/M4 Max 64GB     | 64GB   | 70B params     | Excellent
M2/M4 Ultra 128GB+ | 128GB+ | 100B+ params   | Run almost anything

CPU Only (Slowest, but works)

Any machine with 16GB+ RAM can run small models (7B) on CPU. It’s slow (2-5 tokens/second) but functional. 32GB+ RAM opens up 13B models. 64GB+ can handle 30B.

My recommendation for getting started: If you have a Mac with 16GB+ memory or a PC with an RTX 3060+, you’re good to go. Don’t buy hardware just for this — try it with what you have first.
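
Not sure what you're working with? A quick way to check from a terminal (the last command assumes an NVIDIA GPU with its drivers installed):

# Linux: show total RAM
free -h

# macOS: show total unified memory (reported in bytes)
sysctl hw.memsize

# NVIDIA GPUs: show VRAM and current usage
nvidia-smi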

The Software Stack

Option 1: Ollama (Easiest — Start Here)

Ollama is the “Docker for LLMs.” One command to install, one command to run a model. It handles downloading, serving, and picking a sensibly quantized build of each model automatically.

Install:

# Mac/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download from ollama.com

Run a model:

# Download and run Llama 3.1 8B
ollama run llama3.1

# Download and run DeepSeek V3 (huge 671B model; needs far more memory than a typical desktop)
ollama run deepseek-v3

# Download and run a coding model
ollama run codellama

# List the models you've downloaded
ollama list

That’s it. You’re running a local LLM. Type your prompt, get a response.
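
Ollama also exposes a local HTTP API (port 11434 by default), so you can script against it instead of typing into the terminal. A minimal sketch using curl:

# Send a one-off prompt to the local Ollama API and get a single JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantization in one sentence.",
  "stream": false
}'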

Best models to start with on Ollama:

Model             | Size  | Good For                 | Min RAM
llama3.1:8b       | 4.7GB | General purpose          | 8GB
mistral           | 4.1GB | Fast, good quality       | 8GB
codellama:13b     | 7.4GB | Coding                   | 16GB
deepseek-coder-v2 | 8.9GB | Coding                   | 16GB
llama3.1:70b      | 40GB  | Best open-source quality | 48GB+
qwen2.5:32b       | 18GB  | Excellent multilingual   | 32GB
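
You can also pull any of these ahead of time, so the download doesn't interrupt your first conversation:

# Fetch a model without starting a chat session
ollama pull qwen2.5:32b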

Option 2: LM Studio (Best GUI)

If you prefer a graphical interface over the terminal, LM Studio is excellent. It’s a desktop app that lets you browse, download, and run models with a ChatGPT-like interface.

Setup:

  1. Download from lmstudio.ai
  2. Browse the model catalog
  3. Click download on any model
  4. Click “Chat” and start talking

LM Studio also runs a local API server compatible with the OpenAI API format — meaning any tool that works with ChatGPT’s API can be pointed at your local model instead.
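
For example, once the server is running, a plain curl request works against it. This is a sketch assuming the default address (http://localhost:1234/v1); the model name is a placeholder and must match whatever you've loaded in LM Studio:

# Chat completion against LM Studio's local OpenAI-compatible server
# (replace the model name with the one LM Studio shows for your loaded model)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello from my own machine"}]
  }'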

Option 3: llama.cpp (Most Control)

llama.cpp is the engine that powers both Ollama and LM Studio under the hood. Running it directly gives you maximum control over inference parameters, quantization, and performance tuning.

When to use llama.cpp directly:

  • You want to fine-tune performance for specific hardware
  • You’re building a production application
  • You need features that Ollama/LM Studio don’t expose
  • You want to run on unusual hardware (old GPUs, embedded systems)

For most people, Ollama or LM Studio is sufficient. llama.cpp is for when you need to go deeper.
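
If you do go deeper, the basic workflow is to download a GGUF model file and point the CLI or server at it. A minimal sketch (the model filename here is just a placeholder, and binary names and flags have shifted between llama.cpp releases, so check the project's README for your version):

# One-shot prompt: -m model file, -p prompt, -n max new tokens,
# -ngl number of layers to offload to the GPU
./llama-cli -m models/llama-3.1-8b-q4_k_m.gguf -p "Hello" -n 128 -ngl 35

# Or serve an OpenAI-compatible API on port 8080
./llama-server -m models/llama-3.1-8b-q4_k_m.gguf --port 8080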

Option 4: vLLM (Production Serving)

If you’re running local models for a team or application (not just personal use), vLLM provides high-throughput serving with features like continuous batching, PagedAttention, and OpenAI-compatible API endpoints.

When to use vLLM:

  • Serving models to multiple users
  • Building applications that need high throughput
  • Production deployments on your own servers
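
A minimal serving sketch, assuming a recent vLLM release (which ships the vllm serve command), a supported NVIDIA GPU, and access to the model on Hugging Face (gated models require a token):

# Install vLLM and start an OpenAI-compatible server on port 8000
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000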

The Best Local Models Right Now

General Purpose

  1. Llama 3.1 70B — Best overall open-source model. Needs 48GB+ RAM/VRAM.
  2. Qwen 2.5 32B — Excellent quality, great multilingual support (especially Chinese). Needs 32GB.
  3. Mistral Large — Strong reasoning, good instruction following. Mistral also publishes smaller open-weight models if your hardware is limited.
  4. DeepSeek V3 — Rivals GPT-4 on many tasks. A 671B-parameter mixture-of-experts model, so it needs far more memory than typical consumer hardware.

Coding

  1. DeepSeek Coder V2 — Best open-source coding model.
  2. CodeLlama 34B — Solid, well-tested.
  3. Qwen 2.5 Coder — Excellent for multiple languages.

Small and Fast (for limited hardware)

  1. Llama 3.1 8B — Best quality at 8B size.
  2. Mistral 7B — Fast, efficient, good quality.
  3. Phi-3 Mini — Microsoft’s small model, surprisingly capable.

Quantization: The Magic Trick

Full-precision models are huge. A 70B parameter model at 16-bit precision (about 2 bytes per parameter) needs ~140GB of memory. Nobody has that in a consumer machine.

Quantization compresses models by reducing numerical precision. The quality loss is minimal, but the size reduction is dramatic:

Quantization | Size Reduction | Quality Loss                | When to Use
Q8 (8-bit)   | ~50%           | Negligible                  | When you have enough RAM
Q6_K         | ~60%           | Very small                  | Good balance
Q5_K_M       | ~65%           | Small                       | Recommended default
Q4_K_M       | ~75%           | Noticeable on complex tasks | When RAM is tight
Q3_K         | ~80%           | Significant                 | Last resort

Rule of thumb: Use Q5_K_M or Q4_K_M. These give you the best quality-to-size ratio. Ollama and LM Studio handle quantization automatically — you don’t need to think about it unless you want to optimize.
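
If you do want to choose, Ollama exposes quantization levels as model tags. Exact tag names vary per model (check the tags list on the model's ollama.com page), but they generally look like this:

# Pull a specific quantization variant by tag (tag names differ per model)
ollama run llama3.1:8b-instruct-q4_K_M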

Connecting Local Models to Your Tools

The killer feature of local LLMs: they can replace cloud APIs in your existing tools.

Ollama + Aider (AI coding agent):

aider --model ollama/deepseek-coder-v2

Ollama + Continue (VS Code extension): Install Continue, set the model to your local Ollama endpoint. Free AI coding assistance in your editor.

LM Studio + Any OpenAI-compatible tool: LM Studio runs a local server at http://localhost:1234/v1. Point any tool that accepts an OpenAI API endpoint to this address.

Ollama + Open WebUI: A self-hosted ChatGPT-like interface for your local models. Beautiful UI, conversation history, multiple models.

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
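
The -p 3000:8080 flag maps the web UI to port 3000 on your machine, and --add-host lets the container reach the Ollama server running on the host. Once the container is up, open http://localhost:3000 in a browser; it should pick up your local Ollama models automatically.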

Getting Started: The 15-Minute Path

  1. Install Ollama (1 minute)
  2. Run ollama run llama3.1 (5 minutes to download, then instant)
  3. Start chatting (0 minutes)
  4. Try ollama run deepseek-coder-v2 for coding tasks
  5. Install Open WebUI if you want a nice chat interface

Total time: 15 minutes. Total cost: $0.

You now have a private, unlimited, free AI assistant running on your own hardware. Welcome to the future.


All tools mentioned are free or open source: