Model sizes
We can split local models into three tiers based on their size and capabilities, balancing speed, intelligence, and world knowledge:

| | Small Model | Medium Model | Large Model |
|---|---|---|---|
| Speed | Fast | Moderate | Slower |
| Intelligence | Basic | Good | Best |
| World Knowledge | Limited | Moderate | Broad |
| Recommended model | Gemma 3 4B IT QAT | Mistral Small 3.2 | qwen3-30b-a3b-instruct-2507 |
Run models locally
You need to run a program on your computer that serves models to Cellm. We call these programs “providers”. Cellm supports Ollama, Llamafiles, and vLLM, as well as any OpenAI-compatible provider. If you don’t know any of these names, just use Ollama.

Ollama
To get started with Ollama, we recommend you try out the Gemma 3 4B IT QAT model, which is Cellm’s default local model.

Install Ollama
Download and install Ollama. Ollama will start after the install and automatically run whenever you start up your computer.
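To confirm the install worked, you can check that the Ollama background service is reachable. This is an optional sanity check, assuming Ollama is listening on its default port 11434:

```
# Print the installed Ollama version
ollama --version

# The local Ollama server should respond on its default port
curl http://localhost:11434
```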
Download the model
Open Windows Terminal (open the start menu, type Windows Terminal, and click OK), then run the command below to download Gemma 3 4B QAT. Wait for the download to finish.
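A sketch of the download command, assuming the `gemma3:4b-it-qat` tag that the Ollama library uses for the QAT build of Gemma 3 4B:

```
# Download the Gemma 3 4B QAT model from the Ollama library
ollama pull gemma3:4b-it-qat
```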
You can use any model that Ollama supports. See https://ollama.com/search for a complete list.
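If you want to try one of the larger tiers from the table above, the same `ollama pull` workflow applies. The tags below are assumptions based on the Ollama library's naming; check https://ollama.com/search for the exact tags before pulling:

```
# Medium tier (tag assumed): Mistral Small 3.2
ollama pull mistral-small3.2

# Large tier (tag assumed): Qwen3 30B A3B
ollama pull qwen3:30b-a3b
```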
Llamafile
Llamafile is a project by Mozilla that combines llama.cpp with Cosmopolitan Libc, enabling you to download and run a single-file executable (called a “llamafile”) that runs locally on most computers, with no installation.

Download a llamafile
Download a llamafile from https://github.com/Mozilla-Ocho/llamafile (e.g. Gemma 3 4B IT).
Rename the file
Append .exe to the filename. For example, google_gemma-3-4b-it-Q6_K.llamafile should be renamed to google_gemma-3-4b-it-Q6_K.llamafile.exe.
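If you prefer to rename the file from the terminal rather than in File Explorer, a PowerShell one-liner does the same thing (run it in the folder you downloaded the file to):

```
# Rename the downloaded llamafile so Windows treats it as an executable
Rename-Item google_gemma-3-4b-it-Q6_K.llamafile google_gemma-3-4b-it-Q6_K.llamafile.exe
```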
Run the llamafile

Open Windows Terminal (open the start menu, type Windows Terminal, and click OK) and run the llamafile on the CPU only, or, to offload inference to your NVIDIA or AMD GPU, run it with GPU offloading enabled. Both commands are sketched below.
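A sketch of both commands, assuming the llamafile is started from the folder you downloaded it to and using llama.cpp-style flags as documented by the llamafile project (-ngl controls how many layers are offloaded to the GPU); adjust the filename if you downloaded a different llamafile:

```
# CPU only: start the llamafile's built-in OpenAI-compatible server
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --nobrowser

# With GPU: offload all model layers to an NVIDIA or AMD GPU
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --nobrowser -ngl 999
```

The server listens on http://localhost:8080 by default, which matches the Base Address used in the Cellm configuration below.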
Configure Cellm
Start Excel and select the OpenAiCompatible provider from the model drop-down on Cellm’s ribbon menu. Enter any model name, e.g. “gemma”. Llamafiles ignore the model name since each llamafile serves only one model, but a name is required by the OpenAI API. Set the Base Address to http://localhost:8080.

Dockerized Ollama
If you prefer to run models via Docker, both Ollama and vLLM are packaged up with docker compose files in the docker/ folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.
Start Ollama container
Run the following commands in the docker/ directory to start the container, to start it with your GPU for faster inference, or to stop the container again; a sketch of each command follows below.
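A sketch of the compose commands, assuming compose file names like the ones below; check the docker/ folder for the actual file names before running them:

```
# Start the Ollama container (file name assumed)
docker compose -f docker-compose.ollama.yml up --detach

# Start with GPU support for faster inference (file names assumed)
docker compose -f docker-compose.ollama.yml -f docker-compose.ollama.gpu.yml up --detach

# Stop the container
docker compose -f docker-compose.ollama.yml down
```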
To download and use another model, run e.g. ollama run mistral-small3.1:24b in the container.
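Running a command “in the container” means executing it through Docker. A minimal sketch, assuming the Ollama container is named ollama in the compose file (check the compose file for the actual name):

```
# Execute the Ollama CLI inside the running container (container name assumed)
docker exec -it ollama ollama run mistral-small3.1:24b
```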
Dockerized vLLM
If you want to speed up running many requests in parallel, you can use vLLM instead of Ollama. vLLM requires a Hugging Face API key to download models from the Hugging Face Hub.

Set up Hugging Face API key
You must supply the docker compose file with a Hugging Face API key, either via an environment variable or by editing the docker compose file directly. Look at the vLLM docker compose file for details.

If you don’t know what a Hugging Face API key is, just use Ollama instead.
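As a sketch of the environment-variable route, assuming the compose file reads the key from HF_TOKEN and is named docker-compose.vllm.yml (both names are assumptions; check the file in the docker/ folder for the variable and file names it actually uses):

```
# Pass the Hugging Face API key via an environment variable (variable and file names assumed)
$env:HF_TOKEN = "hf_your_key_here"
docker compose -f docker-compose.vllm.yml up --detach
```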