## Model sizes
We can split local models into three tiers based on their size and capabilities, balancing speed, intelligence, and world knowledge:

|  | Small Model | Medium Model | Large Model |
|---|---|---|---|
| Speed | Fast | Moderate | Slow |
| Intelligence | Basic | Good | Best |
| World Knowledge | Limited | Moderate | Broad |
| Recommended model | Gemma 3 4B IT QAT | Mistral Small 3.2 | qwen3-30b-a3b-instruct-2507 |
## Run models locally
You need to run a program on your computer that serves models to Cellm. We call these programs “providers”. Cellm supports Ollama, Llamafiles, and vLLM, as well as any OpenAI-compatible provider. If you don’t know any of these names, just use Ollama.

### Ollama
To get started with Ollama, we recommend you try out the Gemma 3 4B IT QAT model, which is Cellm’s default local model.
**1. Install Ollama**

Download and install Ollama. Ollama will start after the install and automatically run whenever you start up your computer.
**2. Download the model**

Open Windows Terminal (open start menu, type Windows Terminal, and click OK), then run the command below. Wait for the download to finish.
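The model tag matches the name shown in Cellm’s model dropdown, so the pull command looks like this:

```powershell
# Download Gemma 3 4B IT QAT
ollama pull gemma3:4b-it-qat
```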
**3. Test in Excel**

In Excel, select ollama/gemma3:4b-it-qat from the model dropdown menu, and type a test prompt like the one below. The model will tell you that it is called “Gemma” and made by Google DeepMind.
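A minimal test, using Cellm’s PROMPT formula (the exact wording of the question doesn’t matter):

```excel
=PROMPT("Which model are you, and who made you?")
```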
You can use any model that Ollama supports. See https://ollama.com/search for a complete list.
### Llamafile
Llamafile is a project by Mozilla that combines llama.cpp with Cosmopolitan Libc, enabling you to download and run a single-file executable (called a “llamafile”) that runs locally on most computers, with no installation.
**1. Download a llamafile**

Download a llamafile from https://github.com/Mozilla-Ocho/llamafile (e.g. Gemma 3 4B IT).
**2. Rename the file**

Append `.exe` to the filename. For example, `google_gemma-3-4b-it-Q6_K.llamafile` should be renamed to `google_gemma-3-4b-it-Q6_K.llamafile.exe`.
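If you already have a terminal open, a PowerShell rename along these lines does the same thing (the filename matches the example above):

```powershell
# Append .exe so Windows treats the llamafile as an executable
Rename-Item google_gemma-3-4b-it-Q6_K.llamafile google_gemma-3-4b-it-Q6_K.llamafile.exe
```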
**3. Run the llamafile**

Open Windows Terminal (open start menu, type Windows Terminal, and click OK) and run the first command below. To offload inference to your NVIDIA or AMD GPU, run the second instead.
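A sketch of both commands, assuming llamafile’s llama.cpp-style flags (`--server` starts the local API server, `--nobrowser` skips opening a browser tab, and `-ngl 9999` offloads all model layers to the GPU); check `--help` on your llamafile if the flags differ:

```powershell
# CPU only: start the local server on http://localhost:8080
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --nobrowser

# With GPU: additionally offload all layers to an NVIDIA or AMD GPU
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --nobrowser -ngl 9999
```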
**4. Configure Cellm**

Start Excel and select the OpenAiCompatible provider from the model drop-down on Cellm’s ribbon menu. Enter any model name, e.g., “gemma”. Llamafiles ignore the model name, since each llamafile serves only one model, but a name is required by the OpenAI API. Set the Base Address to http://localhost:8080.

### Dockerized Ollama
If you prefer to run models via Docker, both Ollama and vLLM are packaged up with Docker Compose files in the docker/ folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.
**1. Clone the repository**

Clone the repository with the command below.
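A sketch, assuming the repository lives at github.com/getcellm/cellm (use the repository you installed Cellm from):

```powershell
# Clone the Cellm repository and enter the docker/ folder
git clone https://github.com/getcellm/cellm.git
cd cellm/docker
```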
**2. Start Ollama container**

Run the first command below in the docker/ directory. To use your GPU for faster inference, run the second command instead. To stop the container, run the third.
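A sketch of the three commands; the compose file names are assumptions, so check the docker/ folder for the exact names:

```powershell
# Start the Ollama container in the background
docker compose -f docker-compose.Ollama.yml up --detach

# Start with GPU support (adds the GPU override file)
docker compose -f docker-compose.Ollama.yml -f docker-compose.Ollama.GPU.yml up --detach

# Stop the container
docker compose -f docker-compose.Ollama.yml down
```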
**3. Configure Cellm**

Start Excel and select the openaicompatible provider from the model drop-down on Cellm’s ribbon menu. Enter the model name you want to use, e.g., gemma3:4b-it-qat. Set the Base Address to http://localhost:11434. To use other models, download them by running e.g. ollama run mistral-small3.1:24b in the container.
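For example, assuming the Ollama container is named ollama in the compose file:

```powershell
# Pull and run another model inside the running Ollama container
docker exec -it ollama ollama run mistral-small3.1:24b
```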
### Dockerized vLLM
If you want to speed up running many requests in parallel, you can use vLLM instead of Ollama. vLLM requires a Hugging Face API key to download models from the Hugging Face Hub.
**1. Set up Hugging Face API key**

You must supply the Docker Compose file with a Hugging Face API key, either via an environment variable or by editing the Docker Compose file directly. Look at the vLLM Docker Compose file for details. If you don’t know what a Hugging Face API key is, just use Ollama instead.
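For the environment-variable route, something like this works in PowerShell (the variable name HF_TOKEN is an assumption; check the compose file for the name it actually reads):

```powershell
# Set the Hugging Face API key for the current terminal session
$Env:HF_TOKEN = "<your Hugging Face API key>"
```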
**2. Start vLLM container**

Run the command below in the docker/ directory. If you want to use another model, change the --model argument in the Docker Compose file to another Hugging Face model.
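A sketch of the start command; the compose file name is an assumption, so check the docker/ folder for the exact name:

```powershell
# Start the vLLM container in the background
docker compose -f docker-compose.vLLM.yml up --detach
```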