Cellm supports local models that run on your computer via Llamafiles, Ollama, or vLLM. This ensures none of your data ever leaves your machine. And it’s free. On this page you will learn what to consider when choosing a local model and how to run it.

Model sizes

We can split local models into three tiers based on their size and capabilities, balancing speed, intelligence, and world knowledge:
| | Small Model | Medium Model | Large Model |
| --- | --- | --- | --- |
| Speed | Fast | Medium | Slow |
| Intelligence | Low | Medium | High |
| World Knowledge | Low | Medium | High |
| Recommended model | Gemma 3 4B IT QAT | Mistral Small 3.2 | qwen3-30b-a3b-instruct-2507 |
You need a GPU for any of the medium or large models to be useful in practice. If you don’t have a GPU and small models are insufficient, you can use Hosted Models instead.
In general, smaller models are faster and less intelligent, while larger models are slower and more intelligent. When using local models, it’s important to find the right balance for your task, because speed impacts your productivity and intelligence impacts your results. You should try out different models and choose the smallest one that gives you good results.

Small models are sufficient for many common tasks such as categorizing text or extracting person names from news articles. Medium models are appropriate for more complex tasks such as document review, survey analysis, or tasks involving function calling. Large models are useful for creative writing, tasks requiring nuanced language understanding such as spam detection, or tasks requiring world knowledge. Models larger than 32B require significant hardware investment to run locally, and you are better off using Hosted Models if you need this kind of intelligence and don’t have the hardware already.
Large models are needed to use the Internet Browser tool effectively.
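For example, a quick way to see whether a small model is good enough for a task like text categorization is to try it on a few rows of your own data. This is a minimal sketch using the single-string form of PROMPT shown in the Ollama walkthrough below; A1 is just a placeholder for a cell containing the text you want to classify:
Categorize text with a small model
=PROMPT("Categorize the following product review as Positive, Negative, or Neutral. Reply with one word only: " & A1)
Copy the formula down a column and spot-check the results before committing to a model.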

Run models locally

You need to run a program on your computer that serves models to Cellm. We call these programs “providers”. Cellm supports Ollama, Llamafiles, and vLLM, as well as any OpenAI-compatible provider. If you don’t know any of these names, just use Ollama.

Ollama

To get started with Ollama, we recommend you try out the Gemma 3 4B IT QAT model, which is Cellm’s default local model.
1. Install Ollama

Download and install Ollama from https://ollama.com. Ollama starts after the install and runs automatically whenever you start your computer.
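If you want to verify the install, you can open Windows Terminal and print Ollama’s version:
Check the install
ollama --version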
2. Download the model

Open Windows Terminal (open the start menu, type Windows Terminal, and press Enter), then run:
Download Gemma 3 4B QAT
ollama pull gemma3:4b-it-qat
Wait for the download to finish.
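Optionally, you can confirm the download by listing the models Ollama has available locally; gemma3:4b-it-qat should appear in the output:
List downloaded models
ollama list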
3. Test in Excel

In Excel, select ollama/gemma3:4b-it-qat from the model dropdown menu, and type:
Test prompt
=PROMPT("Which model are you and who made you?")
The model will tell you that it is called “Gemma” and made by Google DeepMind.
You can use any model that Ollama supports. See https://ollama.com/search for a complete list.
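For example, to try a medium-tier model such as Mistral Small, you can pull it from the Ollama library. The tag below is the one used in the Docker section later; see https://ollama.com/search for current tags:
Pull a medium model
ollama pull mistral-small3.1:24b
After the download finishes, select it from the model dropdown in Excel the same way as above, e.g. ollama/mistral-small3.1:24b.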

Llamafile

Llamafile is a project by Mozilla that combines llama.cpp with Cosmopolitan Libc into a single-file executable (called a “llamafile”) that you can download and run locally on most computers with no installation.
1. Download a llamafile

Download a llamafile for the model you want to run, for example google_gemma-3-4b-it-Q6_K.llamafile.

2. Rename the file

Append .exe to the filename. For example, google_gemma-3-4b-it-Q6_K.llamafile should be renamed to google_gemma-3-4b-it-Q6_K.llamafile.exe.
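If you prefer doing the rename from a terminal, PowerShell’s Rename-Item does the same thing (assuming the file is in your current directory):
Rename with PowerShell
Rename-Item .\google_gemma-3-4b-it-Q6_K.llamafile google_gemma-3-4b-it-Q6_K.llamafile.exe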
3. Run the llamafile

Open Windows Terminal (open the start menu, type Windows Terminal, and press Enter) and run:
CPU only
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --v2
To offload inference to your NVIDIA or AMD GPU, run:
With GPU
.\google_gemma-3-4b-it-Q6_K.llamafile.exe --server --v2 -ngl 999
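Before moving on, you can optionally check that the server is listening. This sketch assumes the llamafile’s built-in server exposes the OpenAI-compatible /v1/models endpoint on port 8080; curl.exe ships with Windows 10 and later:
Check the server
curl.exe http://localhost:8080/v1/models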
4. Configure Cellm

Start Excel and select the OpenAiCompatible provider from the model drop-down on Cellm’s ribbon menu. Enter any model name, e.g., “gemma”; llamafiles ignore the model name since each llamafile serves only one model, but a name is required by the OpenAI API. Set the Base Address to http://localhost:8080.
Llamafiles are especially useful if you don’t have the necessary permissions to install programs on your computer.

Dockerized Ollama

If you prefer to run models via Docker, both Ollama and vLLM are packaged up with Docker Compose files in the docker/ folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.
1. Clone the repository

Clone repo
git clone https://github.com/getcellm/cellm
2. Start Ollama container

Run the following command in the docker/ directory:
Start container
docker compose -f docker-compose.Ollama.yml up --detach
To use your GPU for faster inference:
Start with GPU
docker compose -f docker-compose.Ollama.yml -f docker-compose.Ollama.GPU.yml up --detach
To stop the container:
Stop container
docker compose -f docker-compose.Ollama.yml down
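Once the container is running, you can check that Ollama is reachable on the port Cellm will use (11434); Ollama answers plain HTTP requests on its API port with a short status message:
Check that Ollama is reachable
docker compose -f docker-compose.Ollama.yml ps
curl.exe http://localhost:11434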
3. Configure Cellm

Start Excel and select the OpenAiCompatible provider from the model drop-down on Cellm’s ribbon menu. Enter the model name you want to use, e.g., gemma3:4b-it-qat. Set the Base Address to http://localhost:11434.
To use other Ollama models, download another supported model by running, e.g., ollama pull mistral-small3.1:24b in the container.
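Because Ollama runs inside the container, the pull has to be executed there as well. A minimal sketch, assuming the service in docker-compose.Ollama.yml is named ollama; check the compose file for the actual service name:
Pull a model inside the container
docker compose -f docker-compose.Ollama.yml exec ollama ollama pull mistral-small3.1:24b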

Dockerized vLLM

If you want to speed up processing of many parallel requests, you can use vLLM instead of Ollama. vLLM requires a Hugging Face API key to download models from the Hugging Face Hub.
1. Set up Hugging Face API key

You must supply the docker compose file with a Hugging Face API key, either via an environment variable or by editing the docker compose file directly. Look at the vLLM docker compose file for details. If you don’t know what a Hugging Face API key is, just use Ollama instead.
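For the environment-variable route, this is roughly what it looks like in PowerShell; the variable name HF_TOKEN is an assumption here, so use whichever name the vLLM docker compose file actually references:
Set the Hugging Face API key
# Assumed variable name; check docker-compose.vLLM.GPU.yml for the one it reads.
$env:HF_TOKEN = "hf_your_token_here"
The variable only lives for the current terminal session, so set it in the same Windows Terminal window you start the container from.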
2. Start vLLM container

Start vLLM
docker compose -f docker-compose.vLLM.GPU.yml up --detach
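Once the container is up, you can optionally confirm that vLLM is serving its OpenAI-compatible API. This assumes the compose file maps vLLM’s default port 8000 to the host; check docker-compose.vLLM.GPU.yml for the actual port mapping:
Check vLLM
curl.exe http://localhost:8000/v1/models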
To use other vLLM models, change the --model argument in the docker compose file to another Hugging Face model.
Open WebUI is included in both Ollama and vLLM docker compose files so you can test the local model outside of Cellm. Open WebUI is available at http://localhost:3000.