Run large language models on your own device with Ollama

2024-04-23 · 6 min read

I use ChatGPT every day. Whenever I find myself doing routine tasks like writing, rewriting, or brainstorming, I go to ChatGPT for assistance, treating it like my own personal secretary (and it's free!).

Yet, I'm quite particular about my privacy. Even with training disabled, I still hesitate to share my information online. I simply prefer to keep my thoughts to myself!

Then, one day, I discovered a solution: using Ollama to run AI models directly on my own device, offline. Sounds good, so let's try it.

What's Ollama?

Ollama (GitHub) is built on top of llama.cpp. You can use it to run Llama 2, Mistral, Gemma, and other large language models locally.

You can start with just your computer's CPU and RAM for smaller models, like the 7B ones. But if you want faster responses and smoother performance, you'll want a powerful GPU, such as an NVIDIA GeForce RTX 40-series card, with at least 8 GB of VRAM.
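
Not sure how much VRAM your card has? On an NVIDIA GPU you can check with:

nvidia-smi

The output shows your GPU model along with its total and currently used memory.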

Install Ollama

I'll use my Arch Linux machine as the example here. If you're using Windows, you can download and install it from the official website.

In your Linux environment, run this command in a terminal:

curl -fsSL https://ollama.com/install.sh | sh

It'll download and install Ollama.
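
To confirm everything went well, you can check the installed version and list your models (the list will be empty for now):

ollama --version
ollama list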

Add AI model(s)

After you're done with the installation, feel free to add any AI models you like.

From the official website, you can see a list of AI models you can download. Let's use Mistral as an example. You can simply download it via:

ollama run mistral:7b

Then it'll start pulling mistral to your machine.

Just a heads-up: it's best to use Wi-Fi instead of mobile data for this, because these AI models are quite large, starting at around 4 GB and going up from there!
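
By the way, ollama run both pulls the model and starts an interactive chat with it. A few other handy commands for managing models:

ollama list              # show the models you've downloaded
ollama pull mistral:7b   # download (or update) a model without starting a chat
ollama rm mistral:7b     # remove a model you no longer need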

What are family, parameters and quantization?

While you wait, let's look at the model page.

In the model field, it shows family llama · parameters 7B · quantization 4-bit. What exactly do these terms mean?

Family

Llama is an open-source large language model from Meta. When it says family: llama, it means this model is based on Llama. If you're interested, you can learn more about it on Meta's official website.

Parameters

Llama comes in various sizes: 7B, 13B, 33B, and 65B. Each size is trained on a different number of tokens.

We trained LLaMA 65B and LLaMA 33B on 1.4 trillion tokens. Our smallest model, LLaMA 7B, is trained on one trillion tokens.

From Introducing LLaMA: A foundational, 65-billion-parameter language model - Meta

Based on my personal experience, if your tasks aren't too complex (like brainstorming names or Easter holiday ideas), 7B should do the trick just fine.

However, if you need more detailed and lengthy responses, 13B and 33B are better options. Just expect slower responses if your device isn't powerful enough to run models of that size.
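
To try a bigger model, you pick it by its tag on the Ollama library. For example (assuming these tags are still listed there):

ollama run llama2:7b
ollama run llama2:13b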

Quantization

Now things become more complicated!

Quantization, in the simplest terms, means keeping the most important information and throwing away trivial detail. Some data loss happens, but the model retains the most important information at a much smaller size.

Think of it this way: imagine you have a 100,000-word novel in your hands, representing the original AI model. It's weighty, overflowing with words, and requires time to digest.

Now, imagine quantization as creating a summary. Different levels of quantization are like summaries of your novel at different lengths.

They might lose some (or a lot) of the finer details, but the main storyline is still there. Essentially, you get to know the main story in a fraction of the time, with summaries of 10,000 or 50,000 words compared to the original one!

For a deeper dive into quantization, I highly recommend this article: Fitting AI models in your pocket with quantization.

Also, here's a brief explanation from quantize --help (the quantization tool in llama.cpp):

Allowed quantization types:
   2  or  Q4_0   :  3.50G, +0.2499 ppl @ 7B - small, very high quality loss - legacy, prefer using Q3_K_M
   3  or  Q4_1   :  3.90G, +0.1846 ppl @ 7B - small, substantial quality loss - legacy, prefer using Q3_K_L
   8  or  Q5_0   :  4.30G, +0.0796 ppl @ 7B - medium, balanced quality - legacy, prefer using Q4_K_M
   9  or  Q5_1   :  4.70G, +0.0415 ppl @ 7B - medium, low quality loss - legacy, prefer using Q5_K_M
  10  or  Q2_K   :  2.67G, +0.8698 ppl @ 7B - smallest, extreme quality loss - not recommended
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5505 ppl @ 7B - very small, very high quality loss
  12  or  Q3_K_M :  3.06G, +0.2437 ppl @ 7B - very small, very high quality loss
  13  or  Q3_K_L :  3.35G, +0.1803 ppl @ 7B - small, substantial quality loss
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.56G, +0.1149 ppl @ 7B - small, significant quality loss
  15  or  Q4_K_M :  3.80G, +0.0535 ppl @ 7B - medium, balanced quality - *recommended*
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0353 ppl @ 7B - large, low quality loss - *recommended*
  17  or  Q5_K_M :  4.45G, +0.0142 ppl @ 7B - large, very low quality loss - *recommended*
  18  or  Q6_K   :  5.15G, +0.0044 ppl @ 7B - very large, extremely low quality loss
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ 7B - very large, extremely low quality loss - not recommended
   1  or  F16    : 13.00G              @ 7B - extremely large, virtually no quality loss - not recommended
   0  or  F32    : 26.00G              @ 7B - absolutely huge, lossless - not recommended

In my opinion, Q4_K_M or Q5_K_M are decent choices.
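
On the Ollama model pages, each quantization has its own tag, so you can pick one explicitly instead of taking the default. For example, the 13B Llama 2 chat model at Q5_K_M:

ollama run llama2:13b-chat-q5_K_M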

Size makes a difference. When an AI model is smaller than your VRAM (for instance, a 4 GB 7B model on a card with 12 GB of VRAM), you can expect lightning-fast results. This is because your machine can load the entire model into VRAM.
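
If you want to check where a loaded model ended up, newer versions of Ollama include an ollama ps command; its PROCESSOR column shows whether the model runs fully on the GPU or is partly offloaded to the CPU:

ollama ps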

UI

You can use Ollama from the command line, and there are also open-source web interfaces available! I'm a fan of open-webui for its straightforward, ChatGPT-like interface.

Running open-webui on your system is simple. Just execute a Docker command:

docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main

When it's done, you can access the UI from http://localhost:8080.
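
If you'd rather not use host networking, the open-webui documentation also shows a port-mapped variant; with that one, the UI is served at http://localhost:3000 instead:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main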

Performance

I can share some performance insights from my local machine setup:

Type          Item
CPU           Intel Core i5-12400 2.5 GHz 6-Core Processor
Motherboard   MSI B760 GAMING PLUS WIFI ATX LGA1700 Motherboard
Memory        Corsair Vengeance 32 GB (2 x 16 GB) DDR5-4800 CL40 Memory
Video Card    MSI GeForce RTX 3060 Ventus 2X 12G (12 GB)

With 12 GB of VRAM, it's extremely fast with a 7B model (Q5_K_M), slower with a 13B model (Q4_K_M), and even slower with a 34B model at only 3.07 tokens/s, but that's still tolerable.

As for the 70B model, I haven't dared to test it yet. I suspect it would be extremely slow, or it might even throw errors!

Type                               Speed (with RTX 3060 12GB)
7B - deepseek-llm:7b-chat-q5_K_M   51.3 tokens/s
13B - llama2:13b-chat-q5_K_M       8.47 tokens/s
34B - yi:34b-chat-q3_K_S           3.07 tokens/s
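
If you want to measure this on your own machine, ollama run has a --verbose flag that prints timing statistics after each response, including the eval rate in tokens per second:

ollama run mistral:7b --verbose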

Thoughts

While experimenting with various models, I realised that the default prompt from open-webui is an effective way to assess AI models.

The prompt is:

Help me study vocabulary: write a sentence for me to fill in the blank, and I'll try to pick the correct option.

Some models provided both the question and the answer at the same time, which is meaningless for studying. Others gave you a question, but the options just didn't make sense (e.g. all of them were valid).

This really shows how well AI models can follow instructions; smaller ones like 7B tend to struggle a bit.

Different models excel in different areas too. Some are strong at multilingual tasks, while others are not. That isn't a bad thing; it means you can pick a different model for each task, based on what it does best.

Forecast

It's fun to work with local AI models! You can tweak the number of tokens, the system prompt and other settings to suit your preferences in open-webui too.
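
If you'd rather bake those settings into the model itself, Ollama lets you do that with a Modelfile. Here's a minimal sketch; the model name my-mistral, the parameter values and the system prompt are just examples, so adjust them to taste:

# Modelfile: start from a model you've already pulled
FROM mistral:7b

# example generation settings
PARAMETER temperature 0.8
PARAMETER num_ctx 4096

# a custom system prompt
SYSTEM You are a concise assistant that answers in short bullet points.

Then build and run your customised model:

ollama create my-mistral -f Modelfile
ollama run my-mistral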

One advantage of local AI models is the freedom to access uncensored content. Need to brainstorm some horror stories with violence and sacrifices? No problem!

Plus, running AI locally on devices like laptops or phones ensures privacy. You're not sharing your data with any big corporation.

Of course, local models might not beat ChatGPT, but for everyday tasks, they're more than capable.

AI development is advancing rapidly. Imagine a future where everyone has their own AI buddy helping out with everyday tasks. Sounds pretty awesome, doesn't it?



Written by Yuki Cheung

Hey there, I am Yuki! I'm a front-end software engineer who's passionate about both coding and gaming.

When I'm not crafting code, you'll often find me immersed in the worlds of Fire Emblem and The Legend of Zelda—those series are my absolute favorites! Lately, I've been delving into the fascinating realm of AI-related topics too.