Run Local LLMs: A Comprehensive Guide to Getting Started

No LLMs were harmed during the creation of this post.

So, I Started Playing with LLMs…

…and it’s surprisingly engaging. Initially, I shared the skepticism surrounding the AI/LLM “boom.” Like many, I suspected they were mostly fabricating information and generating uncanny, nonsensical outputs. I couldn’t have been more mistaken. My limited experiences with ChatGPT, primarily for initial exploration, left a positive impression despite some minor hallucinations. This was during the era of GPT-3.5. The advancements since then have been remarkable.

However, despite ChatGPT’s capabilities, a lingering skepticism remained. Every input and output was accessible to OpenAI or any provider I might choose. While not inherently problematic, it felt unsettling and restricted the use of LLMs for any confidential or non-open-source projects related to my work. Furthermore, ChatGPT’s free access is limited; deeper engagement would likely necessitate a paid subscription, something I’d prefer to avoid.

This led me to investigate open-source models. Initially, I was unsure how to utilize them. Seeing the size of “smaller” models, like Llama 2 7B, I assumed my RTX 2070 Super with its 8GB of VRAM would struggle (another incorrect assumption!). Running them on the CPU seemed likely to result in unacceptable performance. Consequently, I upgraded to an RX 7900 XT, boasting 20GB of VRAM – more than sufficient for running small to medium-sized LLMs. Exciting!

The next challenge was finding software to run an LLM on this new GPU. CUDA was the prevalent backend, but it’s designed for NVIDIA GPUs, not AMD. After some research, I discovered ROCm and, more importantly, LM Studio. This software was precisely what I needed, at least for a starting point. Its user-friendly interface, easy access to numerous models, and especially quantization, completely convinced me of the viability of self-hosting LLMs. The existence of quantization highlighted that powerful hardware isn’t a prerequisite for running LLMs. You can even run LLMs on Raspberry Pis now (using llama.cpp as well!). Of course, performance will be significantly reduced without appropriate hardware and backend optimization, but the barrier to entry is now remarkably low.

If you’re seeking software to easily run popular models on most modern hardware for non-commercial use, LM Studio fits the bill perfectly. Review the following section for important disclaimers, then dive in; just ensure you select the right backend for your GPU/CPU for optimal performance.

However, if you are interested in:

  • Delving deeper into llama.cpp (LM Studio’s backend) and LLMs generally.
  • Utilizing LLMs for commercial ventures (LM Studio’s terms restrict commercial use).
  • Running LLMs on less common hardware (LM Studio only supports popular backends).
  • Avoiding closed-source software (LM Studio is closed-source) and preferring self-built and fully trusted solutions.
  • Accessing the latest features and models as soon as they are available.

Then the rest of this guide will be highly beneficial!

But First – Some Disclaimers for Expectation Management

Before we proceed, let’s address some key questions I wished I had answers to before embarking on my self-hosted LLM journey.

Do I Need an RTX 2070 Super / RX 7900 XT or Similar High-End GPU?

No, you absolutely do not. I will elaborate further, but LLMs can even run without a dedicated GPU. If you possess reasonably modern hardware (at least a decent CPU with AVX support), you are compatible. However, remember that performance will vary.

What Performance Can I Expect?

This is a complex question without a simple answer. Text generation speed depends on several factors, primarily:

  • Matrix operation performance of your hardware.
  • Memory bandwidth.
  • Model size.

These aspects will be detailed later, but generally, you can achieve reasonable performance by selecting a model that aligns with your hardware capabilities. If you intend to use a GPU with sufficient memory for the model and its context, expect real-time text generation. If relying on both GPU and CPU or CPU alone, anticipate slower performance, although real-time generation is still possible with smaller models.
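
As a rough rule of thumb (not a guarantee), text generation tends to be memory-bound: producing each token requires streaming roughly the whole model through memory once, so tokens per second is capped at about memory bandwidth divided by model size. The sketch below uses illustrative numbers of my choosing, not measurements:

# Back-of-envelope ceiling for generation speed on memory-bound hardware.
# The 1.7 GB / 50 GB/s figures are example values, not benchmarks.
def tokens_per_second_ceiling(model_size_gb, memory_bandwidth_gb_s):
    # each generated token reads (approximately) every model weight once
    return memory_bandwidth_gb_s / model_size_gb

print(tokens_per_second_ceiling(1.7, 50))  # ~29 t/s for a 1.7 GB model on ~50 GB/s dual-channel DDR4

The CPU-only llama-bench results later in this post (around 22 tokens/s for a ~1.7GB model on dual-channel DDR4-3200) sit comfortably under that ceiling, which is why memory bandwidth appears in the list above.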

What Quality of Responses Should I Expect?

Response quality significantly depends on your specific use case and the chosen model. A direct answer is difficult. Experimentation is key to finding what works best for you. A general guideline is: “larger models tend to produce better responses.” Consider that state-of-the-art models like GPT-4 or Claude are typically measured in hundreds of billions of parameters. Unless you have multiple GPUs or an excessive amount of RAM and patience, you will likely be limited to models with fewer than 20 billion parameters. In my experience, 7-8B parameter models are quite effective for general tasks and programming, and while they are not as advanced as models like GPT-4o or Claude in terms of raw response quality, the gap is noticeable but not vast. Remember, model selection is only part of the equation. Providing appropriate context, system prompts, or fine-tuning LLMs can dramatically improve results.

Can I Replace ChatGPT/Claude/[Online LLM Provider] with This?

Potentially. Theoretically, yes. Practically, it depends on your toolkit. llama.cpp offers an OpenAI-compatible server. If your tools communicate with LLMs via the OpenAI API and allow custom endpoint settings, you can use a self-hosted LLM with them.
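
For example, most OpenAI-compatible tools only need a base URL override. Here is a minimal sketch using the official openai Python package, assuming it is installed (pip install openai) and that a llama-server instance is running on the default 127.0.0.1:8080, as set up later in this guide:

# Point an OpenAI-style client at a local llama-server instance.
# Assumptions: the openai package is installed, llama-server runs on 127.0.0.1:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",   # llama-server's OpenAI-compatible API lives under /v1
    api_key="sk-no-key-required",          # ignored unless the server is started with --api-key
)

response = client.chat.completions.create(
    model="SmolLM2",  # informational here; the server answers with whatever GGUF it has loaded
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(response.choices[0].message.content)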

Prerequisites

  • Reasonably Modern CPU: Any Ryzen CPU or Intel 8th generation or newer should be sufficient. Older hardware may also work, but performance will be affected.
  • Optimal: Dedicated GPU: More VRAM is always better. At least 8GB of VRAM is recommended to comfortably run 7-8B models, which is a reasonable minimum. Vendor is not critical; llama.cpp supports NVIDIA, AMD, and Apple GPUs (Intel support is also available, potentially via Vulkan).
  • Alternative: Sufficient RAM: If you lack a GPU or VRAM, you will need RAM to accommodate the model. Similar to VRAM, at least 8GB of free RAM is recommended, with more being preferable. Note that when utilizing only the GPU with llama.cpp, RAM usage is minimal.

This guide assumes you are using either Windows or Linux. Mac users should follow Linux instructions and consult llama.cpp documentation for macOS-specific guidance.

Context-specific formatting is used throughout this guide:

Sections with a gray background indicate Windows-specific instructions. These sections may be more extensive than Linux counterparts due to the complexities of Windows. Linux is generally preferred for simplicity. Detailed step-by-step instructions are provided for Windows, but consider Linux if you encounter issues.

Sections with a light gray background highlight Linux-specific instructions.

Building llama.cpp

Detailed build instructions for all supported platforms are available in docs/build.md. By default, llama.cpp builds with automatic CPU support detection. We will explore enabling GPU and advanced CPU support later. For now, let’s build it as is; it’s a good starting point and requires no external dependencies. All you need is a C++ toolchain, CMake, and Ninja.

For the very impatient, pre-built releases are available on GitHub, allowing you to skip the build process. Ensure you download the correct version for your hardware and backend. If unsure, following the build guide is recommended as it clarifies these choices. Note that pre-built releases may lack Python scripts needed for manual model quantization, which are available in the repository if needed.

On Windows, MSYS is recommended for setting up the build environment. Microsoft Visual C++ is also supported, but MSYS is generally easier to manage. Install MinGW for the x64 UCRT environment as per the MSYS homepage instructions. CMake, Ninja, and Git can be installed within the UCRT MSYS environment using:

pacman -S git mingw-w64-ucrt-x86_64-cmake mingw-w64-ucrt-x86_64-ninja

For other toolchains (MSVC or non-MSYS), install CMake, Git, and Ninja via winget:

winget install cmake git.git ninja-build.ninja

Python is also required, obtainable via winget. Version 3.12 is recommended as 3.13 may have PyTorch compatibility issues.

IMPORTANT: DO NOT USE PYTHON FROM MSYS! It will cause issues with building llama.cpp dependencies. MSYS is solely for building llama.cpp.

If using MSYS, add its /bin directory (C:\msys64\ucrt64\bin by default) to your system PATH so Python can access MinGW for package building. Verify GCC availability by running gcc --version in PowerShell/Command Prompt. Check the correct GCC is being used by running where.exe gcc.exe and reorder your PATH if necessary.

If using MSVC, this disclaimer is not relevant as it should be automatically detected.

winget install python.python.3.12

Update pip, setuptools, and wheel packages before continuing:

python -m pip install --upgrade pip wheel setuptools

On Linux, GCC is recommended, though Clang can be used by setting CMAKE_C_COMPILER=clang and CMAKE_CXX_COMPILER=clang++. GCC should be pre-installed (verify with gcc --version in the terminal). If not, install the latest version for your distribution using your package manager. The same applies to CMake, Ninja, Python 3 (with setuptools, wheel, and pip), and Git.

Begin by cloning the llama.cpp source code and navigating into the directory.

This guide assumes commands are run from the user’s home directory (/home/[yourusername] on Linux, C:/Users/[yourusername] on Windows). You can use any directory, but commands assume you are starting from your home directory unless otherwise specified.

MSYS users: MSYS home directory differs from the Windows home directory. Use cd (no arguments) after starting MSYS to move to the MSYS home directory.

(If you have SSH authentication configured with GitHub, use git@github.com:ggerganov/llama.cpp.git instead of the HTTPS URL below)

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git submodule update --init --recursive

Use CMake to generate build files, build the project, and install it. Run the following command to generate build files in the build/ subdirectory:

cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/your/install/dir -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON

CMake variable explanations:

  • CMAKE_BUILD_TYPE=Release: Optimizes for performance.
  • CMAKE_INSTALL_PREFIX: Specifies the installation directory for llama.cpp binaries and Python scripts. Replace /your/install/dir or remove to use the default.
    • Windows default: c:/Program Files/llama.cpp. Requires admin privileges for installation and adding bin/ to PATH for system-wide access. $env:LOCALAPPDATA/llama.cpp (C:/Users/[yourusername]/AppData/Local/llama.cpp) is recommended for user-level installation without admin rights.
    • Linux default: /usr/local. Superuser permissions are needed for installation. Change to a user directory and add its bin/ to PATH if you lack permissions.
  • LLAMA_BUILD_TESTS=OFF: Skips building tests for faster compilation.
  • LLAMA_BUILD_EXAMPLES=ON: Builds example binaries, which are needed.
  • LLAMA_BUILD_SERVER=ON: Builds the server binary, also needed. Note: LLAMA_BUILD_EXAMPLES must be ON to build the server.

If you omit these variables, their defaults are used (all LLAMA_BUILD_* options are ON by default). CMAKE_BUILD_TYPE is usually Release by default, but setting it explicitly is recommended. CMAKE_INSTALL_PREFIX is optional if you don’t need to install and are okay with adding the build directory’s /bin to your PATH.

Build the project. Replace X with your CPU core count for faster compilation. Ninja should automatically use all cores, but manual specification is often preferred.

cmake --build build --config Release -j X

Building should take a few minutes. Install the binaries:

cmake --install build --config Release

The CMAKE_INSTALL_PREFIX/bin directory should now contain executables and Python scripts:

/c/Users/phoen/llama-build/bin
❯ l
Mode Size Date Modified Name
-a--- 203k 7 Nov 16:14 convert_hf_to_gguf.py
-a--- 3.9M 7 Nov 16:18 llama-batched-bench.exe
-a--- 3.9M 7 Nov 16:18 llama-batched.exe
-a--- 3.4M 7 Nov 16:18 llama-bench.exe
-a--- 3.9M 7 Nov 16:18 llama-cli.exe
-a--- 3.2M 7 Nov 16:18 llama-convert-llama2c-to-ggml.exe
-a--- 3.9M 7 Nov 16:18 llama-cvector-generator.exe
-a--- 3.9M 7 Nov 16:18 llama-embedding.exe
-a--- 3.9M 7 Nov 16:18 llama-eval-callback.exe
-a--- 3.9M 7 Nov 16:18 llama-export-lora.exe
-a--- 3.0M 7 Nov 16:18 llama-gbnf-validator.exe
-a--- 1.2M 7 Nov 16:18 llama-gguf-hash.exe
-a--- 3.0M 7 Nov 16:18 llama-gguf-split.exe
-a--- 1.1M 7 Nov 16:18 llama-gguf.exe
-a--- 3.9M 7 Nov 16:18 llama-gritlm.exe
-a--- 3.9M 7 Nov 16:18 llama-imatrix.exe
-a--- 3.9M 7 Nov 16:18 llama-infill.exe
-a--- 4.2M 7 Nov 16:18 llama-llava-cli.exe
-a--- 3.9M 7 Nov 16:18 llama-lookahead.exe
-a--- 3.9M 7 Nov 16:18 llama-lookup-create.exe
-a--- 1.2M 7 Nov 16:18 llama-lookup-merge.exe
-a--- 3.9M 7 Nov 16:18 llama-lookup-stats.exe
-a--- 3.9M 7 Nov 16:18 llama-lookup.exe
-a--- 4.1M 7 Nov 16:18 llama-minicpmv-cli.exe
-a--- 3.9M 7 Nov 16:18 llama-parallel.exe
-a--- 3.9M 7 Nov 16:18 llama-passkey.exe
-a--- 4.0M 7 Nov 16:18 llama-perplexity.exe
-a--- 3.0M 7 Nov 16:18 llama-quantize-stats.exe
-a--- 3.2M 7 Nov 16:18 llama-quantize.exe
-a--- 3.9M 7 Nov 16:18 llama-retrieval.exe
-a--- 3.9M 7 Nov 16:18 llama-save-load-state.exe
-a--- 5.0M 7 Nov 16:19 llama-server.exe
-a--- 3.0M 7 Nov 16:18 llama-simple-chat.exe
-a--- 3.0M 7 Nov 16:18 llama-simple.exe
-a--- 3.9M 7 Nov 16:18 llama-speculative.exe
-a--- 3.1M 7 Nov 16:18 llama-tokenize.exe

Don’t be intimidated by the number of executables; we will only use a few. Test the build by running llama-cli --help. We can’t do much yet without a model.

Acquiring a Model

HuggingFace is the primary resource for LLMs, datasets, and other AI-related resources.

We will use SmolLM2, a recent model series from HuggingFace (November 1, 2024). Its small size is ideal for this guide. The largest model has 1.7 billion parameters, requiring approximately 4GB of system memory in unquantized form (excluding context). 360M and 135M variants are also available for even lower resource environments like Raspberry Pi or smartphones.

llama.cpp cannot directly run “raw” models, which are typically provided in .safetensors format. llama.cpp requires models in .gguf format. Fortunately, llama.cpp includes convert_hf_to_gguf.py to convert .safetensors to .gguf. Some creators provide .gguf files directly, such as some SmolLM2 variants from HuggingFace. Community uploads of .gguf models are also available. However, we will focus on self-quantization to explore customization and avoid multiple downloads.

Download the contents of the SmolLM2 1.7B Instruct repository (or the 360M Instruct or 135M Instruct versions if you have limited resources, or any transformers-compatible model), omitting LFS files initially. We only need one file, which we will download manually.

Why Instruct models? “Instruct” models are specifically trained for conversational interactions. Base models are trained for text completion and are typically used as foundations for further training. This is a common pattern for many LLMs, but always verify the model description before use.

Using Bash/ZSH (or MSYS Bash):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct

Using PowerShell:

$env:GIT_LFS_SKIP_SMUDGE=1
git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct

Using cmd.exe (VS Development Prompt):

set GIT_LFS_SKIP_SMUDGE=1
git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct

HuggingFace also supports Git over SSH. The git clone command is accessible on each repo page:

Finding the Git Clone Button on HuggingFace.

After cloning, manually download the model.safetensors file from HuggingFace. GIT_LFS_SKIP_SMUDGE was used to avoid downloading all large model files initially, as manual download of large files is generally faster than Git LFS.
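
If you prefer to script the download instead of grabbing the file through the browser, the huggingface_hub Python package can fetch that single file as well. This is an optional alternative and assumes huggingface_hub is installed (pip install huggingface_hub):

# Fetch only model.safetensors into the repository directory cloned above.
# Assumption: the huggingface_hub package is installed.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    filename="model.safetensors",
    local_dir="SmolLM2-1.7B-Instruct",  # the directory created by git clone
)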

After downloading, your local SmolLM repo should resemble:

PS D:\LLMs\repos\SmolLM2-1.7B-Instruct> l
Mode Size Date Modified Name
-a--- 806 2 Nov 15:16 all_results.json
-a--- 888 2 Nov 15:16 config.json
-a--- 602 2 Nov 15:16 eval_results.json
-a--- 139 2 Nov 15:16 generation_config.json
-a--- 515k 2 Nov 15:16 merges.txt
-a--- 3.4G 2 Nov 15:34 model.safetensors
d---- - 2 Nov 15:16 onnx
-a--- 11k 2 Nov 15:16 README.md
d---- - 2 Nov 15:16 runs
-a--- 689 2 Nov 15:16 special_tokens_map.json
-a--- 2.2M 2 Nov 15:16 tokenizer.json
-a--- 3.9k 2 Nov 15:16 tokenizer_config.json
-a--- 240 2 Nov 15:16 train_results.json
-a--- 89k 2 Nov 15:16 trainer_state.json
-a--- 129 2 Nov 15:16 training_args.bin
-a--- 801k 2 Nov 15:16 vocab.json

We will use these four files:

  • config.json: Model configuration/metadata.
  • model.safetensors: Model weights.
  • tokenizer.json: Tokenizer data (text token to ID mapping). Sometimes in tokenizer.model.
  • tokenizer_config.json: Tokenizer configuration (special tokens, chat template).

This sentence is included as an anti-plagiarism marker. If you are not reading this on steelph0enix.github.io, this article has been copied without authorization.

Converting HuggingFace Model to GGUF

To convert the raw model to llama.cpp-compatible GGUF format, we use convert_hf_to_gguf.py from llama.cpp. A Python virtual environment is recommended for managing dependencies, ideally created outside the llama.cpp repository, such as in your home directory.

Linux virtual environment creation:

python -m venv ~/llama-cpp-venv

PowerShell equivalent:

python -m venv $env:USERPROFILE/llama-cpp-venv

cmd.exe equivalent:

python -m venv %USERPROFILE%/llama-cpp-venv

Activate the virtual environment.

Linux activation:

source ~/llama-cpp-venv/bin/activate

PowerShell activation:

. $env:USERPROFILE/llama-cpp-venv/Scripts/Activate.ps1

cmd.exe activation:

call %USERPROFILE%/llama-cpp-venv/Scripts/activate.bat

Update core packages within the virtual environment:

python -m pip install --upgrade pip wheel setuptools

Install dependencies for llama.cpp scripts. Refer to the requirements/ directory in your llama.cpp repository.

❯ l llama.cpp/requirements
Mode Size Date Modified Name
-a--- 428 11 Nov 13:57 requirements-all.txt
-a--- 34 11 Nov 13:57 requirements-compare-llama-bench.txt
-a--- 111 11 Nov 13:57 requirements-convert_hf_to_gguf.txt
-a--- 111 11 Nov 13:57 requirements-convert_hf_to_gguf_update.txt
-a--- 99 11 Nov 13:57 requirements-convert_legacy_llama.txt
-a--- 43 11 Nov 13:57 requirements-convert_llama_ggml_to_gguf.txt
-a--- 96 11 Nov 13:57 requirements-convert_lora_to_gguf.txt
-a--- 48 11 Nov 13:57 requirements-pydantic.txt
-a--- 13 11 Nov 13:57 requirements-test-tokenizer-random.txt

Install dependencies for the conversion script:

python -m pip install --upgrade -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt

If pip fails, ensure a working C/C++ toolchain is in your PATH.

If using MSYS for this, revert to PowerShell/cmd, install Python via winget, and repeat the setup. Python dependencies may incorrectly detect the platform on MSYS, leading to build issues.

Use the script to create the GGUF model file. Start by checking the script’s help options:

python llama.cpp/convert_hf_to_gguf.py --help

If help is displayed, proceed. Otherwise, ensure the virtual environment is active and dependencies are correctly installed. To convert your model, specify the path to the model repository and optionally the output file path. We will create a floating-point GGUF file for maximum quantization flexibility later using llama-quantize, which offers more options than the conversion script.

Convert SmolLM2 to GGUF (replace SmolLM2-1.7B-Instruct with your actual path):

python llama.cpp/convert_hf_to_gguf.py SmolLM2-1.7B-Instruct --outfile ./SmolLM2.gguf

Successful conversion will produce output similar to:

INFO:hf-to-gguf:Loading model: SmolLM2-1.7B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {2048, 49152}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {2048}
...
INFO:hf-to-gguf:blk.9.attn_q.weight, torch.bfloat16 --> F16, shape = {2048, 2048}
INFO:hf-to-gguf:blk.9.attn_v.weight, torch.bfloat16 --> F16, shape = {2048, 2048}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 32
INFO:hf-to-gguf:gguf: rope theta = 130000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 48900 merge(s).
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 2
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:SmolLM2.gguf: n_tensors = 218, total_size = 3.4G
Writing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.42G/3.42G [00:15<00:00, 215Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to SmolLM2.gguf

Quantizing the Model

Now we can quantize our model using the llama-quantize executable built earlier. Check available quantization types:

llama-quantize --help

Current quantization types from llama-quantize --help:

usage: llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

...

Allowed quantization types:
 2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
 3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
 8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
 9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
 19 or IQ2_XXS : 2.06 bpw quantization
 20 or IQ2_XS : 2.31 bpw quantization
 28 or IQ2_S : 2.5 bpw quantization
 29 or IQ2_M : 2.7 bpw quantization
 24 or IQ1_S : 1.56 bpw quantization
 31 or IQ1_M : 1.75 bpw quantization
 36 or TQ1_0 : 1.69 bpw ternarization
 37 or TQ2_0 : 2.06 bpw ternarization
 10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
 21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
 23 or IQ3_XXS : 3.06 bpw quantization
 26 or IQ3_S : 3.44 bpw quantization
 27 or IQ3_M : 3.66 bpw quantization mix
 12 or Q3_K : alias for Q3_K_M
 22 or IQ3_XS : 3.3 bpw quantization
 11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
 12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
 13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
 25 or IQ4_NL : 4.50 bpw non-linear quantization
 30 or IQ4_XS : 4.25 bpw non-linear quantization
 15 or Q4_K : alias for Q4_K_M
 14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
 15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
 17 or Q5_K : alias for Q5_K_M
 16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
 17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
 18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
 7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
 33 or Q4_0_4_4 : 4.34G, +0.4685 ppl @ Llama-3-8B
 34 or Q4_0_4_8 : 4.34G, +0.4685 ppl @ Llama-3-8B
 35 or Q4_0_8_8 : 4.34G, +0.4685 ppl @ Llama-3-8B
 1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
 32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
 0 or F32 : 26.00G @ 7B
 COPY : only copy tensors, no quantizing

The table lists quantization types by ID and name, along with either the resulting size and perplexity increase for a reference model (Llama-3-8B) or the bits per weight (bpw). Perplexity measures how well the model predicts text, so a smaller increase means less quality lost to quantization. Bits per weight is the average size of a quantized tensor weight.
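
If the perplexity column feels abstract: it is derived from the average log-probability the model assigns to the correct next token over some evaluation text, and the "+0.4685 ppl" style entries report how much that score worsens after quantization. A toy illustration of the formula (not llama.cpp’s llama-perplexity implementation, and with made-up probabilities):

import math

# Perplexity from the probabilities a model assigned to the actual next tokens
# of an evaluation text. The numbers here are invented for illustration.
token_probs = [0.45, 0.12, 0.83, 0.30, 0.61]
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(f"perplexity = {ppl:.2f}")  # lower = the model was less "surprised" by the text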

Choosing a quantization depends on your hardware and desired balance between size and quality. A good starting point is “the largest quantization that fits in my VRAM without being too slow.” If not using a GPU, substitute VRAM with RAM. “Largest fit” means fitting without excessive swapping to disk.

To estimate quantized model size, multiply the original size by the approximate bits per weight ratio. For SmolLM2 1.7B-Instruct (3.4GB in BF16), Q8_0 (8 bits per weight, approximately half the original bits) should result in a ~1.7GB model.
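
Expressed as a tiny helper (a rough estimate that ignores GGUF metadata and the handful of tensors kept in f32):

# Rough quantized size: parameter count * bits per weight / 8 bits per byte.
def estimated_size_mib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1024**2

# SmolLM2 1.7B has ~1.71 billion parameters; Q8_0 ends up around 8.5 bpw in practice.
print(estimated_size_mib(1.71e9, 8.5))  # ~1733 MiB, close to what llama-quantize reports below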

Quantize to Q8_0 using llama-quantize. Replace N with your core count:

llama-quantize SmolLM2.gguf SmolLM2.q8.gguf Q8_0 N

Example output:

main: build = 4200 (46c69e0e)
main: built with gcc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: quantizing 'SmolLM2.gguf' to 'SmolLM2.q8.gguf' as Q8_0 using 24 threads
llama_model_loader: loaded meta data with 37 key-value pairs and 218 tensors from SmolLM2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolLM2 1.7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = SmolLM2
llama_model_loader: - kv 5: general.size_label str = 1.7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = SmolLM2 1.7B
llama_model_loader: - kv 9: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 11: general.tags arr[str,4] = ["safetensors", "onnx", "transformers...
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: llama.block_count u32 = 24
llama_model_loader: - kv 14: llama.context_length u32 = 8192
llama_model_loader: - kv 15: llama.embedding_length u32 = 2048
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 17: llama.attention.head_count u32 = 32
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 130000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: general.file_type u32 = 32
llama_model_loader: - kv 22: llama.vocab_size u32 = 49152
llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 33: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 34: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type bf16: 169 tensors
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64
[ 1/ 218] output_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[ 2/ 218] token_embd.weight - [ 2048, 49152, 1, 1], type = bf16, converting to q8_0 .. size = 192.00 MiB -> 102.00 MiB
[ 3/ 218] blk.0.attn_k.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 4/ 218] blk.0.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[ 5/ 218] blk.0.attn_output.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 6/ 218] blk.0.attn_q.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 7/ 218] blk.0.attn_v.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
...
[ 212/ 218] blk.23.attn_output.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 213/ 218] blk.23.attn_q.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 214/ 218] blk.23.attn_v.weight - [ 2048, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 215/ 218] blk.23.ffn_down.weight - [ 8192, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 216/ 218] blk.23.ffn_gate.weight - [ 2048, 8192, 1, 1], type = bf16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 217/ 218] blk.23.ffn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[ 218/ 218] blk.23.ffn_up.weight - [ 2048, 8192, 1, 1], type = bf16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
llama_model_quantize_internal: model size = 3264.38 MB
llama_model_quantize_internal: quant size = 1734.38 MB

main: quantize time = 2289.97 ms
main: total time = 2289.97 ms

The quantized model size is approximately 1.7GB, as predicted. Remember to account for context memory (at least 1GB) when estimating RAM/VRAM requirements. For this Q8_0 quantized SmolLM2 model, at least 3GB of free (V)RAM is recommended. If this is still too high, use a lower quantization type or a smaller model.

Note: Currently, our llama.cpp build uses CPU for calculations, so the model resides in RAM. Ensure at least 3GB of free RAM before using the model.
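
Where does that "at least 1GB for context" figure come from? Most of it is the KV cache, which you can estimate from the metadata printed during conversion (24 blocks, 32 KV heads, 64-dimensional heads for SmolLM2 1.7B). This is a back-of-envelope sketch that assumes the default f16 KV cache and ignores compute buffers and other runtime overhead:

# Rough KV-cache size for a llama-style model with an f16 cache.
def kv_cache_bytes(n_layers, n_ctx, n_head_kv, head_dim, bytes_per_element=2):
    # K and V each hold n_ctx * n_head_kv * head_dim elements per layer
    return 2 * n_layers * n_ctx * n_head_kv * head_dim * bytes_per_element

# SmolLM2 1.7B with llama-server's default 4096-token context:
print(kv_cache_bytes(n_layers=24, n_ctx=4096, n_head_kv=32, head_dim=64) / 1024**2)  # 768 MiB

That 768 MiB matches the "KV self size" reported by llama-server in the next section; add the compute buffers it allocates on top and you end up right around the 1GB mentioned above.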

We now have a quantized model ready for use!

Running llama.cpp Server

After the initial setup, we can finally start using our local LLM. Let’s begin by checking the llama-server executable help options:

llama-server --help

The help output lists various options, some of which will be explained here. For now, the key options are:

-m, --model FNAME model path (default: `models/$filename` with filename from `--hf-file`
 or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
 (env: LLAMA_ARG_MODEL)
--host HOST ip address to listen (default: 127.0.0.1)
 (env: LLAMA_ARG_HOST)
--port PORT port to listen (default: 8080)
 (env: LLAMA_ARG_PORT)

Run llama-server with the model path pointing to your quantized SmolLM2 GGUF file. If port 8080 is free, you can leave host and port defaults. Environment variables can also be used instead of command-line arguments.

llama-server -m SmolLM2.q8.gguf

Successful startup will show output similar to:

build: 4182 (ab96610b) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model 'SmolLM2.q8.gguf'
llama_model_loader: loaded meta data with 37 key-value pairs and 218 tensors from SmolLM2.q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolLM2 1.7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = SmolLM2
llama_model_loader: - kv 5: general.size_label str = 1.7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = SmolLM2 1.7B
llama_model_loader: - kv 9: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 11: general.tags arr[str,4] = ["safetensors", "onnx", "transformers...
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: llama.block_count u32 = 24
llama_model_loader: - kv 14: llama.context_length u32 = 8192
llama_model_loader: - kv 15: llama.embedding_length u32 = 2048
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 17: llama.attention.head_count u32 = 32
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 130000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: general.file_type u32 = 7
llama_model_loader: - kv 22: llama.vocab_size u32 = 49152
llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 33: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 34: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q8_0: 169 tensors
llm_load_vocab: special tokens cache size = 17
llm_load_vocab: token to piece cache size = 0.3170 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48900
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 130000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 1.71 B
llm_load_print_meta: model size = 1.69 GiB (8.50 BPW)
llm_load_print_meta: general.name = SmolLM2 1.7B Instruct
llm_load_print_meta: BOS token = 1 '<|im_start|>'
llm_load_print_meta: EOS token = 2 '<|im_end|>'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 2 '<|im_end|>'
llm_load_print_meta: LF token = 143 'Ä'
llm_load_print_meta: EOG token = 0 '<|endoftext|>'
llm_load_print_meta: EOG token = 2 '<|im_end|>'
llm_load_print_meta: max token length = 162
llm_load_tensors: CPU_Mapped model buffer size = 1734.38 MiB
................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 130000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.19 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 774
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle

Access the web UI at http://127.0.0.1:8080.

The Web User Interface for llama.cpp server.

You can now chat with the LLM via the web UI or use the OpenAI-compatible API provided by llama-server. For API details, refer to the llama-server source code and README. A Python library is also available for interacting with the API: unreasonable-llama.

The web UI (left panel) stores conversations in the browser’s localStorage, persisting across browser restarts. Changing server host/port will reset these conversations. The current conversation is used as context for the LLM, limited by server settings (adjustable later). Start new conversations frequently and keep them focused for optimal performance.

The top-right UI buttons are (left to right): “remove conversation,” “download conversation” (JSON), “configuration,” and “Theme.” The configuration window allows adjusting generation settings (currently global, not per-conversation). These settings are further explained below.

Configuration Settings in the llama.cpp Web UI.

llama.cpp Server Settings

The web UI exposes a limited subset of configuration options, primarily sampler settings. For a full range of settings, use llama-server --help. These options significantly affect model behavior and generation performance.

Most parameters are configurable via environment variables (listed in llama-server --help and included below).

Key common params options:

  • --threads/--threads-batch (LLAMA_ARG_THREADS): CPU threads for LLM processing. Default -1 auto-detects CPU cores, suitable for most users.
  • --ctx-size (LLAMA_ARG_CTX_SIZE): Prompt context size (tokens remembered). Increasing context size increases memory usage. 0 attempts to use the model’s maximum context size.
  • --predict (LLAMA_ARG_N_PREDICT): Number of tokens to generate. -1 generates indefinitely, -2 limits to context size.
  • --batch-size/--ubatch-size (LLAMA_ARG_BATCH/LLAMA_ARG_UBATCH): Tokens processed per step. Defaults are generally sufficient, but experimentation is encouraged.
  • --flash-attn (LLAMA_ARG_FLASH_ATTN): Enables Flash Attention optimization for supported models, potentially improving performance.
  • --mlock (LLAMA_ARG_MLOCK): Prevents OS swapping model memory to disk if sufficient RAM/VRAM is available, potentially improving performance but potentially slowing down other processes if memory limits are reached.
  • --no-mmap (LLAMA_ARG_NO_MMAP): Disables memory mapping of the model (enabled by default).
  • --gpu-layers (LLAMA_ARG_N_GPU_LAYERS): Offloads up to this number of model layers to the GPU (if GPU support is built-in). Set to a high number like 999 to attempt full GPU offloading. For partial offloading, experiment. Requires llama.cpp built with GPU support. For multi-GPU systems, see --split-mode and --main-gpu.
  • --model (LLAMA_ARG_MODEL): Path to the GGUF model file.

sampling params options are detailed below. Server-specific arguments:

  • --no-context-shift (LLAMA_ARG_NO_CONTEXT_SHIFT): Disables context shifting. When context is full, generation stops instead of discarding oldest tokens.
  • --cont-batching (LLAMA_ARG_CONT_BATCHING): Continuous batching for parallel prompt processing and generation (enabled by default for performance). Disable with --no-cont-batching (LLAMA_ARG_NO_CONT_BATCHING).
  • --alias (LLAMA_ARG_ALIAS): Alias for the model name in the REST API (defaults to model name).
  • --host (LLAMA_ARG_HOST) and --port (LLAMA_ARG_PORT): Server host and port.
  • --slots (LLAMA_ARG_ENDPOINT_SLOTS): Enables /slots endpoint.
  • --props (LLAMA_ARG_ENDPOINT_PROPS): Enables /props endpoint.
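
Many of the generation-related options above also have per-request counterparts in the server’s HTTP API, so you can tweak them without restarting llama-server. A small sketch against the native /completion endpoint, assuming the requests package is installed and the server is running on the default host and port (field names follow the llama-server README at the time of writing):

# Per-request overrides through llama-server's native /completion endpoint.
# Assumptions: pip install requests; server listening on 127.0.0.1:8080.
import requests

payload = {
    "prompt": "The highest mountain on earth",
    "n_predict": 64,     # limit generated tokens for this request only
    "temperature": 0.8,  # sampler settings can be overridden per request, too
    "top_k": 40,
}
response = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
print(response.json()["content"])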

Other llama.cpp Tools

llama.cpp includes other useful command-line tools besides the web server.

llama-bench

llama-bench benchmarks prompt processing and text generation speed for a selected model. Run it with the model path:

llama-bench --model selected_model.gguf

llama-bench attempts optimal llama.cpp configuration for your hardware (full GPU offloading, mmap). Enable flash attention manually with --flash-attn. Adjust prompt length (--n-prompt) and batch sizes (--batch-size/--ubatch-size) to influence prompt processing benchmarks. Adjust generated token count (--n-gen) for text generation benchmarks. Use --repetitions to set benchmark repetitions.

Example CPU-only llama-bench results (SmolLM2 1.7B Q8, Ryzen 5900X, DDR4 3200MHz):

> llama-bench --flash-attn 1 --model ./SmolLM2.q8.gguf

| model                      |     size | params   | backend   |   threads |   fa | test        | t/s             |
|:---------------------------|---------:|:---------|:----------|----------:|-----:|:------------|:----------------|
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | CPU       |        12 |    1 | pp512       | 162.54 ± 1.70   |
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | CPU       |        12 |    1 | tg128       | 22.50 ± 0.05    |

build: dc223440 (4215)

test column: pp (prompt processing), tg (text generation) followed by prompt/generation token count. t/s is tokens processed/generated per second.

-pg argument performs mixed prompt processing + text generation tests:

> llama-bench --flash-attn 1 --model ./SmolLM2.q8.gguf -pg 1024,256

| model                      |     size | params   | backend   |   threads |   fa | test          | t/s             |
|:---------------------------|---------:|:---------|:----------|----------:|-----:|:--------------|:----------------|
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | CPU       |        12 |    1 | pp512         | 165.50 ± 1.95   |
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | CPU       |        12 |    1 | tg128         | 22.44 ± 0.01    |
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | CPU       |        12 |    1 | pp1024+tg256  | 63.51 ± 4.24    |

build: dc223440 (4215)

-pg benchmark is often more representative of real-world usage with continuous batching.

llama-cli

llama-cli provides a simple command-line interface for text completion and chat.

It shares many arguments with llama-server, with some specific options:

  • --prompt: Sets the starting/system prompt; it can also be loaded from a file using --file or --binary-file.
  • --color: Enables colored output.
  • --no-context-shift (LLAMA_ARG_NO_CONTEXT_SHIFT): Same as in llama-server.
  • --reverse-prompt: Stop generation and return control when a reverse prompt (stopping word/sentence) is generated.
  • --conversation: Enables conversation mode (interactive, no special token printing).
  • --interactive: Enables interactive mode (chatting). Use --prompt for initial output or --interactive-first for immediate chat control.

Usage examples:

Text Completion

> llama-cli --flash-attn --model ./SmolLM2.q8.gguf --prompt "The highest mountain on earth"
build: 4215 (dc223440) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 37 key-value pairs and 218 tensors from ./SmolLM2.q8.gguf (version GGUF V3 (latest))
...
llm_load_tensors: CPU_Mapped model buffer size = 1734,38 MiB
................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 130000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 768,00 MiB
llama_new_context_with_model: KV self size = 768,00 MiB, K (f16): 384,00 MiB, V (f16): 384,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,19 MiB
llama_new_context_with_model: CPU compute buffer size = 104,00 MiB
llama_new_context_with_model: graph nodes = 679
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

sampler seed: 2734556630
sampler params:
 repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
 dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = -1
 top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

The highest mountain on earth is Mount Everest, which stands at an astonishing 8,848.86 meters (29,031.7 feet) above sea level. Located in the Mahalangur Sharhungtrigangla Range in the Himalayas, it's a marvel of nature that draws adventurers and thrill-seekers from around the globe.

Standing at the base camp, the mountain appears as a majestic giant, its rugged slopes and snow-capped peaks a testament to its formidable presence. The climb to the summit is a grueling challenge that requires immense physical and mental fortitude, as climbers must navigate steep inclines, unpredictable weather, and crevasses.

The ascent begins at Base Camp, a bustling hub of activity, where climbers gather to share stories, exchange tips, and prepare for the climb ahead. From Base Camp, climbers make their way to the South Col, a precarious route that offers breathtaking views of the surrounding landscape. The final push to the summit involves a grueling ascent up the steep and treacherous Lhotse Face, followed by a scramble up the near-vertical wall of the Western Cwm.

Upon reaching the summit, climbers are rewarded with an unforgettable sight: the majestic Himalayan range unfolding before them, with the sun casting a golden glow on the snow. The sense of accomplishment and awe is indescribable, and the experience is etched in the memories of those who have conquered this mighty mountain.

The climb to Everest is not just about reaching the summit; it's an adventure that requires patience, perseverance, and a deep respect for the mountain. Climbers must be prepared to face extreme weather conditions, altitude sickness, and the ever-present risk of accidents or crevasses. Despite these challenges, the allure of Everest remains a powerful draw, inspiring countless individuals to push their limits and push beyond them. [end of text]


llama_perf_sampler_print: sampling time = 12,58 ms / 385 runs ( 0,03 ms per token, 30604,13 tokens per second)
llama_perf_context_print: load time = 318,81 ms
llama_perf_context_print: prompt eval time = 59,26 ms / 5 tokens ( 11,85 ms per token, 84,38 tokens per second)
llama_perf_context_print: eval time = 17797,98 ms / 379 runs ( 46,96 ms per token, 21,29 tokens per second)
llama_perf_context_print: total time = 17891,23 ms / 384 tokens

Chat Mode

> llama-cli --flash-attn --model ./SmolLM2.q8.gguf --prompt "You are a helpful assistant" --conversation
build: 4215 (dc223440) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 37 key-value pairs and 218 tensors from ./SmolLM2.q8.gguf (version GGUF V3 (latest))
...
llm_load_tensors: CPU_Mapped model buffer size = 1734,38 MiB
................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 130000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 768,00 MiB
llama_new_context_with_model: KV self size = 768,00 MiB, K (f16): 384,00 MiB, V (f16): 384,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,19 MiB
llama_new_context_with_model: CPU compute buffer size = 104,00 MiB
llama_new_context_with_model: graph nodes = 679
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 968968654
sampler params:
 repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
 dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = -1
 top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are a helpful assistant

> hi
Hello! How can I help you today?

>
llama_perf_sampler_print: sampling time = 0,27 ms / 22 runs ( 0,01 ms per token, 80291,97 tokens per second)
llama_perf_context_print: load time = 317,46 ms
llama_perf_context_print: prompt eval time = 2043,02 ms / 22 tokens ( 92,86 ms per token, 10,77 tokens per second)
llama_perf_context_print: eval time = 407,66 ms / 9 runs ( 45,30 ms per token, 22,08 tokens per second)
llama_perf_context_print: total time = 5302,60 ms / 31 tokens
Interrupted by user

Building llama.cpp, But Better

Now that we can use llama.cpp and adjust runtime parameters, let’s improve the build configuration. We set basic settings earlier, but haven’t explored backend-specific options yet.

llama.cpp backend is provided by the ggml library (from the same author). ggml includes optimized math operations and hardware accelerations for LLMs. Let’s start by cleaning the llama.cpp repository and ensuring we have the latest version:

cd llama.cpp
git clean -xdf
git pull
git submodule update --recursive

Generate build files for a custom backend. ggml currently supports these backends:

  • Metal: Apple Silicon acceleration.
  • Accelerate: BLAS for macOS (default).
  • OpenBLAS: BLAS for CPUs.
  • BLIS: High-performance BLAS framework.
  • SYCL: Intel GPUs (Data Center Max, Flex, Arc, integrated GPUs).
  • Intel oneMKL: Intel CPUs.
  • CUDA: NVIDIA GPUs.
  • MUSA: Moore Threads GPUs.
  • hipBLAS: AMD GPUs.
  • Vulkan: Generic GPU acceleration.
  • CANN: Ascend NPU acceleration.

llama.cpp also supports Android.

Recommended backend selection:

  • No GPU: Intel oneMKL (Intel CPUs), BLIS/OpenBLAS.
  • NVIDIA GPU: CUDA or Vulkan.
  • AMD GPU: Vulkan or ROCm (Vulkan generally more stable).
  • Intel GPU: SYCL or Vulkan.

Vulkan offers generic GPU acceleration and is relatively simple to build, making it a good choice for NVIDIA, AMD, or Intel GPUs. Build processes are similar for each backend: install dependencies, generate build files with the backend flag, and build.

Note that Python and its dependencies are only used for the model conversion scripts, so they are not performance-critical for running the models themselves.

Before building with Vulkan, install the Vulkan SDK.

On Windows, MSYS is the easiest way to install Vulkan SDK dependencies.

pacman -S git \
 mingw-w64-ucrt-x86_64-gcc \
 mingw-w64-ucrt-x86_64-cmake \
 mingw-w64-ucrt-x86_64-vulkan-devel \
 mingw-w64-ucrt-x86_64-shaderc

For non-MSYS Windows or Linux, follow the documentation.

On Linux, install Vulkan SDK using your distribution’s package manager if available.

Generate build files with Vulkan enabled (GGML_VULKAN=ON). Replace /your/install/dir if needed:

cmake -S . -B build -G Ninja -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/your/install/dir -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON

Build and install (replace X with core count):

cmake --build build --config Release -j X
cmake --install build --config Release

Ignore any warnings during build. llama.cpp binaries should now support Vulkan. Test with llama-server or llama-cli --list-devices:

> llama-cli --list-devices

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64
Available devices:
 Vulkan0: AMD Radeon RX 7900 XT (20464 MiB, 20464 MiB free)

Previous CPU-only builds would show an empty device list:

> llama-cli --list-devices

Available devices:

Compare llama-bench results with Vulkan:

> llama-bench --flash-attn 1 --model ./SmolLM2.q8.gguf -pg 1024,256

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64
| model                      |     size | params   | backend   |   ngl |   fa | test          | t/s              |
|:---------------------------|---------:|:---------|:----------|------:|-----:|:--------------|:-----------------|
ggml_vulkan: Compiling shaders..............................Done!
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | Vulkan    |    99 |    1 | pp512         | 880.55 ± 5.30    |
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | Vulkan    |    99 |    1 | tg128         | 89.78 ± 1.66     |
| llama ?B Q8_0              | 1.69 GiB | 1.71 B   | Vulkan    |    99 |    1 | pp1024+tg256  | 115.25 ± 0.83    |

The Vulkan backend significantly improves performance over the previous CPU-only build: roughly 5.3x faster prompt processing, 4x faster text generation, and 1.8x faster on the mixed workload.
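
Note that llama-bench offloads layers to the GPU on its own (the ngl column in the table above), while for llama-cli and llama-server you control offloading with the --n-gpu-layers (-ngl) argument. A minimal sketch, reusing the SmolLM2 GGUF from earlier:

llama-server --model ./SmolLM2.q8.gguf --n-gpu-layers 99 --port 8080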

Need Better CPU Support?

Vulkan is a good generic GPU backend. For CPU-only setups, ggml supports CPU instruction sets like AVX for faster math operations.

ggml configuration options are defined in ggml/CMakeLists.txt and can be set with -DVARIABLE_NAME=VALUE when generating the CMake project. You can use a tool like HWInfo to check which instruction sets your CPU supports.

(Image: example of CPU features detected by HWInfo.)

The CPU-related options are listed below, with an example build command after the list:

  • GGML_CPU: Enables CPU backend (default).
  • GGML_CPU_HBM: Enables memkind allocator for CPU HBM memory systems.
  • GGML_CPU_AARCH64: AARCH64 (Armv8-A+) support.
  • GGML_AVX: AVX support (default if not cross-compiling).
  • GGML_AVX_VNNI: AVX-VNNI (Intel Alder Lake+, AMD Zen 5+). Enable if supported.
  • GGML_AVX2: AVX2 support (default if not cross-compiling).
  • GGML_AVX512: AVX512F support. Enable if supported.
  • GGML_AVX512_VBMI/VNNI/BF16: Optimizations for AVX512 subsets. Enable based on CPU support.
  • GGML_LASX/LSX: LoongArch (Loongson CPUs) support.
  • GGML_RVV: RISC-V Vector Extension (RISC-V CPUs).
  • GGML_SVE: ARM Scalable Vector Extensions (Armv9 Cortex-A510+).

MSVC does not support these options.
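
As an example, a CPU-only build for a machine that reports AVX, AVX2, and AVX512F support could be generated like this, mirroring the Vulkan invocation above (enable only the flags your CPU actually supports):

cmake -S . -B build -G Ninja -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_AVX512=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/your/install/dir -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j X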

CUDA/ROCm/BLAS?

ROCm will be covered in a separate post. For CUDA and BLAS, consult the documentation. Building with these backends is similar to CPU and Vulkan builds if your environment is set up correctly.
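
For reference, a CUDA build follows the same pattern as the Vulkan build above, just with a different backend flag (this sketch assumes the CUDA toolkit is already installed and detectable by CMake):

cmake -S . -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/your/install/dir -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j X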

LLM Configuration Options Explained

This section details LLM configuration options and sampling methods.

How Does an LLM Generate Text?

  1. Prompt: Input text in human-readable format, often with special tags for structuring conversations.
    Example chat prompt:

     <|im_start|>system
     You are a helpful assistant<|im_end|>
     <|im_start|>user
     Hello<|im_end|>
     <|im_start|>assistant
     Hi there<|im_end|>
     <|im_start|>user
     How are you?<|im_end|>
     <|im_start|>assistant
  2. Tokenization: Prompt is converted into numerical tokens using a vocabulary (e.g., tokenizer.json in SmolLM2 repo). Special tokens indicate message boundaries, system prompts, user messages, and LLM responses.
    Tokenization example using llama-server API:

     curl -X POST -H "Content-Type: application/json" -d '{"content": "hello world! this is an example message!"}' http://127.0.0.1:8080/tokenize

    SmolLM2 response:

     {"tokens":[28120,905,17,451,314,1183,3714,17]}

    Detokenization to revert to text:

     curl -X POST -H "Content-Type: application/json" -d '{"tokens": [28120,905,17,451,314,354,1183,3714,17]}' http://127.0.0.1:8080/detokenize
     {"content":"hello world! this is an example message!"}
  3. LLM Processing: The tokenized prompt is processed by the LLM, which involves many large matrix operations. At each step, the LLM outputs a probability distribution over its entire vocabulary, indicating how likely each token is to be the next one.

  4. Token Sampling: Samplers select a single token from the probability distribution. Various samplers exist with different algorithms to influence token selection.

  5. Detokenization: Generated tokens are converted back into human-readable text. (The example request after this list runs steps 2-5 in a single call.)
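
The llama-server /completion endpoint wires these steps together: it tokenizes the prompt, runs the model with the configured samplers, and returns the detokenized text. A minimal sketch (the prompt and n_predict value are arbitrary):

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Hello! ", "n_predict": 32}' http://127.0.0.1:8080/completion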

List of LLM Configuration Options and Samplers in llama.cpp

  • System Message: Initial instruction to guide the LLM’s behavior. Crucial for influencing output quality.
  • Temperature: Controls randomness. Higher values increase randomness, lower values make output more focused. Range 0.2-2.0 is a good starting point.
  • Dynamic Temperature: Adjusts temperature based on token entropy (model confidence). Article and Reddit post provide details.
    • Dynatemp Range: Range of temperature adjustment.
    • Dynatemp Exponent: Controls how entropy influences temperature (see image).
  • Top-K: Keeps only the top K most probable tokens. Higher K increases text diversity.
  • Top-P (Nucleus Sampling): Limits tokens to those with a cumulative probability of at least p. Article provides more detail.
  • Min-P: Limits tokens based on minimum probability relative to the most likely token. Paper and image explain it.
  • Exclude Top Choices (XTC): Removes most likely tokens under certain conditions. Reddit post and PR with image provide details.
    • XTC Threshold: Probability cutoff for top tokens (0-1 range).
    • XTC Probability: Probability of XTC being applied (0-1 range, 0=disabled, 1=always enabled).
  • Locally Typical Sampling (Typical-P): Sorts and limits tokens based on log-probability and entropy difference. Paper and Reddit discussion for more info.
  • DRY: Prevents token repetition by penalizing tokens that create repeating sequences. PR and image explain it.
    • DRY Multiplier: Penalty multiplier.
    • DRY Base: Penalty base.
    • DRY Allowed Length: Repetition length before penalty is applied.
    • DRY Penalty Last N: Tokens to scan for repetition (-1=whole context, 0=disabled).
    • DRY Sequence Breakers: Sentence separators for DRY (defaults: '\n', ':', '"', '*').
  • Mirostat: Overrides Top-K, Top-P, and Typical-P. Controls perplexity (entropy). Paper.
    • Mirostat Version: 0=disabled, 1=Mirostat, 2=Mirostat 2.0.
    • Mirostat Learning Rate (η): Convergence speed to target perplexity.
    • Mirostat Target Entropy (τ): Desired perplexity.
  • Max Tokens: Maximum tokens to generate. -1 generates until EOS token or context limit.
  • Repetition Penalty: Reduces probability of repeated tokens in the last N tokens of context.
    • Repeat Last N: Tokens to consider for repetition penalty.
    • Repeat Penalty: Penalty factor (>1.0 enables penalty).
    • Presence Penalty: Logit bias based on token presence.
    • Frequency Penalty: Logit bias based on token frequency.

Sampler order and selection can be configured in the "Other sampler settings" field, where each sampler is identified by a single letter:

  • d: DRY
  • k: Top-K
  • y: Typical-P
  • p: Top-P
  • m: Min-P
  • x: Exclude Top Choices (XTC)
  • t: Temperature

The web UI might not expose all settings (e.g., Mirostat), but they can still be configured via environment variables, CLI arguments, or the server API, as in the example below.
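
For example, a handful of the sampler settings above can be passed directly in a llama-server completion request. A minimal sketch with arbitrary values (parameter names for newer samplers such as DRY and XTC may differ between llama.cpp versions, so check your build's server documentation):

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Hello! ", "n_predict": 64, "temperature": 0.7, "top_k": 40, "top_p": 0.9, "min_p": 0.05, "repeat_penalty": 1.1}' http://127.0.0.1:8080/completion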

Final Thoughts

This has been a comprehensive guide to running local LLMs. LLM development moves at a rapid pace, but llama.cpp offers a relatively stable platform for experimentation. Hopefully, this guide is helpful and inspires you to explore the world of local LLMs.

Future posts may cover automated benchmarking scripts and data analysis. Feedback and questions are welcome in the comments.

Bonus: Model Recommendations

LLM Explorer is a useful search engine for LLMs.

Recommended models:

  • Google Gemma 2 9B SimPO: Gemma fine-tune with distinct response style.
  • Meta Llama 3.1/3.2: Llama 3.1 8B Instruct is a good general-purpose model. Numerous fine-tunes are available.
  • Microsoft Phi 3.5: Small models, including a larger MoE version.
  • Qwen/CodeQwen 2.5: Alibaba models, currently among the best open-source options. CodeQwen 14B is a strong daily driver.

Post-Mortem

Feedback from Reddit has helped improve this post. Ongoing refinements and additions may occur.

  • 2024-12-03: LLM text generation section updated based on feedback.
  • 2024-12-10: Backend section updated with CPU optimization info.
  • 2024-12-25: Python 3.12 recommendation added due to PyTorch issues in 3.13.
