llama.cpp

What is llama.cpp?

llama.cpp is a plain C/C++ framework for running LLM inference locally or in the cloud. It supports models in the GGUF format, including LLaMA, Mistral, Mixtral, and Gemma, with no external dependencies. It ships with CLI tools, quantization utilities, and an OpenAI-compatible API server, and is built for maximum performance across CPU, GPU, and hybrid setups, including Apple Silicon, CUDA, Vulkan, and more.

How to use llama.cpp

Install via Homebrew, Nix, Winget, or Docker, or build from source. Run local models with llama-cli, or pull them straight from Hugging Face with the -hf flag. To serve an OpenAI-compatible API, use llama-server. Convert and quantize models to GGUF using the bundled scripts or Hugging Face Spaces such as GGUF-my-repo and GGUF-my-LoRA.
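As a sketch, the install-and-run flow above might look like this (the Hugging Face model repo is just an example; any GGUF repo works with -hf):

```shell
# Install a prebuilt binary via Homebrew (Nix, Winget, and Docker also work)
brew install llama.cpp

# Run a prompt against a quantized GGUF model pulled directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -p "Explain GGUF in one sentence."

# Start a local OpenAI-compatible API server (listens on port 8080 by default)
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

On first use, -hf downloads and caches the model; subsequent runs load it from the local cache.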

llama.cpp features

Quantization Support

Integer quantization from 1.5-bit to 8-bit formats, cutting memory use and speeding up inference.
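A hedged sketch of the quantization workflow, using the conversion script and llama-quantize tool that ship with the repo (file and directory names here are placeholders):

```shell
# Convert a Hugging Face checkpoint to an f16 GGUF file
python convert_hf_to_gguf.py ./my-hf-model --outfile my-model-f16.gguf

# Quantize down to Q4_K_M, one of the supported low-bit formats
llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
```

Lower-bit formats trade some accuracy for smaller files and faster inference; Q4_K_M is a common middle ground.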

Multimodal Support

Vision-language models such as LLaVA and Bunny.

OpenAI-Compatible Server

llama-server exposes a local HTTP server with an OpenAI-compatible API.
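Because llama-server speaks the OpenAI chat-completions protocol, any OpenAI client can point at it. A minimal sketch of the request body (the default port is 8080; the "model" value is a placeholder, since the server already has a model loaded):

```python
import json

# llama-server's OpenAI-compatible chat endpoint (default local address)
url = "http://localhost:8080/v1/chat/completions"

# Request body following the OpenAI chat-completions schema
payload = {
    "model": "local-model",  # placeholder; the server uses its loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is GGUF?"},
    ],
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

POSTing this body to the URL (with curl, requests, or the official openai client configured with a custom base_url) returns a standard chat-completion response.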

Cross-Platform Inference

Backends for Metal (Apple Silicon), CUDA, HIP, Vulkan, and SYCL, plus optimized CPU inference.

GGUF Model CLI

Includes llama-cli, llama-bench, and llama-perplexity for inference, benchmarking, and model evaluation.

llama.cpp pricing

Free
  • Full access to the source code, CLI tools, model quantization, inference, and API server. Open source, with no subscriptions.