llama.cpp
What is llama.cpp?
llama.cpp is a plain C/C++ framework for running LLM inference locally or in the cloud. It loads models in the GGUF format, including LLaMA, Mistral, Mixtral, and Gemma, with no external dependencies. It ships CLI tools, quantization utilities, and an OpenAI-compatible API server, and is built for maximum performance across CPU, GPU, and hybrid setups, including Apple Silicon, CUDA, Vulkan, and more.
How to use llama.cpp
Install it via Homebrew, Nix, Winget, or Docker, or build from source. Run models with llama-cli, either from local GGUF files or pulled directly from Hugging Face with the -hf flag. To expose an API, start llama-server. Convert and quantize models to GGUF with the bundled scripts or with Hugging Face Spaces such as GGUF-my-repo and GGUF-my-LoRA.
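The install-and-run flow above can be sketched as a short shell session. This is a sketch, not a definitive recipe: the Hugging Face repository name and the GGUF filenames are illustrative assumptions, and the Homebrew path assumes a macOS/Linux machine with brew installed.

```shell
# Install via Homebrew (one of several supported package managers)
brew install llama.cpp

# Chat with a model pulled straight from Hugging Face via the -hf flag
# (repository name is illustrative)
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Quantize a local GGUF model with the bundled tool
# (input/output filenames are illustrative)
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Start the OpenAI-compatible API server on port 8080
llama-server -hf ggml-org/gemma-3-1b-it-GGUF --port 8080
```

The same binaries are available from the Nix, Winget, and Docker distributions, or from a source build.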
llama.cpp features
Quantization Support
Integer quantization from 1.5-bit to 8-bit formats for smaller models and faster inference.
Multimodal Support
Models like LLaVA and Bunny.
OpenAI-Compatible Server
llama-server for local API access.
Cross-Platform Inference
Metal, CUDA, HIP, Vulkan, SYCL.
GGUF Model CLI
Includes llama-cli, llama-bench, llama-perplexity.
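Since llama-server speaks the OpenAI chat-completions protocol, any OpenAI client can talk to it. A minimal sketch with plain curl, assuming a server already running on localhost:8080 (host, port, and prompt are illustrative):

```shell
# Query a running llama-server instance through its
# OpenAI-compatible /v1/chat/completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```

Because the endpoint shape matches OpenAI's, existing SDKs work by pointing their base URL at the local server instead of api.openai.com.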
llama.cpp pricing
Free
Free
- Full access to the source code, CLI tools, model quantization, inference, and API server. No subscriptions.
