vLLM

High-throughput LLM serving on GPUs, built around PagedAttention. A widely used open-source inference engine for production serving as of 2026.

From Wikipedia

vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab, the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.
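To make the memory-management idea concrete, here is a minimal sketch of paged key-value cache bookkeeping. It is a toy illustration, not vLLM's implementation: the class `ToyBlockManager`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for this example. The point it shows is that each sequence's cache is stored in fixed-size blocks that need not be contiguous, with a per-sequence block table mapping logical positions to physical blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, an assumption)


@dataclass
class ToyBlockManager:
    """Toy allocator: maps each sequence to a list of fixed-size physical blocks."""

    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [physical block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored so far

    def __post_init__(self):
        # Initially every physical block is free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:
            # The sequence is new or its last block is full: grab any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated one at a time as a sequence grows, at most one partially filled block per sequence is wasted, and blocks released by a finished request can be reused immediately by another. That property is what makes this style of allocation a natural fit for continuous batching, where requests join and leave the batch at different times.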


