vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab,[1] the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.[2][3][4]
| vLLM | |
|---|---|
| Original authors | Sky Computing Lab, University of California, Berkeley |
| Developer | vLLM contributors |
| Initial release | 2023 |
| Written in | Python, CUDA, C++, Rust |
| Type | Large language model inference engine |
| License | Apache License 2.0 |
| Website | vllm |
| Repository | github |
History
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.[3][2] Its core ideas were described in the 2023 paper Efficient Memory Management for Large Language Model Serving with PagedAttention,[5] which presented the system as a high-throughput and memory-efficient serving engine for large language models.[3]
According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.[6]
PyTorch's project page states that the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.[7][4] In May 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project.
In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.[8]
Architecture
According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference.[3] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput while maintaining similar latency.[3]
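The block-level idea can be illustrated with a toy allocator. The following is a minimal sketch, not vLLM's implementation: the class `PagedKVCache`, its methods, and the block size are hypothetical names and values chosen for illustration. It tracks only which physical block holds each token position and stores no actual key or value tensors.

```python
class OutOfBlocks(RuntimeError):
    """Raised when the toy pool has no free physical blocks left."""


class PagedKVCache:
    """Toy block table for a paged key-value cache (illustrative only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                # token slots per block
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> physical block ids
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return (block_id, offset).

        A new physical block is taken from the pool only when the
        sequence's last block is full, so memory grows on demand.
        """
        n = self.lengths.get(seq_id, 0)
        offset = n % self.block_size
        if offset == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise OutOfBlocks("no free KV-cache blocks")
            block_id = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block_id)
        self.lengths[seq_id] = n + 1
        return self.block_tables[seq_id][-1], offset

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        print(cache.append_token("a"))
    print(cache.block_tables)
```

In this sketch a sequence's blocks need not be contiguous, and the only unused space is the tail of its last block, which is at most `block_size - 1` slots per sequence.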
The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference.[2][4] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.[7][4]
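Continuous batching, one of the features listed above, can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not vLLM's scheduler: the function `continuous_batching` and its parameters are hypothetical, every request is assumed to be queued at the start, and each step generates exactly one token per running request.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling of decode steps.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns a dict mapping request_id to the 1-based step at which
    that request produced its final token.
    """
    for _, n in requests:
        if n < 1:
            raise ValueError("each request must generate at least one token")

    waiting = deque(requests)
    running = {}       # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as a batch slot is free,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_at[rid] = step
    return finished_at


if __name__ == "__main__":
    print(continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2))
```

In the usage example, request `c` takes the slot freed by `a` after the first step and finishes at step 3. A static batch of `a` and `b` would hold that slot until `b` completed, delaying `c` to step 5.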
References
1. "vLLM - A High-Throughput and Memory-Efficient Inference and Serving Engine for LLMs". UC Berkeley, Sky Computing Lab.
2. "GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs". GitHub. GitHub, Inc. Retrieved April 22, 2026.
3. Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Retrieved April 22, 2026.
4. "vLLM". PyTorch. PyTorch Foundation. Retrieved April 22, 2026.
5. Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (September 12, 2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". arXiv:2309.06180 [cs.LG].
6. "vLLM full name". GitHub. GitHub, Inc. August 23, 2023. Retrieved April 22, 2026.
7. "PyTorch Foundation Welcomes vLLM as a Hosted Project". PyTorch. PyTorch Foundation. May 7, 2025. Retrieved April 22, 2026.
8. Temkin, Marina (January 22, 2026). "Inference startup Inferact lands $150M to commercialize vLLM". TechCrunch. Retrieved April 22, 2026.