vLLM

Original authors	Sky Computing Lab, University of California, Berkeley
Developer	vLLM contributors
Initial release	2023
Written in	Python, CUDA, C++, Rust
Type	Large language model inference engine
License	Apache License 2.0
Website	vllm.ai
Repository	github.com/vllm-project/vllm

vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab,^[1] the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.^[2]^[3]^[4]

History

vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.^[2]^[3] Its core ideas were described that year in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention",^[5] which presented the system as a high-throughput, memory-efficient serving engine for large language models.^[3]

According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.^[6]

According to PyTorch's project page, the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.^[4] In May 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project.^[7]

In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.^[8]

Architecture

According to its 2023 paper, vLLM was designed to make large language model serving more efficient by reducing memory waste in the key–value cache used during transformer inference.^[3] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems. It described vLLM as combining block-level memory management with request scheduling to increase throughput at latency comparable to that of earlier serving systems.^[3]
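The block-level idea can be illustrated with a minimal sketch. The class below is a toy allocator written for this article, not vLLM's implementation: the names `PagedKVCache`, `append_token`, and `free`, and the block sizes used, are all assumptions chosen for illustration. It shows how a per-sequence block table can map logical token positions to fixed-size physical blocks that are allocated only on demand.

```python
class PagedKVCache:
    """Toy block-level KV-cache allocator (illustrative only, not vLLM's code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                           # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:   # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        # (physical block, offset) where this token's keys and values would be written
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return every block of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]  # three tokens need two blocks
```

Because blocks are claimed one at a time, a sequence wastes at most one partially filled block in this sketch, rather than a region reserved up front for its maximum possible length.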

The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference.^[2]^[4] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.^[7]^[4]
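Of the features listed above, continuous batching is commonly understood as scheduling at the level of individual decoding steps, so that a waiting request can join the batch as soon as another finishes. The simulation below is a simplified sketch under that assumption. It does not reproduce vLLM's scheduler, and the function name `continuous_batching` and the `(request_id, tokens_to_generate)` input format are hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate step-level scheduling of generation requests.

    requests: list of (request_id, tokens_to_generate) pairs (hypothetical format).
    Returns a dict mapping each request_id to the step at which it finished.
    """
    waiting = deque(requests)
    running = {}       # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as a slot is free,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            running[req_id] = n_tokens
        step += 1
        # One decoding step yields one token for every running request.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]
                finished_at[req_id] = step
    return finished_at


result = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
```

In this run, request "c" takes the slot released by "a" after the first step and finishes alongside "b". A scheduler that waited for the whole first batch to complete would start "c" only after "b" was done.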

References

[1] "vLLM - A High-Throughput and Memory-Efficient Inference and Serving Engine for LLMs". UC Berkeley, Sky Computing Lab.
[2] "GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs". GitHub. GitHub, Inc. Retrieved April 22, 2026.
[3] Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Retrieved April 22, 2026.
[4] "vLLM". PyTorch. PyTorch Foundation. Retrieved April 22, 2026.
[5] Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (September 12, 2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". arXiv:2309.06180 [cs.LG].
[6] "vLLM full name". GitHub. GitHub, Inc. August 23, 2023. Retrieved April 22, 2026.
[7] "PyTorch Foundation Welcomes vLLM as a Hosted Project". PyTorch. PyTorch Foundation. May 7, 2025. Retrieved April 22, 2026.
[8] Temkin, Marina (January 22, 2026). "Inference startup Inferact lands $150M to commercialize vLLM". TechCrunch. Retrieved April 22, 2026.
