Context & Why Optimize LLM Inference?
Large Language Models (LLMs) like Llama, Mixtral, and GPT have become ubiquitous. They are widely used in virtual assistants, content creation, and data analysis, opening up many possibilities. However, deploying LLMs into production environments is challenging, especially regarding performance.
When deploying LLM inference, the two biggest challenges are latency and throughput. For a good user experience, the model needs to respond quickly. The system must also handle many requests simultaneously, which helps optimize infrastructure costs. Additionally, LLMs often consume a lot of VRAM. This makes it difficult to run multiple models or process large batches on common GPUs.
In the past, I also struggled to optimize LLM inference in production environments, especially with large models and high request volumes. How to simultaneously reduce latency, increase concurrent processing capabilities, and efficiently utilize GPU resources was a difficult problem. Traditional solutions like static batching or kernel optimization often only addressed a small part of the issue.
After much research and experimentation, vLLM finally proved to be exceptionally effective. vLLM is a highly efficient library designed to accelerate LLM inference using the PagedAttention algorithm. Unlike traditional attention mechanisms that allocate contiguous memory, PagedAttention manages K/V cache in pages.
This approach is similar to how operating systems manage virtual memory. As a result, vLLM significantly reduces inefficient memory allocation, leading to higher throughput and impressive VRAM savings. When I applied it in a production environment, the results were very stable. It helped reduce infrastructure costs and significantly improved user experience.
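To make the analogy concrete, here is a toy sketch of paged KV-cache allocation. This is purely illustrative, not vLLM's actual code: the `PagedKVCache` class, its methods, and the block size of 16 are made up for the example. The key idea is that each sequence's logical blocks map to physical blocks through a block table, and memory is allocated one fixed-size block at a time instead of reserving the maximum sequence length up front.

```python
# Toy illustration of paged KV-cache allocation (not vLLM's real code).
# Each sequence gets physical blocks on demand; a block table maps
# logical block indices to physical block ids, like OS page tables.

BLOCK_SIZE = 16  # tokens per block (vLLM uses a similar fixed block size)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new physical block only when the current one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def release(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the free pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=8)
for i in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    cache.append_token("seq-0", i)
print(len(cache.block_tables["seq-0"]))  # 2
```

Because blocks are returned to the pool the moment a sequence finishes, almost no memory sits idle as padding, which is where the VRAM savings come from.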
So, what are vLLM’s advantages over other solutions?
- Higher throughput: PagedAttention helps vLLM handle many concurrent requests, even with varying prompt and output lengths.
- Lower latency: Continuous batching allows immediate request processing without waiting for a full batch.
- VRAM savings: Efficient K/V cache management maximizes GPU memory utilization.
- Easy to use: Supports many popular Hugging Face models and has a user-friendly API.
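The continuous-batching advantage is easy to see in a toy simulation (illustrative only; real schedulers are far more sophisticated). With static batching, a batch occupies the GPU until its longest member finishes; with continuous batching, a freed slot is refilled immediately with a waiting request:

```python
# Toy simulation contrasting static vs continuous batching (illustrative only).
# Each request needs `length` decode steps; the "GPU" runs up to 2 at once.

def static_batching_steps(lengths: list[int], batch: int = 2) -> int:
    # A batch runs until its *longest* member finishes; stragglers idle a slot.
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])
    return steps

def continuous_batching_steps(lengths: list[int], batch: int = 2) -> int:
    # A finished sequence's slot is refilled immediately with a waiting request.
    pending = list(lengths)
    running: list[int] = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch:
            running.append(pending.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [10, 2, 2, 2, 2, 10]
print(static_batching_steps(lengths))      # 22 steps: 10 + 2 + 10
print(continuous_batching_steps(lengths))  # 18 steps with the same work
```

The mixed long/short workload is exactly the case the advantages above describe: the more the request lengths vary, the bigger the gap grows.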
In this article, I will share my experience deploying vLLM on Linux. I hope you can also optimize LLM inference performance for your projects.
Installing vLLM in a Linux Environment
To get started with vLLM, you need a Linux system with an NVIDIA GPU and CUDA drivers installed. This article assumes you already have the CUDA toolkit. If not, please install it according to the instructions on the NVIDIA website.
1. Prepare your Python environment
It is recommended to use a Python virtual environment to avoid library conflicts.
# Create a virtual environment
python3 -m venv vllm_env
# Activate the environment
source vllm_env/bin/activate
2. Install vLLM
You can easily install vLLM via pip, with two main options: from PyPI (stable version) or from source (latest version, but may not be fully stable).
Install from PyPI (Recommended)
pip install vllm
If you want to use the latest features or need customization, you can install from source. However, for most cases, the PyPI version is sufficient.
Install from source (Optional)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
After installation, you can check that vLLM is ready by running a short Python snippet:
import vllm
print(vllm.__version__)
If there are no errors and the vLLM version is displayed, you have successfully installed it!
Detailed Configuration and Running the vLLM Server
After installation, you need to start vLLM as an API server so that other applications can call it for inference over HTTP. vLLM provides a convenient entrypoint for this.
1. Start a basic API server
You can start a vLLM server with just a few basic parameters. For example, to run a Llama 2 7B model on your GPU:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 --host 0.0.0.0
Explanation of parameters:
- --model meta-llama/Llama-2-7b-hf: Specifies the model name from the Hugging Face Hub. vLLM will automatically download this model if it's not already present.
- --port 8000 and --host 0.0.0.0: Set the port and address where the API server will listen.
When the server starts up, I closely monitor the startup logs and nvidia-smi to see how much VRAM the model occupies and whether it fits comfortably on the current GPU.
2. Important Configuration Parameters
For further optimization, vLLM provides many parameters you should be familiar with:
--tensor-parallel-size (or -tp)
This parameter is very useful when you have multiple GPUs. You can partition the model across them to speed up inference or run large models that don’t fit on a single GPU. For example, if you have 2 GPUs:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-13b-hf \
--tensor-parallel-size 2 \
--port 8000 --host 0.0.0.0
At this point, the Llama 2 13B model will be evenly distributed across 2 GPUs, with each GPU handling a portion of the computations. This significantly speeds up inference, especially with ultra-large models.
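Conceptually, tensor parallelism splits each weight matrix across GPUs so that every rank computes a partial result, which is then gathered. A minimal pure-Python sketch of column-parallel sharding (illustrative only, not vLLM's implementation; the helper names are made up for the example):

```python
# Column-parallel sharding sketch: each "GPU" holds a slice of the weight's
# columns, computes a partial output, and the partials are concatenated.

def matmul(x, w):
    # x: vector of length n, w: n x m matrix -> vector of length m
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def shard_columns(w, tp: int):
    m = len(w[0])
    assert m % tp == 0, "columns must divide evenly across ranks"
    step = m // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = shard_columns(w, tp=2)           # each "GPU" holds half the columns
partials = [matmul(x, s) for s in shards] # computed independently per rank
merged = partials[0] + partials[1]        # the all-gather step
print(merged == matmul(x, w))             # True
```

In the real system each rank also holds only its slice of the weights in VRAM, which is why tensor parallelism lets a model that overflows one GPU fit across several.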
--gpu-memory-utilization
This parameter is crucial for controlling the amount of VRAM vLLM is allowed to use. The default is 0.9 (90%). If you need to reserve VRAM for other tasks or run multiple applications on the same GPU, you can reduce this value. For example, to use only 80% of VRAM:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8000 --host 0.0.0.0
I usually adjust this parameter based on actual VRAM monitoring results during testing. Do not set it too low, as this can affect vLLM’s batching capabilities and reduce throughput.
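To pick a sensible value, I find back-of-envelope arithmetic helpful: the reserved budget must hold the model weights, and most of what remains becomes KV cache. A rough estimator (the Llama-2-7B shape numbers below are approximations, and real overheads like activations are ignored):

```python
# Back-of-envelope VRAM budgeting (rough numbers, for intuition only).
# vLLM reserves roughly total_vram * gpu_memory_utilization, loads the
# weights, and dedicates most of the remainder to the paged KV cache.

def kv_cache_tokens(total_gib: float, util: float, weights_gib: float,
                    layers: int, kv_heads: int, head_dim: int,
                    bytes_per_param: int = 2) -> int:
    budget_bytes = (total_gib * util - weights_gib) * 1024**3
    # Per token: 2 (K and V) * layers * kv_heads * head_dim * dtype size.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_param
    return int(budget_bytes // per_token)

# Approximate Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16.
tokens = kv_cache_tokens(total_gib=24, util=0.8, weights_gib=13.5,
                         layers=32, kv_heads=32, head_dim=128)
print(tokens)  # roughly how many cached tokens fit alongside the weights
```

Arithmetic like this explains the warning above: lowering the utilization by a few percent removes gigabytes from the KV cache, which directly shrinks how many sequences can be batched.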
--dtype
This is the data type of the model (e.g., float16, bfloat16, float32). Using float16 or bfloat16 helps save VRAM and speed up computation, usually without significantly affecting accuracy.
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--dtype bfloat16 \
--port 8000 --host 0.0.0.0
--max-model-len
This parameter specifies the maximum length of the sequence (prompt + output) that the model can process. If you know in advance that requests will not exceed a certain length, setting this parameter will help vLLM optimize memory better. However, do not set it too small, otherwise it will truncate long responses.
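For example, to cap sequences at 4096 tokens (an illustrative value; choose one that covers your longest expected prompt plus output):

```shell
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--max-model-len 4096 \
--port 8000 --host 0.0.0.0
```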
Quantization Configuration
To save maximum VRAM, you can use quantized models like AWQ or GPTQ. vLLM directly supports these models.
# Example with an AWQ model
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--port 8000 --host 0.0.0.0
Quantized models allow you to run much larger models on the same GPU, even on GPUs with less VRAM. Personally, I have successfully run Llama 2 70B AWQ on an RTX 3090 card. This is an impressive result.
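The VRAM savings follow directly from the arithmetic: a 4-bit weight takes a quarter of the space of an fp16 one. A rough calculator (weights only; it ignores KV cache, activations, and quantization metadata such as scales and zero points):

```python
# Rough weight-memory arithmetic for quantization (weights only; real usage
# also includes KV cache, activations, and quantization overhead).

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_gib(7, 16), 1))  # 7B model in fp16      -> ~13.0 GiB
print(round(weight_gib(7, 4), 1))   # same model at 4-bit   -> ~3.3 GiB
print(round(weight_gib(13, 4), 1))  # 13B model at 4-bit    -> ~6.1 GiB
```

This is why a 13B model quantized to 4 bits can fit where an unquantized 7B model barely does.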
3. Calling the API from a Client
Once the vLLM server is running, you can send inference requests. This API is compatible with the OpenAI API, offering great convenience.
Using cURL
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "Write a short paragraph about the benefits of learning to code.",
"max_tokens": 100,
"temperature": 0.7
}'
Using a Python Client
With Python, you can use the openai or httpx library for easy interaction:
import openai
client = openai.OpenAI(
api_key="EMPTY", # vLLM does not require an API key
base_url="http://localhost:8000/v1"
)
response = client.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
prompt="Write a short story about a detective cat.",
max_tokens=150,
temperature=0.8,
)
print(response.choices[0].text)
You can also use the /v1/chat/completions endpoint if your model is a chat model:
import openai
client = openai.OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Write a short story about a detective cat."}
],
max_tokens=150,
temperature=0.8,
)
print(response.choices[0].message.content)
Using the standard OpenAI API makes it easy for me to switch applications from OpenAI to vLLM without much code modification.
Performance Testing and Monitoring
Once the vLLM server is running stably, performance testing and monitoring are indispensable steps. They ensure the system operates as expected.
1. Test throughput and latency
vLLM ships benchmark scripts in the benchmarks/ directory of its source repository, so you will need a clone of the repo (see the optional install-from-source step above) to run them. They make it easy to measure throughput and compare performance before and after optimization. For example:
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-2-7b-hf \
--dataset <path-to-ShareGPT_V3_unfiltered_cleaned_split.json> \
--num-prompts 1000
The --dataset option points at the ShareGPT dataset, a collection of real conversation prompts, to simulate realistic load, and --num-prompts controls how many requests are sent. The script reports metrics such as prompt throughput and generation throughput (tokens/s); the companion benchmark_latency.py script in the same directory measures per-request latency.
Additionally, you can write your own Python script to send multiple concurrent requests and measure the average response time. Or, use stress testing tools like ApacheBench (ab) or Locust.
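Here is a minimal sketch of such a script using only the standard library. The endpoint, model name, and prompt are assumptions matching the server examples above, and the `percentile` helper uses the simple nearest-rank method; a real load test would also ramp concurrency and warm up the server first.

```python
# Minimal concurrent load-test sketch (assumes a vLLM server on localhost:8000).
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # OpenAI-compatible endpoint

def one_request(prompt: str) -> float:
    payload = json.dumps({
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start  # end-to-end latency in seconds

def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile: a small helper for summarizing latencies.
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def run_load_test(num_requests: int = 100, workers: int = 16) -> None:
    # Requires the vLLM server from the previous sections to be running.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(one_request, ["Hello"] * num_requests))
    print(f"p50={percentile(latencies, 50):.3f}s  "
          f"p95={percentile(latencies, 95):.3f}s")

# run_load_test()  # uncomment with the server running
```

Watching p95 rather than the average is usually more informative, since continuous batching keeps the median low even when the tail starts to degrade under load.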
2. Monitor VRAM and system resources
When the vLLM server is running and processing requests, you should regularly monitor the GPU’s VRAM usage. The nvidia-smi command is a suitable choice for this:
nvidia-smi
You will see information about total VRAM, used VRAM, and processes occupying VRAM. Pay attention to how much VRAM vLLM uses, and whether it exceeds the --gpu-memory-utilization limit you set. If VRAM is overloaded, you may need to reduce --gpu-memory-utilization, switch to a smaller model, or use a quantized model.
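If you want this in a monitoring script rather than a terminal, nvidia-smi's CSV query mode is convenient. A small sketch (the query flags are standard nvidia-smi options; `parse_memory_csv` is a helper written for this example):

```python
# Programmatic VRAM check via nvidia-smi's CSV query mode.
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    # One "used, total" line per GPU; values are in MiB with nounits.
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(v.strip()) for v in line.split(","))
        rows.append((used, total))
    return rows

def vram_usage() -> list[tuple[int, int]]:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(out.stdout)

# On a machine with one busy 24 GiB GPU, vram_usage() might return
# something like [(21504, 24576)], i.e. ~21 GiB of 24 GiB in use.
```

Polling this every few seconds and logging the result makes it easy to spot whether usage creeps past the --gpu-memory-utilization budget under real load.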
Don’t forget to monitor your system’s CPU and RAM. Although vLLM optimizes the GPU, some preprocessing or post-processing tasks still require the CPU. The model also needs a certain amount of RAM to load.
3. Optimize based on Monitoring Results
- If throughput is low:
  - Recheck --gpu-memory-utilization. If it's too low, vLLM might not be able to batch many requests.
  - Consider increasing --max-num-seqs (the maximum number of concurrent sequences) if you have enough VRAM.
  - Ensure you are using --dtype float16 or bfloat16.
  - If you have multiple GPUs, try increasing --tensor-parallel-size.
- If VRAM is full:
  - Reduce --gpu-memory-utilization.
  - Use quantized models (AWQ, GPTQ).
  - Switch to a smaller model.
  - If running multi-GPU, ensure --tensor-parallel-size is correctly configured.
- If latency is high:
  - Check the load on the server. If there are too many requests, the server might be overloaded.
  - Optimize prompt and output length.
I find this to be a continuous loop: configure → test → monitor → fine-tune. There is no “perfect” configuration for all cases. It highly depends on your model, GPU type, and anticipated load.
Conclusion
Efficient LLM inference deployment is a decisive factor for the success of modern AI applications. vLLM, thanks to PagedAttention and continuous batching, has proven to be an effective solution for achieving high throughput and VRAM savings.
Through the shared practical experience, I hope you now have a clearer understanding of how to install, configure, and optimize vLLM on Linux. Don’t hesitate to experiment with different parameters to find the optimal configuration for your system. Good luck!
