Context & Why Optimize LLM Inference?
Large Language Models (LLMs) like Llama, Mixtral, and GPT have become ubiquitous. They are widely used in virtual assistants, content creation, and data analysis, opening up many possibilities. However, deploying LLMs into production environments is challenging, especially regarding performance.
When deploying LLM inference, the two biggest challenges are latency and throughput. For a good user experience, the model needs to respond quickly. The system must also handle many requests simultaneously, which helps optimize infrastructure costs. Additionally, LLMs often consume a lot of VRAM. This makes it difficult to run multiple models or process large batches on common GPUs.
In the past, I also struggled to optimize LLM inference in production environments, especially with large models and high request volumes. How to simultaneously reduce latency, increase concurrent processing capabilities, and efficiently utilize GPU resources was a difficult problem. Traditional solutions like static batching or kernel optimization often only addressed a small part of the issue.
After much research and experimentation, vLLM finally proved to be exceptionally effective. vLLM is a highly efficient library designed to accelerate LLM inference using the PagedAttention algorithm. Unlike traditional attention mechanisms that allocate contiguous memory, PagedAttention manages K/V cache in pages.
This approach is similar to how operating systems manage virtual memory. As a result, vLLM significantly reduces inefficient memory allocation, leading to higher throughput and impressive VRAM savings. When I applied it in a production environment, the results were very stable. It helped reduce infrastructure costs and significantly improved user experience.
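To make the analogy concrete, here is a toy sketch of paged KV-cache allocation. This is purely illustrative, not vLLM's actual code: the `PagedKVCache` class, its methods, and the block size of 16 are made up for the example. The key idea is that each sequence's logical blocks map to physical blocks through a block table, and memory is allocated one fixed-size block at a time instead of reserving the maximum sequence length up front.

```python
# Toy illustration of paged KV-cache allocation (not vLLM's real code).
# Each sequence gets physical blocks on demand; a block table maps
# logical block indices to physical block ids, like OS page tables.

BLOCK_SIZE = 16  # tokens per block (vLLM uses a similar fixed block size)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new physical block only when the current one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def release(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the free pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=8)
for i in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    cache.append_token("seq-0", i)
print(len(cache.block_tables["seq-0"]))  # 2
```

Because blocks are returned to the pool the moment a sequence finishes, almost no memory sits idle as padding, which is where the VRAM savings come from.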
So, what are vLLM’s advantages over other solutions?
- Higher throughput: PagedAttention helps vLLM handle many concurrent requests, even with varying prompt and output lengths.
- Lower latency: Continuous batching allows immediate request processing without waiting for a full batch.
- VRAM savings: Efficient K/V cache management maximizes GPU memory utilization.
- Easy to use: Supports many popular Hugging Face models and has a user-friendly API.
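The continuous-batching advantage is easy to see in a toy simulation (illustrative only; real schedulers are far more sophisticated). With static batching, a batch occupies the GPU until its longest member finishes; with continuous batching, a freed slot is refilled immediately with a waiting request:

```python
# Toy simulation contrasting static vs continuous batching (illustrative only).
# Each request needs `length` decode steps; the "GPU" runs up to 2 at once.

def static_batching_steps(lengths: list[int], batch: int = 2) -> int:
    # A batch runs until its *longest* member finishes; stragglers idle a slot.
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])
    return steps

def continuous_batching_steps(lengths: list[int], batch: int = 2) -> int:
    # A finished sequence's slot is refilled immediately with a waiting request.
    pending = list(lengths)
    running: list[int] = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch:
            running.append(pending.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [10, 2, 2, 2, 2, 10]
print(static_batching_steps(lengths))      # 22 steps: 10 + 2 + 10
print(continuous_batching_steps(lengths))  # 18 steps with the same work
```

The mixed long/short workload is exactly the case the advantages above describe: the more the request lengths vary, the bigger the gap grows.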
In this article, I will share my experience deploying vLLM on Linux. I hope you can also optimize LLM inference performance for your projects.
Installing vLLM in a Linux Environment
To get started with vLLM, you need a Linux system with an NVIDIA GPU and CUDA drivers installed. This article assumes you already have the CUDA toolkit. If not, please install it according to the instructions on the NVIDIA website.
1. Prepare your Python environment
It is recommended to use a Python virtual environment to avoid library conflicts.
# Create a virtual environment
python3 -m venv vllm_env
# Activate the environment
source vllm_env/bin/activate
2. Install vLLM
You can easily install vLLM via pip, with two main options: from PyPI (stable version) or from source (latest version, but may not be fully stable).
Install from PyPI (Recommended)
pip install vllm
If you want to use the latest features or need customization, you can install from source. However, for most cases, the PyPI version is sufficient.
Install from source (Optional)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
After installation, you can check that vLLM is ready by running a short Python snippet:
import vllm
print(vllm.__version__)
If there are no errors and the vLLM version is displayed, you have successfully installed it!
Detailed Configuration and Running the vLLM Server
After installation, you need to start vLLM as an API server so that other applications can call it for inference over HTTP. vLLM provides a convenient entrypoint for this.
1. Start a basic API server
You can start a vLLM server with just a few basic parameters. For example, to run a Llama 2 7B model on your GPU:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 --host 0.0.0.0
Explanation of parameters:
- --model meta-llama/Llama-2-7b-hf: Specifies the model name from the Hugging Face Hub. vLLM will automatically download this model if it's not already present.
- --port 8000 and --host 0.0.0.0: Set the port and address where the API server will listen.
When the server starts up, I closely monitor the startup logs and nvidia-smi to see how much VRAM the model occupies and whether it fits comfortably on the current GPU.
2. Important Configuration Parameters
For further optimization, vLLM provides many parameters you should be familiar with:
--tensor-parallel-size (or -tp)
This parameter is very useful when you have multiple GPUs. You can partition the model across them to speed up inference or run large models that don’t fit on a single GPU. For example, if you have 2 GPUs:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-13b-hf \
--tensor-parallel-size 2 \
--port 8000 --host 0.0.0.0
At this point, the Llama 2 13B model will be evenly distributed across 2 GPUs, with each GPU handling a portion of the computations. This significantly speeds up inference, especially with ultra-large models.
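Conceptually, tensor parallelism splits each weight matrix across GPUs so that every rank computes a partial result, which is then gathered. A minimal pure-Python sketch of column-parallel sharding (illustrative only, not vLLM's implementation; the helper names are made up for the example):

```python
# Column-parallel sharding sketch: each "GPU" holds a slice of the weight's
# columns, computes a partial output, and the partials are concatenated.

def matmul(x, w):
    # x: vector of length n, w: n x m matrix -> vector of length m
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def shard_columns(w, tp: int):
    m = len(w[0])
    assert m % tp == 0, "columns must divide evenly across ranks"
    step = m // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = shard_columns(w, tp=2)           # each "GPU" holds half the columns
partials = [matmul(x, s) for s in shards] # computed independently per rank
merged = partials[0] + partials[1]        # the all-gather step
print(merged == matmul(x, w))             # True
```

In the real system each rank also holds only its slice of the weights in VRAM, which is why tensor parallelism lets a model that overflows one GPU fit across several.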
--gpu-memory-utilization
This parameter is crucial for controlling the amount of VRAM vLLM is allowed to use. The default is 0.9 (90%). If you need to reserve VRAM for other tasks or run multiple applications on the same GPU, you can reduce this value. For example, to use only 80% of VRAM:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8000 --host 0.0.0.0
I usually adjust this parameter based on actual VRAM monitoring results during testing. Do not set it too low, as this can affect vLLM’s batching capabilities and reduce throughput.
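To pick a sensible value, I find back-of-envelope arithmetic helpful: the reserved budget must hold the model weights, and most of what remains becomes KV cache. A rough estimator (the Llama-2-7B shape numbers below are approximations, and real overheads like activations are ignored):

```python
# Back-of-envelope VRAM budgeting (rough numbers, for intuition only).
# vLLM reserves roughly total_vram * gpu_memory_utilization, loads the
# weights, and dedicates most of the remainder to the paged KV cache.

def kv_cache_tokens(total_gib: float, util: float, weights_gib: float,
                    layers: int, kv_heads: int, head_dim: int,
                    bytes_per_param: int = 2) -> int:
    budget_bytes = (total_gib * util - weights_gib) * 1024**3
    # Per token: 2 (K and V) * layers * kv_heads * head_dim * dtype size.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_param
    return int(budget_bytes // per_token)

# Approximate Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16.
tokens = kv_cache_tokens(total_gib=24, util=0.8, weights_gib=13.5,
                         layers=32, kv_heads=32, head_dim=128)
print(tokens)  # roughly how many cached tokens fit alongside the weights
```

Arithmetic like this explains the warning above: lowering the utilization by a few percent removes gigabytes from the KV cache, which directly shrinks how many sequences can be batched.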
--dtype
This is the data type of the model (e.g., float16, bfloat16, float32). Using float16 or bfloat16 helps save VRAM and speed up computation, usually without significantly affecting accuracy.
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--dtype bfloat16 \
--port 8000 --host 0.0.0.0
--max-model-len
This parameter specifies the maximum length of the sequence (prompt + output) that the model can process. If you know in advance that requests will not exceed a certain length, setting this parameter will help vLLM optimize memory better. However, do not set it too small, otherwise it will truncate long responses.
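For example, to cap sequences at 4096 tokens (an illustrative value; choose one that covers your longest expected prompt plus output):

```shell
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--max-model-len 4096 \
--port 8000 --host 0.0.0.0
```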
Quantization Configuration
To save maximum VRAM, you can use quantized models like AWQ or GPTQ. vLLM directly supports these models.
# Example with an AWQ model
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--port 8000 --host 0.0.0.0
Quantized models allow you to run much larger models on the same GPU, even on GPUs with less VRAM. Personally, I have successfully run Llama 2 70B AWQ on an RTX 3090 card. This is an impressive result.
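The VRAM savings follow directly from the arithmetic: a 4-bit weight takes a quarter of the space of an fp16 one. A rough calculator (weights only; it ignores KV cache, activations, and quantization metadata such as scales and zero points):

```python
# Rough weight-memory arithmetic for quantization (weights only; real usage
# also includes KV cache, activations, and quantization overhead).

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_gib(7, 16), 1))  # 7B model in fp16      -> ~13.0 GiB
print(round(weight_gib(7, 4), 1))   # same model at 4-bit   -> ~3.3 GiB
print(round(weight_gib(13, 4), 1))  # 13B model at 4-bit    -> ~6.1 GiB
```

This is why a 13B model quantized to 4 bits can fit where an unquantized 7B model barely does.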
3. Calling the API from a Client
Once the vLLM server is running, you can send inference requests. This API is compatible with the OpenAI API, offering great convenience.
Using cURL
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "Write a short paragraph about the benefits of learning to code.",
"max_tokens": 100,
"temperature": 0.7
}'
Using a Python Client
With Python, you can use the openai or httpx library for easy interaction:
import openai
client = openai.OpenAI(
api_key="EMPTY", # vLLM does not require an API key
base_url="http://localhost:8000/v1"
)
response = client.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
prompt="Write a short story about a detective cat.",
max_tokens=150,
temperature=0.8,
)
print(response.choices[0].text)
You can also use the /v1/chat/completions endpoint if your model is a chat model:
import openai
client = openai.OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Write a short story about a detective cat."}
],
max_tokens=150,
temperature=0.8,
)
print(response.choices[0].message.content)
Using the standard OpenAI API makes it easy for me to switch applications from OpenAI to vLLM without much code modification.
Performance Testing and Monitoring
Once the vLLM server is running stably, performance testing and monitoring are indispensable steps. They ensure the system operates as expected.
1. Test throughput and latency
vLLM ships benchmark scripts in the benchmarks/ directory of its source repository, so you will need a clone of the repo (see the optional install-from-source step above) to run them. They make it easy to measure throughput and compare performance before and after optimization. For example:
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-2-7b-hf \
--dataset <path-to-ShareGPT_V3_unfiltered_cleaned_split.json> \
--num-prompts 1000
The --dataset option points at the ShareGPT dataset, a collection of real conversation prompts, to simulate realistic load, and --num-prompts controls how many requests are sent. The script reports metrics such as prompt throughput and generation throughput (tokens/s); the companion benchmark_latency.py script in the same directory measures per-request latency.
Additionally, you can write your own Python script to send multiple concurrent requests and measure the average response time. Or, use stress testing tools like ApacheBench (ab) or Locust.
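Here is a minimal sketch of such a script using only the standard library. The endpoint, model name, and prompt are assumptions matching the server examples above, and the `percentile` helper uses the simple nearest-rank method; a real load test would also ramp concurrency and warm up the server first.

```python
# Minimal concurrent load-test sketch (assumes a vLLM server on localhost:8000).
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # OpenAI-compatible endpoint

def one_request(prompt: str) -> float:
    payload = json.dumps({
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start  # end-to-end latency in seconds

def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile: a small helper for summarizing latencies.
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def run_load_test(num_requests: int = 100, workers: int = 16) -> None:
    # Requires the vLLM server from the previous sections to be running.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(one_request, ["Hello"] * num_requests))
    print(f"p50={percentile(latencies, 50):.3f}s  "
          f"p95={percentile(latencies, 95):.3f}s")

# run_load_test()  # uncomment with the server running
```

Watching p95 rather than the average is usually more informative, since continuous batching keeps the median low even when the tail starts to degrade under load.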
2. Monitor VRAM and system resources
When the vLLM server is running and processing requests, you should regularly monitor the GPU’s VRAM usage. The nvidia-smi command is a suitable choice for this:
nvidia-smi
You will see information about total VRAM, used VRAM, and processes occupying VRAM. Pay attention to how much VRAM vLLM uses, and whether it exceeds the --gpu-memory-utilization limit you set. If VRAM is overloaded, you may need to reduce --gpu-memory-utilization, switch to a smaller model, or use a quantized model.
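If you want this in a monitoring script rather than a terminal, nvidia-smi's CSV query mode is convenient. A small sketch (the query flags are standard nvidia-smi options; `parse_memory_csv` is a helper written for this example):

```python
# Programmatic VRAM check via nvidia-smi's CSV query mode.
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    # One "used, total" line per GPU; values are in MiB with nounits.
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(v.strip()) for v in line.split(","))
        rows.append((used, total))
    return rows

def vram_usage() -> list[tuple[int, int]]:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(out.stdout)

# On a machine with one busy 24 GiB GPU, vram_usage() might return
# something like [(21504, 24576)], i.e. ~21 GiB of 24 GiB in use.
```

Polling this every few seconds and logging the result makes it easy to spot whether usage creeps past the --gpu-memory-utilization budget under real load.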
Don’t forget to monitor your system’s CPU and RAM. Although vLLM optimizes the GPU, some preprocessing or post-processing tasks still require the CPU. The model also needs a certain amount of RAM to load.
3. Optimize based on Monitoring Results
- If throughput is low:
  - Recheck --gpu-memory-utilization. If it's too low, vLLM might not be able to batch many requests.
  - Consider increasing --max-num-seqs (the maximum number of concurrent sequences) if you have enough VRAM.
  - Ensure you are using --dtype float16 or bfloat16.
  - If you have multiple GPUs, try increasing --tensor-parallel-size.
- If VRAM is full:
  - Reduce --gpu-memory-utilization.
  - Use quantized models (AWQ, GPTQ).
  - Switch to a smaller model.
  - If running multi-GPU, ensure --tensor-parallel-size is correctly configured.
- If latency is high:
  - Check the load on the server. If there are too many requests, the server might be overloaded.
  - Optimize prompt and output length.
I find this to be a continuous loop: configure → test → monitor → fine-tune. There is no “perfect” configuration for all cases. It highly depends on your model, GPU type, and anticipated load.
Conclusion
Efficient LLM inference deployment is a decisive factor for the success of modern AI applications. vLLM, thanks to PagedAttention and continuous batching, has proven to be an effective solution for achieving high throughput and VRAM savings.
Through the shared practical experience, I hope you now have a clearer understanding of how to install, configure, and optimize vLLM on Linux. Don’t hesitate to experiment with different parameters to find the optimal configuration for your system. Good luck!
