BentoML: Packaging and Deploying AI/ML Models as Production-Ready REST APIs on Linux – ITFROMZERO

After training an ML model, the next question is always: how do you get it into production? I’ve been through plenty of situations where I had to write Flask APIs from scratch to wrap a model, only to run into a whole list of problems: no versioning, hard to scale, messy logs. When our team tried BentoML, it completely changed how we think about ML serving.

The Big Picture: Ways to Deploy ML Models

In practice, there are several ways to expose an ML model as an API. I break them down into 4 main groups:

Write your own API (Flask/FastAPI): the most flexible option, but you handle everything from scratch
BentoML: a framework purpose-built for ML serving, with many production features built in
TorchServe: dedicated to PyTorch, great if you’re all-in on PyTorch
Triton Inference Server (NVIDIA): extremely high performance, but complex to configure and aimed primarily at GPU deployments

Detailed Comparison of Each Approach

Flask/FastAPI DIY

This is the most common approach since everyone knows Flask/FastAPI. You load the model, write an endpoint, and you’re done. But the real problems lie elsewhere:

No model versioning mechanism — updating a model means redeploying the entire service
Batching must be implemented manually — significantly impacts throughput under high load
Health checks and monitoring need to be written separately
No standard for packaging a model together with its dependencies

I once maintained a Flask service like this for 6 months. Every time a data scientist updated the model, the whole team had to coordinate manually — copy files, restart the service, check logs. Very time-consuming and prone to incidents.

TorchServe

If your team is PyTorch-only and wants the official solution from Meta, TorchServe is the choice. But it’s fairly opinionated: it serves PyTorch models only, it is configured through a Java-style config.properties file plus .mar model archives (which feels fairly dated), and the documentation is missing coverage for many real-world edge cases.

Triton Inference Server

When you need to squeeze every ounce of GPU performance — dynamic batching, multiple backends (TensorRT, ONNX, TF, PyTorch) — Triton is worth the setup effort. The problem is the steep learning curve, complex configuration, and if you don’t have a GPU cluster, it’s clearly over-engineering.

BentoML

BentoML sits in the sweet spot between “write your own Flask” and “complex Triton”. It’s framework-agnostic (supports sklearn, PyTorch, TF, XGBoost, ONNX…) and has built-in versioning, adaptive batching, and automatic Docker export. By my rough estimate, it covers about 80% of production use cases without complex setup.

Why Choose BentoML for Production

After comparing the options, my DevOps team settled on BentoML for 3 specific reasons:

Built-in model registry — save/load models with versioning, no need for separate DVC or MLflow just for this step
Bento = complete packaged artifact — model + code + dependencies bundled into one artifact, deployable anywhere
Automatic Docker image generation — one command produces a production-ready Docker image, no need to write complex Dockerfiles

The least-mentioned benefit is actually the most important for our team: no more “where’s the new model, what Python version, what dependencies does it need?” calls. The data scientist saves the model with a tag, DevOps takes that tag and builds Docker. This standardized workflow saves our team about 2 hours per new model release.

BentoML Deployment Guide on Linux

1. Installation

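# The examples below use the BentoML 1.0/1.1 Service + runner API.
# Releases from 1.2 onward favor a different, class-based service API, so pin the version.
pip install "bentoml>=1.0,<1.2"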

# Verify
bentoml --version

2. Save the Model to BentoML Model Store

The first step is adding your trained model to the BentoML registry. Example with scikit-learn:

import bentoml
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Train model (in practice, you already have your model)
iris = load_iris()
clf = RandomForestClassifier(n_estimators=100)
clf.fit(iris.data, iris.target)

# Save to BentoML model store
saved_model = bentoml.sklearn.save_model(
    "iris_classifier",
    clf,
    signatures={"predict": {"batchable": True}},
    metadata={"accuracy": 0.97, "dataset": "iris"}
)

print(f"Model saved: {saved_model.tag}")
# Example output (the version part is generated and will differ):
# Model saved: iris_classifier:3mxqpfzbs6tpjuqj

Models are stored under ~/bentoml/models/ by default, and every save gets an automatically generated version tag, so rolling back only requires pointing the service at an earlier tag instead of latest.
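To make the rollback step more concrete, here is a minimal sketch that lists the versions stored for one model name, newest first. It assumes the default store layout of one sub-directory per version under ~/bentoml/models/<name>/, which is an implementation detail and may change between releases. `list_versions` is a hypothetical helper, not part of BentoML; the `bentoml models list` command is the supported way to get the same information.

```python
from pathlib import Path


def list_versions(
    model_name: str,
    store: Path = Path.home() / "bentoml" / "models",
) -> list[str]:
    """Return 'name:version' tags for one model, newest first.

    Assumption: each saved version lives in its own sub-directory of
    <store>/<model_name>/. Plain files in that directory are ignored.
    """
    model_dir = store / model_name
    if not model_dir.is_dir():
        return []
    versions = [p for p in model_dir.iterdir() if p.is_dir()]
    # Sort by modification time so the most recent save comes first.
    versions.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [f"{model_name}:{p.name}" for p in versions]
```

The second entry in the returned list is the natural rollback candidate: put that full tag in place of `iris_classifier:latest` when loading the model.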

3. Create a BentoML Service

The core part — defining the API endpoint and processing logic:

# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

# Load model from registry
iris_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner()

svc = bentoml.Service("iris_classifier_service", runners=[iris_runner])

IRIS_CLASSES = ["setosa", "versicolor", "virginica"]

@svc.api(input=NumpyNdarray(shape=(-1, 4), dtype=np.float32), output=JSON())
async def predict(input_data: np.ndarray):
    batch_pred = await iris_runner.predict.async_run(input_data)
    result = [IRIS_CLASSES[i] for i in batch_pred]
    return {"predictions": result}
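The input descriptor above only accepts a 2-D array with four features per row, and the handler maps class indices back to names. The sketch below mirrors that contract in plain Python so it can be reused in a client or a unit test. `validate_rows` and `to_labels` are hypothetical helpers written for this article, not BentoML functions.

```python
from typing import Sequence

IRIS_CLASSES = ["setosa", "versicolor", "virginica"]
N_FEATURES = 4  # matches shape=(-1, 4) in the service definition


def validate_rows(rows: Sequence[Sequence[float]]) -> list[list[float]]:
    """Coerce every value to float and check each row has N_FEATURES values."""
    cleaned = []
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(
                f"row {i} has {len(row)} values, expected {N_FEATURES}"
            )
        cleaned.append([float(v) for v in row])
    return cleaned


def to_labels(class_indices: Sequence[int]) -> list[str]:
    """Map predicted class indices to class names, as the service does."""
    return [IRIS_CLASSES[int(i)] for i in class_indices]
```

Checking the shape on the client side gives a clearer error message than waiting for the server to reject the request.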

4. Test Locally

# Run development server
bentoml serve service:svc --reload

# Test with curl (in another terminal)
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '[[5.1, 3.5, 1.4, 0.2]]'

# Response:
# {"predictions": ["setosa"]}
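The same request can be sent from Python, which is handy for integration tests. This is a minimal sketch using only the standard library; the URL and port come from the curl call above, and `build_predict_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(rows, base_url="http://localhost:3000"):
    """Build the same POST request as the curl command above."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, base_url="http://localhost:3000", timeout=5.0):
    """Send the request and decode the JSON response body."""
    request = build_predict_request(rows, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the development server running, calling `predict([[5.1, 3.5, 1.4, 0.2]])` should return the same JSON document as the curl command.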

5. Build the Bento Artifact

Create a bentofile.yaml file describing all dependencies:

service: "service:svc"
labels:
  owner: devops-team
  project: iris-api
include:
  - "service.py"
python:
  packages:
    - scikit-learn
    - numpy

bentoml build

# Output:
# Successfully built Bento(tag="iris_classifier_service:7a3bk2...")
# Bento size: 15.2 MB

6. Deploy to Production with Docker

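# Containerize Bento into a Docker image.
# Without -t the image is tagged with the Bento's generated version, not "latest",
# so name it explicitly for the docker commands below.
bentoml containerize iris_classifier_service:latest -t iris_classifier_service:latest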

# Tag and push to registry
docker tag iris_classifier_service:latest your-registry.com/iris-api:v1
docker push your-registry.com/iris-api:v1

# Run container
docker run -p 3000:3000 iris_classifier_service:latest serve

7. Deploy Directly on Linux with systemd

If you’re not using Docker, create a systemd service to manage the process:

# /etc/systemd/system/iris-api.service
[Unit]
Description=BentoML Iris Classifier API
After=network.target

[Service]
User=www-data
WorkingDirectory=/opt/iris-api
ExecStart=/opt/iris-api/venv/bin/bentoml serve iris_classifier_service:latest \
  --host 0.0.0.0 \
  --port 3000 \
  --api-workers 4
Restart=always
RestartSec=5
Environment=BENTOML_HOME=/opt/iris-api/bentoml

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable iris-api
systemctl start iris-api
systemctl status iris-api

Things to Keep in Mind When Running in Production

Adaptive Batching

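BentoML has an adaptive batching mechanism: it automatically groups multiple small requests into a batch to increase throughput. It only applies to model signatures saved with batchable set to True, as in step 2. Tune it in a BentoML configuration file (for example bentoml_configuration.yaml, passed through the BENTOML_CONFIG environment variable) rather than in bentofile.yaml. Settings are keyed by runner name, which defaults to the model name:

runners:
  iris_classifier:
    batching:
      enabled: true
      max_batch_size: 100
      max_latency_ms: 15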
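To show what the two limits mean, here is a simplified micro-batching sketch in asyncio: a batch is flushed when it reaches the size limit or when the latency budget since its first item runs out. This is only an illustration of the idea. BentoML's own dispatcher is adaptive and more sophisticated, and `MicroBatcher` is a hypothetical class, not a BentoML API.

```python
import asyncio


class MicroBatcher:
    """Groups single items and runs them through `batch_fn` together.

    A batch is flushed when it holds `max_batch_size` items or when
    `max_latency_ms` has passed since its first item arrived.
    Create the instance inside a running event loop.
    """

    def __init__(self, batch_fn, max_batch_size=100, max_latency_ms=15):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._max_wait = max_latency_ms / 1000
        self._queue = asyncio.Queue()

    async def submit(self, item):
        """Queue one item and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: collect a batch, call batch_fn, hand out results."""
        loop = asyncio.get_running_loop()
        while True:
            item, future = await self._queue.get()
            items, futures = [item], [future]
            deadline = loop.time() + self._max_wait
            while len(items) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item, future = await asyncio.wait_for(
                        self._queue.get(), remaining
                    )
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futures.append(future)
            try:
                results = self._batch_fn(items)
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for fut in futures:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for fut, result in zip(futures, results):
                if not fut.done():
                    fut.set_result(result)
```

The trade-off is visible in the two parameters: a larger batch size raises throughput, while the latency budget caps how long the first request in a batch can wait for company.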

Monitoring with Prometheus

BentoML exposes Prometheus metrics at /metrics out of the box. Just add it to your Prometheus config and you’re good to go:

scrape_configs:
  - job_name: 'bentoml'
    static_configs:
      - targets: ['your-server:3000']
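Before wiring up Prometheus, it can help to look at the endpoint directly. The sketch below fetches /metrics and parses the Prometheus text format into a dictionary. It assumes samples carry no trailing timestamp, and `parse_metrics` and `fetch_metrics` are hypothetical helpers for quick inspection, not a replacement for a real Prometheus client.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. The key
    keeps the label set verbatim, e.g. 'name{label="x"}'.
    Assumption: no timestamp follows the sample value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split from the right so spaces inside label values are preserved.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def fetch_metrics(base_url="http://localhost:3000", timeout=5.0):
    """Download and parse the /metrics endpoint of a running service."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Printing the keys of `fetch_metrics()` against a running service shows which series are actually exposed, which is useful when writing alert rules.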

Health Check Endpoint

Built-in at /healthz and /readyz — works immediately with Kubernetes liveness/readiness probes without writing any additional code.
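Outside Kubernetes, the same endpoint is useful in deploy scripts: wait for /readyz to answer before sending traffic. This is a minimal sketch with the standard library; the default URL assumes the port used in this guide, and `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(
    url: str = "http://localhost:3000/readyz",
    attempts: int = 30,
    delay_s: float = 1.0,
    probe=http_ok,
    sleep=time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

The `probe` and `sleep` parameters exist only so the loop can be tested without a network or real delays; in a deploy script, calling `wait_until_ready()` with the defaults is enough.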

Conclusion

I’ve migrated 3 ML services from hand-written Flask to BentoML. The time to get a new model into production — from the moment the data scientist hands it off to when the API is live — dropped from around 3 hours to 20–30 minutes, mostly just Docker image build time. The model registry also makes rollbacks straightforward when a new model runs into issues.

BentoML isn’t a silver bullet. If you need to maximize GPU performance with TensorRT, Triton is still the better choice. But for most ML/DevOps teams, BentoML strikes a practical balance between deployment speed and production stability. Give it a try right now with pip install bentoml scikit-learn and follow the steps above.