After training an ML model, the next question is always: how do you get it into production? I’ve been through plenty of situations where I had to write Flask APIs from scratch to wrap a model, only to run into a whole list of problems: no versioning, hard to scale, messy logs. When our team tried BentoML, it completely changed how we think about ML serving.
The Big Picture: Ways to Deploy ML Models
In practice, there are several ways to expose an ML model as an API. I break them down into 4 main groups:
- Write your own API (Flask/FastAPI): the most flexible option, but you handle everything from scratch
- BentoML: a framework purpose-built for ML serving, with many production features built in
- TorchServe: dedicated to PyTorch, great if you’re all-in on PyTorch
- Triton Inference Server (NVIDIA): extremely high performance, but complex, and mostly worth it when you have GPUs to saturate
Detailed Comparison of Each Approach
Flask/FastAPI DIY
This is the most common approach since everyone knows Flask/FastAPI. You load the model, write an endpoint, and you’re done. But the real problems lie elsewhere:
- No model versioning mechanism — updating a model means redeploying the entire service
- Batching must be implemented manually — significantly impacts throughput under high load
- Health checks and monitoring need to be written separately
- No standard for packaging a model together with its dependencies
I once maintained a Flask service like this for 6 months. Every time a data scientist updated the model, the whole team had to coordinate manually — copy files, restart the service, check logs. Very time-consuming and prone to incidents.
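To make the "everything from scratch" point concrete, here is a minimal sketch of a DIY service written against the standard library's WSGI interface rather than Flask. The `predict_one` function is a hypothetical stand-in for a real model, and the `/healthz` and `/predict` routes are names chosen for illustration. Routing, input validation, and the health check are all hand-written, while versioning, batching, and packaging are still missing.

```python
import json
from io import BytesIO


def predict_one(features):
    """Hypothetical stand-in for model.predict on a single 4-feature row."""
    return "setosa" if features[2] < 2.5 else "other"


def app(environ, start_response):
    """A tiny WSGI app: every concern below has to be written by hand."""
    path = environ.get("PATH_INFO", "/")
    method = environ.get("REQUEST_METHOD", "GET")

    def reply(status, payload):
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]

    # Hand-written health check
    if path == "/healthz" and method == "GET":
        return reply("200 OK", {"status": "ok"})

    # Hand-written prediction endpoint with manual input validation
    if path == "/predict" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        try:
            rows = json.loads(environ["wsgi.input"].read(length))
            predictions = [predict_one(row) for row in rows]
        except (ValueError, TypeError, IndexError, KeyError):
            return reply("400 Bad Request", {"error": "expected a JSON list of rows"})
        return reply("200 OK", {"predictions": predictions})

    return reply("404 Not Found", {"error": "unknown route"})


def call(method, path, body=b""):
    """Invoke the app in-process (no network) and return (status, parsed JSON)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": BytesIO(body),
    }
    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


print(call("POST", "/predict", b"[[5.1, 3.5, 1.4, 0.2]]"))
```

To serve it for real, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough. The gap between this and a production service is exactly the list above.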
TorchServe
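If your team is PyTorch-only and wants the serving solution that comes out of the PyTorch project itself, TorchServe is the natural choice. But it’s fairly opinionated: it serves PyTorch models only, it is configured through a Java-style config.properties file plus packaged .mar model archives (which feels fairly dated), and the documentation leaves many real-world edge cases uncovered.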
Triton Inference Server
When you need to squeeze every ounce of GPU performance, with dynamic batching and multiple backends (TensorRT, ONNX, TF, PyTorch), Triton is worth the setup effort. The problem is the steep learning curve and complex configuration. Triton can run without GPUs, but if you are not serving on GPUs at scale, it’s clearly over-engineering.
BentoML
BentoML sits at the sweet spot between “write your own Flask” and “complex Triton”. It’s framework-agnostic (supports sklearn, PyTorch, TF, XGBoost, ONNX…), and it has built-in versioning, adaptive batching, and automatic Docker image generation. In our experience, it covers the large majority of production use cases without complex setup.
Why Choose BentoML for Production
After comparing the options, my DevOps team settled on BentoML for 3 specific reasons:
- Built-in model registry — save/load models with versioning, no need for separate DVC or MLflow just for this step
- Bento = complete packaged artifact — model + code + dependencies bundled into one artifact, deployable anywhere
- Automatic Docker image generation — one command produces a production-ready Docker image, no need to write complex Dockerfiles
The least-mentioned benefit is actually the most important for our team: no more “where’s the new model, what Python version, what dependencies does it need?” calls. The data scientist saves the model with a tag, DevOps takes that tag and builds Docker. This standardized workflow saves our team about 2 hours per new model release.
BentoML Deployment Guide on Linux
1. Installation
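# The examples below use the runner-based Service API (bentoml.Service, bentoml.io)
# from the BentoML 1.0/1.1 line; releases from 1.2 onward favor @bentoml.service.
pip install "bentoml>=1.0,<1.2" scikit-learn
# Verify
bentoml --version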
2. Save the Model to BentoML Model Store
The first step is adding your trained model to the BentoML registry. Example with scikit-learn:
import bentoml
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Train model (in practice, you already have your model)
iris = load_iris()
clf = RandomForestClassifier(n_estimators=100)
clf.fit(iris.data, iris.target)
# Save to BentoML model store
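saved_model = bentoml.sklearn.save_model(
    "iris_classifier",
    clf,
    # Mark predict as batchable so adaptive batching can be applied later
    signatures={"predict": {"batchable": True}},
    # Placeholder metadata; record the accuracy you actually measured
    metadata={"accuracy": 0.97, "dataset": "iris"},
)
print(f"Model saved: {saved_model.tag}")
# Output (the version hash will differ):
# Model saved: iris_classifier:3mxqpfzbs6tpjuqj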
Models are stored at ~/bentoml/models/, with automatic versioning via hash tag — rolling back just requires changing the tag.
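The tag mechanics can be pictured with a toy model store. This is a conceptual sketch only, not BentoML's implementation: the `ToyModelStore` class, its methods, and the version strings are all hypothetical. It shows why pinning a full `name:version` tag turns a rollback into a one-line change, while `latest` moves with every save.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModelStore:
    """Hypothetical in-memory stand-in for a versioned model store."""
    # model name -> list of versions, oldest first
    _versions: dict = field(default_factory=dict)

    def save(self, name, version):
        """Register a new version and return its full tag."""
        self._versions.setdefault(name, []).append(version)
        return f"{name}:{version}"

    def resolve(self, tag):
        """Turn 'name', 'name:latest', or 'name:<version>' into a full tag."""
        name, _, version = tag.partition(":")
        known = self._versions.get(name, [])
        if not known:
            raise KeyError(f"no model named {name!r}")
        if version in ("", "latest"):
            return f"{name}:{known[-1]}"  # newest saved version wins
        if version not in known:
            raise KeyError(f"unknown version {version!r} for {name!r}")
        return f"{name}:{version}"


store = ToyModelStore()
first = store.save("iris_classifier", "v1aaaa")   # made-up version strings
second = store.save("iris_classifier", "v2bbbb")

# A service pinned to "latest" picks up the new model automatically...
print(store.resolve("iris_classifier:latest"))
# ...while rolling back means pointing the service at the older full tag.
print(store.resolve(first))
```

In a real deployment the equivalent move is replacing `iris_classifier:latest` in the service definition with the full tag of the last known-good model.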
3. Create a BentoML Service
The core part — defining the API endpoint and processing logic:
# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON
# Load model from registry
iris_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner()
svc = bentoml.Service("iris_classifier_service", runners=[iris_runner])
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]
@svc.api(input=NumpyNdarray(shape=(-1, 4), dtype=np.float32), output=JSON())
async def predict(input_data: np.ndarray):
    # Run inference through the runner so requests can be batched
    batch_pred = await iris_runner.predict.async_run(input_data)
    # Map integer class indices to human-readable labels
    result = [IRIS_CLASSES[int(i)] for i in batch_pred]
    return {"predictions": result}
4. Test Locally
# Run development server
bentoml serve service:svc --reload
# Test with curl (in another terminal)
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '[[5.1, 3.5, 1.4, 0.2]]'
# Response:
# {"predictions": ["setosa"]}
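The same request can be sent from Python, which is handy for smoke tests in CI. The sketch below uses only the standard library and assumes the development server is listening on `localhost:3000` with the `/predict` route defined in `service.py`. The helper names (`build_predict_request`, `parse_predictions`, `predict`) are illustrative, not part of BentoML.

```python
import json
import urllib.request

PREDICT_URL = "http://localhost:3000/predict"  # assumes the local dev server


def build_predict_request(rows, url=PREDICT_URL):
    """Build the same POST the curl command sends: a JSON list of 4-feature rows."""
    for row in rows:
        if len(row) != 4:
            raise ValueError(f"expected 4 features per row, got {len(row)}")
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(raw_body):
    """Extract the list of class names from the service's JSON response."""
    return json.loads(raw_body)["predictions"]


def predict(rows, timeout=5.0):
    """Send the request; this call needs the service to be running."""
    request = build_predict_request(rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_predictions(response.read())
```

With the server running, `predict([[5.1, 3.5, 1.4, 0.2]])` performs the round trip. Building and parsing are kept separate so they can be unit-tested without a live server.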
5. Build the Bento Artifact
Create a bentofile.yaml file describing all dependencies:
service: "service:svc"
labels:
  owner: devops-team
  project: iris-api
include:
  - "service.py"
python:
  packages:
    - scikit-learn
    - numpy
Then run the build from the directory that contains bentofile.yaml:
bentoml build
# Output:
# Successfully built Bento(tag="iris_classifier_service:7a3bk2...")
# Bento size: 15.2 MB
6. Deploy to Production with Docker
# Containerize the Bento into a Docker image.
# By default the image is tagged with the Bento's version hash, so set an explicit tag with -t.
bentoml containerize iris_classifier_service:latest -t iris_classifier_service:latest
# Tag and push to registry
docker tag iris_classifier_service:latest your-registry.com/iris-api:v1
docker push your-registry.com/iris-api:v1
# Run container
docker run -p 3000:3000 iris_classifier_service:latest serve
7. Deploy Directly on Linux with systemd
If you’re not using Docker, create a systemd service to manage the process:
# /etc/systemd/system/iris-api.service
[Unit]
Description=BentoML Iris Classifier API
After=network.target
[Service]
User=www-data
WorkingDirectory=/opt/iris-api
ExecStart=/opt/iris-api/venv/bin/bentoml serve iris_classifier_service:latest \
    --host 0.0.0.0 \
    --port 3000 \
    --api-workers 4
Restart=always
RestartSec=5
Environment=BENTOML_HOME=/opt/iris-api/bentoml
[Install]
WantedBy=multi-user.target
Then reload systemd, enable the service at boot, and start it:
sudo systemctl daemon-reload
sudo systemctl enable iris-api
sudo systemctl start iris-api
sudo systemctl status iris-api
Things to Keep in Mind When Running in Production
Adaptive Batching
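BentoML has an adaptive batching mechanism: it automatically groups multiple small requests into a batch to increase throughput. It applies to model signatures saved with batchable set to True (as in step 2). Batching is tuned in the BentoML configuration file (a YAML file referenced by the BENTOML_CONFIG environment variable), not in bentofile.yaml. Settings are keyed by runner name, which defaults to the model name:
runners:
  iris_classifier:
    batching:
      enabled: true
      max_batch_size: 100
      max_latency_ms: 15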
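To build intuition for how the two limits interact, here is a simplified, deterministic sketch. It is not BentoML's actual algorithm, which adjusts its batching window dynamically. The `group_into_batches` function is hypothetical: it dispatches a batch when it reaches `max_batch_size`, or when the next request arrives after the oldest queued one has already waited longer than `max_latency_ms`.

```python
def group_into_batches(arrivals_ms, max_batch_size, max_latency_ms):
    """Group sorted request arrival times (in ms) into batches.

    A batch is closed when it is full, or when a new request arrives after
    the oldest queued request has waited longer than max_latency_ms.
    """
    batches, current = [], []
    for t in arrivals_ms:
        # Latency limit: the oldest queued request has waited too long
        if current and (t - current[0] > max_latency_ms):
            batches.append(current)
            current = []
        current.append(t)
        # Size limit: the batch is full, dispatch immediately
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever is left
    return batches


arrivals = [0, 2, 4, 30, 31, 32, 33, 90]

# Small batches: the size limit triggers most dispatches
print(group_into_batches(arrivals, max_batch_size=3, max_latency_ms=15))
# → [[0, 2, 4], [30, 31, 32], [33], [90]]

# Large batches: only the latency limit splits the stream
print(group_into_batches(arrivals, max_batch_size=100, max_latency_ms=15))
# → [[0, 2, 4], [30, 31, 32, 33], [90]]
```

The trade-off is visible in the output: a larger `max_batch_size` yields fewer, bigger model calls during bursts, while `max_latency_ms` caps how long an isolated request waits for company.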
Monitoring with Prometheus
BentoML exposes Prometheus metrics at /metrics out of the box. Just add it to your Prometheus config and you’re good to go:
scrape_configs:
  - job_name: 'bentoml'
    static_configs:
      - targets: ['your-server:3000']
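Before wiring up dashboards, it can help to look at what the /metrics endpoint returns. The sketch below parses the Prometheus text exposition format with a deliberately simplified parser that assumes label values contain no commas, spaces, or escaped quotes, and that samples carry no timestamps. The sample text and the `demo_request_total` metric name are made up for illustration; check your own /metrics output for the real metric names.

```python
SAMPLE = """\
# HELP demo_request_total Total requests (illustrative metric name)
# TYPE demo_request_total counter
demo_request_total{endpoint="/predict",http_response_code="200"} 42
demo_request_total{endpoint="/predict",http_response_code="500"} 1
demo_request_in_progress 0
"""


def parse_metrics(text):
    """Parse exposition-format lines into {(name, sorted label tuple): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, raw = pair.partition("=")
                labels[key.strip()] = raw.strip().strip('"')
        samples[(name, tuple(sorted(labels.items())))] = float(value)
    return samples


def error_rate(samples, metric="demo_request_total"):
    """Share of requests whose response code starts with 5."""
    total = errors = 0.0
    for (name, labels), value in samples.items():
        if name != metric:
            continue
        total += value
        if dict(labels).get("http_response_code", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


samples = parse_metrics(SAMPLE)
print(f"5xx error rate: {error_rate(samples):.3f}")
```

In production the same ratio would normally be computed in Prometheus itself. A parser like this is mainly useful for quick checks in a deployment script.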
Health Check Endpoint
BentoML serves /livez, /readyz, and /healthz out of the box, so Kubernetes liveness and readiness probes work immediately without writing any additional code.
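Outside Kubernetes, the same endpoint is useful in deployment scripts, for example to hold traffic until a freshly started service is ready. Here is a minimal sketch using only the standard library. The function names and the retry defaults are illustrative assumptions, and the probe and sleep functions are injectable so the logic can be tested without a server.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url, attempts=30, delay_s=1.0, probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; True on success, False after giving up."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no need to sleep after the final attempt
    return False
```

A deploy script might call `wait_until_ready("http://localhost:3000/readyz")` right after `systemctl start iris-api` and abort the rollout if it returns `False`.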
Conclusion
I’ve migrated 3 ML services from hand-written Flask to BentoML. The time to get a new model into production — from the moment the data scientist hands it off to when the API is live — dropped from around 3 hours to 20–30 minutes, mostly just Docker image build time. The model registry also makes rollbacks straightforward when a new model runs into issues.
BentoML isn’t a silver bullet. If you need to maximize GPU performance with TensorRT, Triton is still the better choice. But for most ML/DevOps teams, BentoML strikes a practical balance between deployment speed and production stability. Give it a try right now with pip install bentoml scikit-learn and follow the steps above.

