Machine Learning for Beginners: From Theory to Practice with scikit-learn

Artificial Intelligence tutorial - IT technology blog

Context — Why DevOps Engineers Should Care About Machine Learning

Honestly, I used to not care much about ML. I figured it was the data scientist’s job — my job was just to deploy and monitor. But then a sprint came along that required integrating a customer churn prediction feature into the pipeline, and I was the one who had to wrap that model into a REST API and push it to production. That’s when it hit me: without understanding how ML works, I couldn’t explain why the model was giving strange results, didn’t know what to monitor, and had no idea when to retrain.

Machine Learning isn’t as mysterious as many people think. The core idea is simple: let computers learn from data to make predictions. Instead of writing hard rules like if email contains "discount" then spam, you use historical data to let the machine find patterns on its own.

The three most common types of ML:

  • Supervised learning: Learns from labeled data. Examples: predicting house prices, classifying spam emails.
  • Unsupervised learning: Finds patterns in unlabeled data on its own. Example: clustering customers by shopping behavior.
  • Reinforcement learning: An agent learns through trial and error, receiving rewards or penalties. Used in game AI and autonomous robot navigation.

If you’re just starting out, focus on supervised learning first — results are clearly measurable, the tooling is mature, and there’s significantly more documentation compared to the other two.

Setting Up Your ML Environment

System Requirements

Python 3.9+ is enough to get started. Use a virtualenv from the very beginning — it sounds like a minor thing, but it will save you from a mess of package conflicts between projects down the road.

# Create a virtualenv
python3 -m venv ml-env
source ml-env/bin/activate  # Linux/macOS
# ml-env\Scripts\activate   # Windows

# Install required libraries
pip install scikit-learn pandas numpy matplotlib joblib

Verify Installation

import sklearn
import pandas as pd
import numpy as np

print(f"scikit-learn: {sklearn.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

If it runs without errors, you’re good to go. I’m running scikit-learn 1.4.x in production — the API is stable, and breaking changes between minor versions are rare.

Building Your First Model — Basic Classification

I’m using the Iris dataset — 150 flower samples with 4 features (petal and sepal length/width), classified into 3 species. Not the most realistic, but it’s enough to understand the full workflow without getting lost in data cleaning.

Load and Explore the Data

from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print(df.head())
print(df.describe())
print(df['species'].value_counts())

The output shows exactly 50 samples per species — a perfectly balanced dataset that needs no further processing. In the real world, don’t expect to be this lucky, but that’s a story for feature engineering.

Train/Test Split

This step is critically important and the one beginners most often skip. If you train and evaluate on the same data, the model will “memorize” that data and report deceptively good results — this is called overfitting. You’ll only discover the problem after deploying to production.

from sklearn.model_selection import train_test_split

X = iris.data   # Features (4 numeric columns)
y = iris.target # Labels (0, 1, 2)

# 80% train, 20% test — random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train size: {len(X_train)}")  # 120
print(f"Test size: {len(X_test)}")    # 30

The stratify=y parameter ensures that the class distribution in the train and test sets remains the same. This is especially important with imbalanced data — for example, if a dataset has 90% class 0 and you skip stratification, the test set could end up being all class 0, reporting 90% accuracy while the model hasn’t actually learned anything meaningful.
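
To see what stratify buys you, here's a minimal sketch with a deliberately imbalanced toy label array (90% class 0). The data here is made up purely for illustration, not part of the Iris example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# With stratify, the test set preserves the 90/10 ratio exactly
_, _, _, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(np.bincount(y_te))  # [18  2] — 18 class-0 and 2 class-1 samples
```

Drop stratify=y_toy and the class counts in the test set become a roll of the dice.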

Train a Model with Random Forest

I chose Random Forest not because the name sounds cool, but because it’s robust, requires minimal tuning, and the baseline often exceeds 90% accuracy without much configuration. It’s ideal for getting started.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train — just one line
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Results typically hit ~97% accuracy on Iris. But don’t stop at accuracy: classification_report is what really matters because it breaks down precision, recall, and F1 per class. In practice, imbalanced data is almost the default, and accuracy alone can be badly misleading.
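
Alongside the report, a confusion matrix shows exactly which classes get mixed up. A quick self-contained sketch repeating the same split and model from above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Rows = true class, columns = predicted class; off-diagonal cells are mistakes
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

On Iris the mistakes, if any, will be between versicolor and virginica — setosa is linearly separable from the rest.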

Saving Models and Monitoring in Production

Serialize the Model for Reuse

Once training is done, you need to serialize the model — you can’t re-train from scratch every time you deploy. For scikit-learn, joblib is faster than pickle for objects that contain large numpy arrays internally.

import joblib

# Save the model
joblib.dump(model, 'iris_model.pkl')
print("Model saved!")

# Load it back in production code
loaded_model = joblib.load('iris_model.pkl')

# Test prediction with new data
sample = [[5.1, 3.5, 1.4, 0.2]]  # This is Setosa
prediction = loaded_model.predict(sample)
probability = loaded_model.predict_proba(sample)

print(f"Predicted: {iris.target_names[prediction[0]]}")
print(f"Confidence: {probability.max():.2%}")

Log Predictions for Monitoring

This is the most expensive lesson I’ve ever learned. My team deployed a model and then… forgot about it. After 3 months, accuracy dropped from 94% to 71% because the data distribution shifted with the season — a phenomenon called data drift. It took a full week of investigation to find the root cause. Since then, I log every prediction:

import json
from datetime import datetime, timezone

def log_prediction(input_data, prediction, confidence, model_version="1.0"):
    """Log predictions for monitoring and drift detection later"""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input": input_data,
        "prediction": int(prediction),
        "confidence": float(confidence),
    }
    # Append to a JSONL file — easy to parse later
    with open("prediction_logs.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
    return log_entry

# Usage
log_prediction(
    input_data=[5.1, 3.5, 1.4, 0.2],
    prediction=0,
    confidence=0.98
)

Simple Data Drift Detection

Once you have enough prediction logs, compare the distribution of current inputs against what you saw at training time to catch drift early:

import numpy as np

def check_feature_drift(reference_data, new_data, threshold=0.1):
    """Compare the mean of each feature to detect drift"""
    ref_mean = np.mean(reference_data, axis=0)
    new_mean = np.mean(new_data, axis=0)
    
    # Normalize by the standard deviation of the reference data
    drift = np.abs(ref_mean - new_mean) / (np.std(reference_data, axis=0) + 1e-8)
    
    feature_names = iris.feature_names
    for i, d in enumerate(drift):
        status = "WARNING" if d > threshold else "OK"
        print(f"[{status}] {feature_names[i]}: drift={d:.3f}")
    
    return drift

# Run on a regular schedule; in production, new_data comes from recent
# logged inputs — here the test set stands in for "new" data
check_feature_drift(X_train, X_test)

This function runs as a cron job every Sunday in my production environment. If drift exceeds 0.2, it automatically creates a retrain ticket — no manual babysitting required.
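
The ticket-creation step can stay just as simple. Here's a sketch of the idea; writing a local JSON file stands in for whatever your real ticketing API call would be, and the drift scores below are made up for the example:

```python
import json
from datetime import datetime, timezone

def maybe_create_retrain_ticket(drift, feature_names, threshold=0.2,
                                path="retrain_ticket.json"):
    """Write a ticket file if any feature drifted past the threshold."""
    drifted = [name for name, d in zip(feature_names, drift) if d > threshold]
    if not drifted:
        return None
    ticket = {
        "created": datetime.now(timezone.utc).isoformat(),
        "reason": "data drift",
        "features": drifted,
    }
    with open(path, "w") as f:
        json.dump(ticket, f)
    return ticket

# Example with made-up drift scores for the 4 Iris features
ticket = maybe_create_retrain_ticket(
    [0.05, 0.31, 0.12, 0.25],
    ["sepal length", "sepal width", "petal length", "petal width"],
)
print(ticket["features"])  # ['sepal width', 'petal width']
```

Swap the file write for a POST to Jira, GitHub Issues, or whatever your team actually uses.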

Next Steps After This Guide

Now that you have the full workflow down, here’s the roadmap I recommend:

  1. Practice with real datasets: Kaggle has Titanic and House Prices — great for practicing feature engineering with much messier data than Iris.
  2. Cross-validation: Use cross_val_score instead of a single train/test split — it gives a model evaluation that’s less influenced by the luck of how you split the data.
  3. Feature engineering: Handle missing values, encode categoricals, scale features — this takes up 80% of the effort in real projects, while model training is just the remaining 20%.
  4. Hyperparameter tuning: Use GridSearchCV or RandomizedSearchCV to optimize parameters — RandomizedSearchCV is much faster when the search space is large.
  5. MLflow: Version control for models and experiment tracking — essential when working in a team or running many experiments in parallel.
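
Step 2 above in code form — a quick sketch of 5-fold cross-validation on the same Iris setup:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV: each fold takes a turn as the held-out test set
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(f"Accuracy per fold: {scores}")
print(f"Mean: {scores.mean():.2%} +/- {scores.std():.2%}")
```

The spread across folds tells you how much your single 80/20 split was luck.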

ML doesn’t require you to be a math whiz or a pure data scientist. What matters is understanding what problem you’re solving, choosing the right metric, and remembering that deploying a model isn’t the finish line. The rest? Open a terminal and get started.
