Whisper OpenAI: A Practical Guide to Speech-to-Text with Python

Artificial Intelligence tutorial - IT technology blog

After six months running Whisper in production to transcribe audio files from internal podcasts and meeting recordings, I’ve gathered several key insights that most tutorials never mention. In this post, I’ll get straight to the point: which approach to choose, why, and how to deploy it in practice.

1. Three Main Approaches to Using Whisper

When I first started working with Whisper, I saw three clear paths forward:

  • Whisper local (open-source) — runs directly on your machine/server, no API costs
  • Whisper API (OpenAI) — called via API, pay per minute of audio
  • faster-whisper — a Whisper port using CTranslate2, 4–5x faster than the original

All three produce similar-looking output. But pick the wrong one and you’ll regret it fast — they differ drastically in speed, cost, and privacy.

2. Pros and Cons of Each Approach

Whisper local (openai-whisper)

The original repo from OpenAI, installed via pip, runs completely offline. The biggest advantage is no API cost and data never leaving your server — critical for internal company audio.

The downside: it’s slow. The large-v3 model takes roughly 3–4 minutes to transcribe 1 minute of audio on a typical CPU. An NVIDIA GPU speeds things up significantly, but not every server has one. I dropped this approach after two weeks because the audio queue was backing up too much.

Whisper API (OpenAI)

Called via openai.audio.transcriptions.create(), currently priced at around $0.006/minute of audio — cheaper than I expected. File limit is 25MB, supported formats: mp3, mp4, wav, webm, flac…

Advantages: extremely fast — a 60-minute audio file processes in about 1–2 minutes, with no hardware concerns. Disadvantages: audio must be sent to OpenAI’s servers — not suitable for sensitive data, and requires a stable internet connection.
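
At that per-minute rate, budgeting is simple arithmetic. A quick sketch (the $0.006/minute rate is the one quoted above; verify against OpenAI's current pricing before relying on it):

```python
# Rough cost estimator for the Whisper API, using the $0.006/min
# rate quoted above -- check OpenAI's current pricing page.
RATE_PER_MINUTE = 0.006

def estimate_cost(duration_minutes: float) -> float:
    """Return the approximate API cost in USD for an audio file."""
    return round(duration_minutes * RATE_PER_MINUTE, 4)

# A 60-minute meeting recording:
print(estimate_cost(60))   # 0.36
```

So even a daily one-hour recording stays well under a dollar a day.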

faster-whisper

I’ve been running this in production for nearly four months now. faster-whisper runs locally but is significantly faster than the original thanks to INT8 quantization. The large-v3 model transcribes 1 minute of audio in just 40–60 seconds on a 4-core CPU — perfectly acceptable for batch processing.

3. Choosing the Right Approach

Here’s a quick summary:

  • Public audio, need speed, budget is fine → Whisper API
  • Internal/sensitive audio, have a GPU → openai-whisper local with the large-v3 model
  • Internal audio, CPU only, need near-real-time → faster-whisper

My use case: internal meeting recordings, minimal budget, and I don’t want company audio going to the cloud → faster-whisper was the obvious choice.

4. Practical Deployment Guide

Installing Dependencies

# Create a virtualenv first
python -m venv venv
source venv/bin/activate

# Install faster-whisper (recommended)
pip install faster-whisper

# Or use the original OpenAI version
pip install openai-whisper

# ffmpeg is required — handles audio format conversion
sudo apt install ffmpeg   # Ubuntu/Debian
brew install ffmpeg        # macOS
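
Before transcribing anything, it's worth confirming ffmpeg actually ended up on your PATH — a missing binary produces confusing errors deep inside the libraries. A small sanity check (the helper name is my own):

```python
import shutil

def check_ffmpeg() -> bool:
    """Return True if the ffmpeg binary is available on PATH."""
    return shutil.which("ffmpeg") is not None

if not check_ffmpeg():
    print("ffmpeg not found -- install it before transcribing")
```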

Transcribing with faster-whisper (local)

from faster_whisper import WhisperModel

# Load the model once, reuse it multiple times
# device="cpu" for servers without a GPU
# compute_type="int8" significantly reduces RAM usage
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

def transcribe_audio(audio_path: str, language: str = "vi") -> str:
    """
    Transcribe an audio file to text.
    language="vi" for Vietnamese, None for auto-detect.
    """
    segments, info = model.transcribe(
        audio_path,
        language=language,
        beam_size=5,
        vad_filter=True,       # filter silence, reduce hallucination
        vad_parameters=dict(min_silence_duration_ms=500)
    )
    
    # segments is a generator — collect all results
    transcript = " ".join(segment.text.strip() for segment in segments)
    return transcript

# Usage
result = transcribe_audio("meeting_2024.mp3")
print(result)

Practical note: vad_filter=True is a parameter I added after noticing Whisper tends to hallucinate text from silent segments — the model essentially “imagines” words out of nowhere. Enabling VAD filter resolves this issue almost entirely.

Transcribing with the Whisper API (OpenAI)

import os

import openai
from pathlib import Path

# Read the key from the environment rather than hardcoding it in source
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def transcribe_with_api(audio_path: str, language: str = "vi") -> str:
    audio_file = Path(audio_path)
    
    # Files > 25MB need to be split first
    if audio_file.stat().st_size > 25 * 1024 * 1024:
        raise ValueError(f"File too large: {audio_file.stat().st_size / 1024 / 1024:.1f}MB. Limit is 25MB.")
    
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,
            response_format="text"   # or "json" to get additional metadata
        )
    
    return transcript

result = transcribe_with_api("interview.mp3")
print(result)

Handling Long Files — Splitting Audio

For recordings longer than 30 minutes, I use pydub to split them before sending to the API:

# Install pydub first: pip install pydub
from pydub import AudioSegment
import os

def split_audio(audio_path: str, chunk_minutes: int = 10) -> list[str]:
    """Split audio into smaller chunks and return a list of file paths."""
    audio = AudioSegment.from_file(audio_path)
    chunk_ms = chunk_minutes * 60 * 1000
    chunks = []
    
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"/tmp/chunk_{i:03d}.mp3"
        chunk.export(chunk_path, format="mp3", bitrate="64k")
        chunks.append(chunk_path)
    
    return chunks

def transcribe_long_audio(audio_path: str) -> str:
    chunks = split_audio(audio_path, chunk_minutes=10)
    full_transcript = []
    
    try:
        for chunk_path in chunks:
            text = transcribe_with_api(chunk_path)
            full_transcript.append(text)
    finally:
        # Clean up temp files
        for f in chunks:
            os.unlink(f)
    
    return " ".join(full_transcript)

Output with Timestamps

When you need to know when each segment was spoken (useful for subtitles or reviewing meetings), faster-whisper returns timestamps natively:

def transcribe_with_timestamps(audio_path: str) -> list[dict]:
    segments, _ = model.transcribe(audio_path, language="vi", vad_filter=True)
    
    result = []
    for seg in segments:
        result.append({
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "text": seg.text.strip()
        })
    
    return result

# Sample output:
# [{"start": 0.0, "end": 4.5, "text": "Hello everyone, today I'll be..."}, ...]
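
For the subtitle use case specifically, that list of segments maps directly onto the SRT format. A minimal converter — my own helper, not part of faster-whisper:

```python
def to_srt(segments: list[dict]) -> str:
    """Convert [{"start", "end", "text"}, ...] into SRT subtitle text."""
    def fmt(seconds: float) -> str:
        # SRT timestamps look like 00:01:23,450
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks)

print(to_srt([{"start": 0.0, "end": 4.5, "text": "Hello everyone"}]))
```

Write the result to a .srt file next to the audio and most video players will pick it up automatically.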

5. Common Issues

Incorrect Language Recognition

Always pass language="vi" instead of relying on auto-detect. Auto-detect sometimes misidentifies Vietnamese as… Chinese or Khmer. I’ve seen this happen with low-quality audio.
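
When you do need auto-detect (mixed-language archives, say), faster-whisper reports the detected language and its confidence on the info object, so you can fall back to forcing Vietnamese when detection looks shaky. A sketch — the 0.8 threshold is my own rule of thumb, not an official recommendation:

```python
def trust_detection(detected: str, probability: float,
                    expected: str = "vi", threshold: float = 0.8) -> bool:
    """Return True if the auto-detected language is safe to keep."""
    return detected == expected and probability >= threshold

def transcribe_safely(model, audio_path: str) -> str:
    # language=None lets Whisper auto-detect the language
    segments, info = model.transcribe(audio_path, language=None)
    if not trust_detection(info.language, info.language_probability):
        # Detection disagreed or was low-confidence: re-run pinned to Vietnamese
        segments, info = model.transcribe(audio_path, language="vi")
    return " ".join(seg.text.strip() for seg in segments)
```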

Insufficient RAM When Loading Large Models

faster-whisper with INT8 quantization: large-v3 needs around 1.5–2GB RAM, medium around 0.8GB. If your server is RAM-constrained, use medium — accuracy drops slightly but remains sufficient for standard Vietnamese. I’ve used this approach on a 4GB production staging environment with stable results.
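
Based on those INT8 footprints (roughly 2GB for large-v3, 0.8GB for medium), the model choice can be reduced to a simple rule. The cutoffs here are my own, chosen to leave headroom above each model's footprint; "small" is the next Whisper size down, not covered in this post:

```python
def pick_model(available_ram_gb: float) -> str:
    """Pick a Whisper model size from available RAM, using the
    approximate INT8 footprints above (~2GB large-v3, ~0.8GB medium)."""
    if available_ram_gb >= 3.0:
        return "large-v3"
    if available_ram_gb >= 1.5:
        return "medium"
    return "small"
```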

Poor Audio Quality

Whisper handles moderately noisy audio reasonably well. But for phone recordings in noisy environments, pre-process with ffmpeg first:

# Normalize + basic noise reduction
ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" output.mp3
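
To run that same filter chain from Python as a preprocessing step, I wrap it in subprocess. The filter values simply mirror the command above (band-pass to the speech range plus afftdn noise reduction):

```python
import subprocess

def build_denoise_cmd(input_path: str, output_path: str) -> list[str]:
    """Build the ffmpeg command shown above: high-pass/low-pass filters
    around the speech band plus afftdn noise reduction."""
    return [
        "ffmpeg", "-y", "-i", input_path,
        "-af", "highpass=f=200,lowpass=f=3000,afftdn=nf=-25",
        output_path,
    ]

def preprocess(input_path: str, output_path: str) -> None:
    # check=True raises if ffmpeg exits with an error
    subprocess.run(build_denoise_cmd(input_path, output_path), check=True)
```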

Conclusion

Whisper is the best speech-to-text tool I’ve used for Vietnamese — it outperforms Google Speech-to-Text in every use case I’ve tested. If I were starting from scratch, I’d go straight to faster-whisper with the medium or large-v3 model depending on server RAM, enable vad_filter, and hardcode language="vi". Those steps alone solve 90% of the common issues.
