Integrating ChatGPT API into Your Web App: Streaming, Conversation History, and Real-World Deployment

Artificial Intelligence tutorial - IT technology blog

Quick Start: Get a Chat Endpoint Running in 5 Minutes

I’ll show you the fastest way to build a backend that accepts a prompt and returns a ChatGPT response, then explain why you’ll eventually need something more robust. If you’re completely new to the OpenAI API with Python, learn those fundamentals first before diving into Flask integration.

Install the dependencies:

pip install flask openai python-dotenv flask-cors

Create a .env file:

OPENAI_API_KEY=sk-proj-xxxxxx

Minimal app.py:

from flask import Flask, request, jsonify
from flask_cors import CORS
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
app = Flask(__name__)
CORS(app)
client = OpenAI()

@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.json.get("message", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}]
    )
    return jsonify({"reply": response.choices[0].message.content})

if __name__ == "__main__":
    app.run(debug=True)

Quick test with curl:

curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is Docker?"}'

That works. But shipping this to production runs into at least three serious problems.

3 Real Problems You Need to Solve

Problem 1: Slow Responses — Terrible UX

ChatGPT typically takes 3–10 seconds to generate a long response. Users stare at a blank screen with no feedback — the app feels frozen. Streaming solves this: receive each text chunk as the model generates it and display it progressively, just like the ChatGPT.com interface.

Problem 2: The Chatbot Has No Memory

The OpenAI API is completely stateless — each request is independent, with no connection to previous ones. For a chatbot to “remember” the conversation, you must send the entire conversation history and context memory with every API call.
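In practice, "memory" is just a list that your backend rebuilds and resends in full on every request. A minimal sketch (no API call; the helper name is illustrative):

```python
# Each turn appends to a plain list that is resent in full with every call.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, user_text, assistant_text):
    """Append one user/assistant exchange to the history list."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

add_turn(history, "My name is An.", "Nice to meet you, An!")
add_turn(history, "What is my name?", "Your name is An.")

# The second request only "remembers" the name because the first
# exchange is physically included in the messages payload.
print(len(history))  # 5 messages total
```

The model never sees anything you don't put in `messages`, which is exactly why history length directly drives cost.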

Problem 3: Exposed API Key When Calling from the Frontend

Calling OpenAI directly from JavaScript in the browser means anyone who opens DevTools can see your API key. Every request must go through your backend — this is rule #1 when working with any AI API.

The Complete Solution: Streaming + Conversation History

The backend uses Server-Sent Events (SSE) — a mechanism that lets the server push text to the browser piece by piece rather than waiting for the full response. I’ve been running this approach in production for a few months; users see text appearing progressively instead of staring at a blank screen for 5–8 seconds, even though the model’s total generation time hasn’t changed.

Flask Backend with Streaming

from flask import Flask, request, jsonify, Response, stream_with_context
from flask_cors import CORS
from openai import OpenAI
from dotenv import load_dotenv
import json

load_dotenv()
app = Flask(__name__)
CORS(app)
client = OpenAI()

# In-memory conversation store.
# In production, use Redis or a database instead.
conversations = {}

SYSTEM_PROMPT = """You are an IT expert, specializing in Linux and DevOps.
Please reply in Vietnamese, concisely, and include code examples when necessary.
If you are unsure, please state it directly instead of making things up."""

MAX_HISTORY = 20  # Keep a maximum of 20 messages.

def trim_history(history):
    system_msg = [m for m in history if m["role"] == "system"]
    other_msgs = [m for m in history if m["role"] != "system"]
    if len(other_msgs) > MAX_HISTORY:
        other_msgs = other_msgs[-MAX_HISTORY:]
    return system_msg + other_msgs

@app.route("/chat/stream", methods=["POST"])
def chat_stream():
    data = request.json
    session_id = data.get("session_id", "default")
    user_message = data.get("message", "")

    if session_id not in conversations:
        conversations[session_id] = [{"role": "system", "content": SYSTEM_PROMPT}]

    conversations[session_id].append({"role": "user", "content": user_message})
    conversations[session_id] = trim_history(conversations[session_id])

    def generate():
        full_response = ""
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=conversations[session_id],
            stream=True,
            max_tokens=1000
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                text = chunk.choices[0].delta.content
                full_response += text
                yield f"data: {json.dumps({'text': text})}\n\n"

        conversations[session_id].append({
            "role": "assistant",
            "content": full_response
        })
        yield f"data: {json.dumps({'done': True})}\n\n"

    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no"  # Required when behind Nginx
        }
    )

@app.route("/chat/reset", methods=["POST"])
def reset():
    session_id = request.json.get("session_id", "default")
    conversations.pop(session_id, None)
    return jsonify({"status": "ok"})
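The trimming logic is easy to verify in isolation. Here is the same function with a quick standalone check (`MAX_HISTORY` lowered from 20 to 4 just for the demo):

```python
MAX_HISTORY = 4  # lowered from 20 for this demo

def trim_history(history):
    # Always keep the system prompt; cap everything else at MAX_HISTORY.
    system_msg = [m for m in history if m["role"] == "system"]
    other_msgs = [m for m in history if m["role"] != "system"]
    if len(other_msgs) > MAX_HISTORY:
        other_msgs = other_msgs[-MAX_HISTORY:]
    return system_msg + other_msgs

history = [{"role": "system", "content": "system prompt"}]
for i in range(10):
    history.append({"role": "user", "content": f"msg {i}"})

trimmed = trim_history(history)
print(len(trimmed))           # 5: system prompt + 4 newest messages
print(trimmed[1]["content"])  # msg 6
```

Note that the system prompt survives trimming no matter how long the conversation gets, which is what keeps the bot's persona stable.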

Frontend JavaScript: Receiving the Stream and Displaying in Real Time

Use the Fetch API with ReadableStream — no additional libraries needed:

const sessionId = Math.random().toString(36).slice(2, 11); // simple random session id

async function sendMessage(message) {
    const assistantDiv = createMessageDiv("assistant", "");

    const response = await fetch("/chat/stream", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ session_id: sessionId, message })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk can end mid-line, so keep any partial line
        // in the buffer until the rest of it arrives.
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop();

        for (const line of lines) {
            if (!line.startsWith("data: ")) continue;
            try {
                const data = JSON.parse(line.slice(6));
                if (data.text) {
                    assistantDiv.textContent += data.text;
                }
            } catch (e) {
                // Ignore malformed fragments
            }
        }
    }
}

Advanced: Rate Limiting and Cost Control

Hard-learned lesson: without rate limiting, a bot or runaway user can push your OpenAI bill to $30–50 in a single night. Add flask-limiter:

pip install flask-limiter

Then, in app.py:

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import openai

limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"]
)

@app.route("/chat/stream", methods=["POST"])
@limiter.limit("10 per minute")
def chat_stream():
    try:
        # ... same streaming code as above (errors raised mid-stream
        # must be handled inside the generator, not here)
        pass
    except openai.RateLimitError:
        return jsonify({"error": "The system is busy, please try again in 30 seconds."}), 429
    except openai.APIConnectionError:
        return jsonify({"error": "Unable to connect to OpenAI API"}), 503
    except openai.AuthenticationError:
        return jsonify({"error": "Invalid API key"}), 401
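Rate limiting caps request counts, not token spend. A rough per-session budget guard can complement it; this is a sketch with illustrative numbers (the price and cap below are assumptions, not official figures, so check OpenAI's current pricing yourself):

```python
import time

# Illustrative numbers -- not official pricing.
EST_COST_PER_1K_TOKENS = 0.0006  # assumed blended gpt-4o-mini rate, USD
DAILY_BUDGET_USD = 1.00          # hypothetical per-session cap

spend = {}  # session_id -> (day, dollars spent so far)

def charge(session_id, tokens_used):
    """Record estimated spend; return False once the daily cap is exceeded."""
    today = time.strftime("%Y-%m-%d")
    day, spent = spend.get(session_id, (today, 0.0))
    if day != today:
        day, spent = today, 0.0  # reset each day
    spent += tokens_used / 1000 * EST_COST_PER_1K_TOKENS
    spend[session_id] = (day, spent)
    return spent <= DAILY_BUDGET_USD

print(charge("abc", 500_000))    # True: roughly $0.30 so far
print(charge("abc", 2_000_000))  # False: roughly $1.50, over budget
```

In a real deployment you would read `response.usage.total_tokens` from the API response (or count streamed chunks) instead of passing a number by hand, and keep the counters in Redis rather than a process-local dict.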

Practical Tips for Production Deployment

Nginx Requires Special Configuration for SSE

Nginx buffers the entire response before sending it to the client by default. With SSE, this causes the stream to “freeze” — text only appears all at once when the request ends, completely defeating the streaming effect. You must disable buffering:

location /chat/ {
    proxy_pass http://127.0.0.1:5000;
    proxy_buffering off;        # Mandatory for SSE
    proxy_cache off;
    proxy_read_timeout 300s;    # Long enough for a long response
    add_header X-Accel-Buffering no;
}

Choose the Right Model for Your Use Case

  • gpt-4o-mini — Fast, cheap (roughly 30× cheaper than gpt-4o), sufficient for standard support chatbots
  • gpt-4o — Use when you need complex analysis or high-quality code generation
  • gpt-4o with structured outputs — When you need JSON responses with a fixed schema without format hallucination

System Prompt Determines Output Quality

An empty system prompt is the most common mistake. Define the role clearly, the response style, and topic boundaries — the chatbot will hallucinate less and stay on-tone noticeably better. Whenever a chatbot goes off-topic or sounds wrong, the system prompt is always the first thing I check. If you’re weighing whether to use GPT-4o versus other frontier models for your project, this model comparison for junior devs breaks down the trade-offs clearly.

Don’t Store Conversation History Forever

Long conversation history = more tokens = higher cost per request. The trim_history() function above keeps the 20 most recent messages — roughly 10,000–15,000 tokens depending on content, enough to maintain context without blowing your budget. If you need to remember longer-term information (e.g., user names, preferences), summarize it and inject it into the system prompt rather than keeping the full history. For a deeper look at memory architecture patterns, see building conversation history and context memory from scratch.
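One way to fold long-term facts back into the prompt is a small builder that appends a summary to the base system prompt. A sketch (the summarization step itself would be a separate model call; names here are illustrative):

```python
BASE_PROMPT = "You are an IT support assistant."

def build_system_prompt(base, user_facts):
    """Inject a short summary of long-term user facts into the system prompt."""
    if not user_facts:
        return base
    facts = "; ".join(f"{k}: {v}" for k, v in user_facts.items())
    return f"{base}\nKnown about this user: {facts}"

prompt = build_system_prompt(BASE_PROMPT, {"name": "An", "os": "Ubuntu 22.04"})
print(prompt)
# You are an IT support assistant.
# Known about this user: name: An; os: Ubuntu 22.04
```

The summary costs a few dozen tokens per request instead of thousands, while the trimmed message list handles short-term context.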
