2 AM: My AI Pipeline Stopped Working
I was building a RAG (Retrieval-Augmented Generation) tool to aggregate information from technical documentation sites. The demo deadline was the next morning. Everything ran fine on localhost with sample data — but when I started crawling real data from production sites, the system started returning nothing but garbage.
The logs were full of things like this:
```
[ERROR] Parsed content: "Please enable JavaScript to view this page"
[ERROR] Parsed content: "Verifying you are human. This may take a few seconds."
[ERROR] Empty markdown extracted from https://docs.example.com/api-reference
```
BeautifulSoup was capturing HTML all right — but it was the loading screen’s HTML, not the actual content. Feed that garbage into an AI and the output is just as bad.
The Problem: Modern Web Isn’t Scraper-Friendly
After about 30 minutes of midnight debugging, I finally understood the problem. Modern documentation sites and web apps typically throw three kinds of problems at traditional scrapers:
- JavaScript rendering: Content loads through React/Vue/Next.js — BeautifulSoup only sees the HTML before JS runs.
- Anti-bot protection: Cloudflare, CAPTCHA, and User-Agent detection block automated requests.
- Dynamic content: Infinite scroll, lazy loading, and content that depends on user interaction.
I tried switching to Selenium. It worked — but it was painfully slow. Crawling 50 pages took 20 minutes, not to mention memory leaks after a few hours of continuous operation. For a pipeline that needs to process thousands of pages, this wasn’t a viable path.
```python
# Old BeautifulSoup approach - fails with JS-rendered pages
import requests
from bs4 import BeautifulSoup

def scrape_old_way(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0..."})
    soup = BeautifulSoup(response.text, "html.parser")
    # Result: "Please enable JavaScript" 😭
    return soup.get_text()
```
What Is Firecrawl and Why It’s Different
I found Firecrawl after 20 minutes of desperate Googling at 3 AM. It’s an API service (with a self-hosted option), not a general-purpose scraper — it was built to do one thing: provide clean data for AI pipelines. Unlike the tools above, it:
- Fully renders JavaScript before extracting content
- Automatically converts HTML to clean Markdown — the format LLMs digest best
- Handles anti-bot protection, rate limiting, and retries completely automatically
- Crawls entire websites by depth, not just individual pages
Installing the Firecrawl Python SDK
```bash
pip install firecrawl-py
```
Get your API key at firecrawl.dev — there’s a free tier for testing. Then set the environment variable:
```bash
export FIRECRAWL_API_KEY="fc-your-api-key-here"
```
Practical Ways to Use Firecrawl
1. Scraping a Single Page
The simplest use case: give it a URL, get back clean Markdown.
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Scrape a single page, get back Markdown
result = app.scrape_url(
    "https://docs.python.org/3/library/asyncio.html",
    formats=["markdown"]
)
print(result["markdown"][:500])
# Output: clean content, ready to feed into an LLM
```
The result is pure Markdown — no excess HTML tags, no junk navigation menus, no ad-filled footers. Anyone who’s ever spent time hand-writing regex to extract content will immediately understand how much work this saves.
2. Crawling an Entire Website by Depth
When you want to index an entire docs site for a RAG pipeline, rather than going one page at a time:
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Crawl the entire docs site, up to 50 pages
crawl_result = app.crawl_url(
    "https://docs.example.com",
    limit=50,
    scrape_options={"formats": ["markdown"]}
)

for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Content length: {len(page['markdown'])} chars")
    print("---")
```
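In practice I usually persist the crawl output straight to disk so the next pipeline stage can pick it up. Here’s a small sketch of that step — `save_pages` and its slug logic are my own helper, not part of the SDK; it just assumes each page is a dict shaped like the crawl results above:

```python
import re
from pathlib import Path

def save_pages(pages, out_dir="crawl_output"):
    """Write each crawled page's Markdown to its own .md file.

    `pages` is a list of dicts shaped like Firecrawl crawl results:
    {"markdown": "...", "metadata": {"sourceURL": "..."}}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in pages:
        url = page["metadata"]["sourceURL"]
        # Turn the URL into a safe filename, e.g. "docs.example.com-api-reference.md"
        slug = re.sub(r"[^a-zA-Z0-9.-]+", "-", url.split("://", 1)[-1]).strip("-")
        (out / f"{slug}.md").write_text(page["markdown"], encoding="utf-8")
    return sorted(p.name for p in out.iterdir())

# Demo with a stubbed crawl result (no API call needed)
pages = [{"markdown": "# API Reference",
          "metadata": {"sourceURL": "https://docs.example.com/api-reference"}}]
print(save_pages(pages, out_dir="/tmp/crawl_demo"))
```

Flat files are enough for a demo; swap this for object storage or a database once the corpus grows.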
3. Extracting Structured Data with LLM
The feature I use most: define a schema up front, and Firecrawl uses AI to fill it in from the page:
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Structured extraction from a product page
result = app.extract(
    ["https://example.com/product/123"],
    schema={
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "features": {
                "type": "array",
                "items": {"type": "string"}
            },
            "availability": {"type": "boolean"}
        },
        "required": ["product_name", "price"]
    }
)
print(result["data"])
# Output: {"product_name": "...", "price": 29.99, "features": [...], ...}
```
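LLM-based extraction occasionally hallucinates fields, so I validate the result against the same JSON Schema before trusting it. A sketch using the `jsonschema` package (a separate `pip install jsonschema`, not part of Firecrawl; `check_extraction` is my own helper):

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "features": {"type": "array", "items": {"type": "string"}},
        "availability": {"type": "boolean"},
    },
    "required": ["product_name", "price"],
}

def check_extraction(data: dict) -> bool:
    """Return True only if the extracted dict conforms to the schema."""
    try:
        validate(instance=data, schema=schema)
        return True
    except ValidationError as e:
        print(f"Schema violation: {e.message}")
        return False

print(check_extraction({"product_name": "Widget", "price": 29.99}))   # True
print(check_extraction({"product_name": "Widget", "price": "cheap"})) # False
```

Validating at the boundary means a malformed extraction fails loudly instead of poisoning everything downstream.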
The Best Approach: Combining Firecrawl with an LLM
This is the pattern I’ve been running in production since patching that pipeline at 2 AM. It combines Firecrawl for crawling with Claude for content processing:
```python
import os

import anthropic
from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def research_topic(url: str, question: str) -> str:
    """Crawl a URL and use Claude to answer a question from its content."""
    # Step 1: Fetch clean content from the web
    scrape_result = firecrawl.scrape_url(url, formats=["markdown"])
    content = scrape_result.get("markdown", "")
    if not content:
        return "Unable to retrieve content from this URL."

    # Step 2: Use Claude to answer based on the crawled content
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"""Based on the following web content, please answer the question.

Content:
{content[:8000]}

Question: {question}""",
            }
        ],
    )
    return response.content[0].text

# Example usage
answer = research_topic(
    url="https://docs.python.org/3/library/asyncio-task.html",
    question="What is the difference between asyncio.create_task() and asyncio.ensure_future()?"
)
print(answer)
```
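One known weakness of the function above: `content[:8000]` silently drops everything past the first 8000 characters. For long reference pages I run a paragraph-aware chunker instead and query each chunk. A sketch — `chunk_markdown` is my own helper, not an SDK function:

```python
def chunk_markdown(text: str, max_chars: int = 8000) -> list:
    """Split Markdown into chunks of at most max_chars, breaking on blank
    lines so paragraphs and code blocks are less likely to be cut mid-stream."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph still has to be hard-split
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
    if current:
        chunks.append(current)
    return chunks

# Demo: ~23,000 chars of fake Markdown fits in a few <=8000-char chunks
doc = "\n\n".join(f"Paragraph {i}: " + "x" * 100 for i in range(200))
chunks = chunk_markdown(doc, max_chars=8000)
print(len(chunks), max(len(c) for c in chunks))
```

You can then call Claude once per chunk and merge the answers, or embed the chunks for the indexer below.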
Building a Simple RAG Indexer
Need to store data for repeated queries? Here’s the skeleton of a minimal indexer:
```python
import os
from typing import Dict, List

from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

class SimpleRAGIndexer:
    def __init__(self):
        self.documents: List[Dict] = []

    def index_website(self, base_url: str, max_pages: int = 20):
        """Crawl and index an entire website."""
        print(f"Crawling {base_url}...")
        result = firecrawl.crawl_url(
            base_url,
            limit=max_pages,
            scrape_options={"formats": ["markdown"]}
        )
        for page in result.get("data", []):
            if page.get("markdown"):
                self.documents.append({
                    "url": page["metadata"]["sourceURL"],
                    "content": page["markdown"],
                    "title": page["metadata"].get("title", "")
                })
        print(f"Indexed {len(self.documents)} pages.")
        return self.documents

    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Simple keyword search — use a vector DB in production."""
        query_lower = query.lower()
        results = [
            {**doc, "score": doc["content"].lower().count(query_lower)}
            for doc in self.documents
            if doc["content"].lower().count(query_lower) > 0
        ]
        return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]

# Usage
indexer = SimpleRAGIndexer()
indexer.index_website("https://docs.python.org/3/library/", max_pages=30)
relevant_docs = indexer.search("async await coroutine")
for doc in relevant_docs:
    print(f"URL: {doc['url']} | Score: {doc['score']}")
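The exact-phrase counter in `search` misses documents that contain the query words in a different order. As a middle step before a real vector DB, you can rank by bag-of-words cosine similarity with nothing but the standard library. A sketch — `tokenize` and `cosine_score` are my own names:

```python
import math
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words term frequencies."""
    return Counter(text.lower().split())

def cosine_score(query: str, document: str) -> float:
    """Cosine similarity between the term-frequency vectors of query and document."""
    q, d = tokenize(query), tokenize(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = [
    "async await coroutine task loop",
    "css grid layout flexbox",
]
scores = [cosine_score("async coroutine", d) for d in docs]
print(scores)  # the asyncio-ish doc scores higher; the CSS doc scores 0.0
```

It still has no notion of synonyms — that’s what embeddings buy you — but it degrades far more gracefully than exact substring counts.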
Practical Notes from Real-World Use
I’ve been using it in production for about 3 months. Overall it works well — but there are a few things worth knowing upfront:
- Rate limits: Firecrawl’s free tier has a monthly request limit. Estimate your needs before choosing a plan — don’t get cut off mid-crawl on production data.
- Self-hosted option: Firecrawl is open-source (`mendableai/firecrawl` on GitHub). Deploy it on your own VPS if you need data privacy or want long-term cost control.
- Cache your results: Crawling the same URL repeatedly burns through your quota. Cache with a sensible TTL — 24 hours for documentation, 1 hour for news feeds.
- robots.txt: Firecrawl respects robots.txt by default. Overriding it requires explicit configuration — and always check whether you’re actually authorized to crawl that site.
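The caching advice above is about ten lines of code. Here’s the in-memory sketch I use — class and method names are my own, and in production you’d likely back it with disk or Redis rather than a dict:

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    """Cache scrape results per URL for `ttl` seconds to save API quota."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 24 * 3600):
        self.fetch = fetch          # any callable that takes a URL, e.g. a scrape wrapper
        self.ttl = ttl
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        hit = self._store.get(url)
        if hit and now - hit[0] < self.ttl:
            return hit[1]           # fresh cached copy, no API call
        content = self.fetch(url)   # miss or expired: fetch and re-store
        self._store[url] = (now, content)
        return content

# Demo with a stub fetcher that counts real "API" calls
calls = []
cache = TTLCache(fetch=lambda url: calls.append(url) or f"content of {url}", ttl=3600)
cache.get("https://docs.example.com", now=0)
cache.get("https://docs.example.com", now=100)   # served from cache
cache.get("https://docs.example.com", now=5000)  # TTL expired, refetch
print(len(calls))  # 2
```

Injecting `now` makes the expiry logic testable without sleeping; drop the parameter once you trust it.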
Quick Comparison of Solutions
| Tool             | JS Render | Clean Output | Easy to Use | Cost               |
|------------------|-----------|--------------|-------------|--------------------|
| BeautifulSoup    | ❌        | ❌           | ✅          | Free               |
| Selenium         | ✅        | ❌           | ❌          | Free (slow)        |
| Playwright       | ✅        | ❌           | Medium      | Free (needs infra) |
| Firecrawl API    | ✅        | ✅           | ✅          | Paid / Self-host   |
| Firecrawl (self) | ✅        | ✅           | Medium      | Infra only         |
For AI pipelines that need clean, consistent data, Firecrawl solves the problem without writing another 500 lines of edge-case handling code. Not because it’s “the best” on every metric — BeautifulSoup is still perfectly fine for static sites. My RAG pipeline now crawls around 200–300 pages per day, with an error rate under 2%, and most importantly — no more late-night scraper debugging sessions.

