Firecrawl: Web Data Collection for AI Applications When Traditional Scrapers Fall Short

2 AM: My AI Pipeline Stopped Working

I was building a RAG (Retrieval-Augmented Generation) tool to aggregate information from technical documentation sites. The demo deadline was the next morning. Everything ran fine on localhost with sample data — but when I started crawling real data from production sites, the system started returning nothing but garbage.

The logs were full of things like this:

[ERROR] Parsed content: "Please enable JavaScript to view this page"
[ERROR] Parsed content: "Verifying you are human. This may take a few seconds."
[ERROR] Empty markdown extracted from https://docs.example.com/api-reference

BeautifulSoup was capturing HTML all right — but it was the loading screen’s HTML, not the actual content. Feed that garbage into an AI and the output is just as bad.

The Problem: Modern Web Isn’t Scraper-Friendly

After about 30 minutes of midnight debugging, I finally understood what was going on. Modern documentation sites and web pages typically trip up traditional scrapers in three ways:

  • JavaScript rendering: Content loads through React/Vue/Next.js — BeautifulSoup only sees the HTML before JS runs.
  • Anti-bot protection: Cloudflare, CAPTCHA, and User-Agent detection block automated requests.
  • Dynamic content: Infinite scroll, lazy loading, and content that depends on user interaction.

I tried switching to Selenium. It worked — but it was painfully slow. Crawling 50 pages took 20 minutes, not to mention memory leaks after a few hours of continuous operation. For a pipeline that needs to process thousands of pages, this wasn’t a viable path.

# Old BeautifulSoup approach - fails with JS-rendered pages
import requests
from bs4 import BeautifulSoup

def scrape_old_way(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0..."})
    soup = BeautifulSoup(response.text, "html.parser")
    # Result: "Please enable JavaScript" 😭
    return soup.get_text()
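
For comparison, the Selenium fallback I tried looked roughly like this. It's a sketch, assuming Chrome plus Selenium 4 (which manages its own driver), and the implicit wait is a placeholder rather than a tuned value. It does render JavaScript, but every page pays for a full browser session, which is where the 20 minutes for 50 pages came from.

# Selenium fallback - renders JS, but slow and resource-hungry
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_with_selenium(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.implicitly_wait(10)  # crude wait for client-side rendering
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()  # skip this and the memory leaks pile up fast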

What Is Firecrawl and Why It’s Different

I found Firecrawl after 20 minutes of desperate Googling at 3 AM. It's an API service (with a self-hosted option), not a general-purpose scraper; it was built to do one thing: provide clean data for AI pipelines. Compared to other tools, it:

  • Fully renders JavaScript before extracting content
  • Automatically converts HTML to clean Markdown — the format LLMs digest best
  • Handles anti-bot protection, rate limiting, and retries completely automatically
  • Crawls entire websites by depth, not just individual pages

Installing the Firecrawl Python SDK

pip install firecrawl-py

Get your API key at firecrawl.dev — there’s a free tier for testing. Then set the environment variable:

export FIRECRAWL_API_KEY="fc-your-api-key-here"

Practical Ways to Use Firecrawl

1. Scraping a Single Page

The simplest use case: give it a URL, get back clean Markdown.

import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Scrape a single page, get back Markdown
result = app.scrape_url(
    "https://docs.python.org/3/library/asyncio.html",
    formats=["markdown"]
)

print(result["markdown"][:500])
# Output: clean content, ready to feed into an LLM

The result is pure Markdown — no excess HTML tags, no junk navigation menus, no ad-filled footers. Anyone who’s ever spent time hand-writing regex to extract content will immediately understand how much work this saves.

2. Crawling an Entire Website by Depth

Say you want to index an entire docs site for a RAG pipeline, not one page at a time:

import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Crawl the entire docs site, up to 50 pages
crawl_result = app.crawl_url(
    "https://docs.example.com",
    limit=50,
    scrape_options={"formats": ["markdown"]}
)

for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Content length: {len(page['markdown'])} chars")
    print("---")

3. Extracting Structured Data with an LLM

The feature I use most: you define a schema, and Firecrawl uses an LLM to fill it in from the page:

import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Structured extraction from a product page
result = app.extract(
    ["https://example.com/product/123"],
    schema={
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "features": {
                "type": "array",
                "items": {"type": "string"}
            },
            "availability": {"type": "boolean"}
        },
        "required": ["product_name", "price"]
    }
)

print(result["data"])
# Output: {"product_name": "...", "price": 29.99, "features": [...], ...}

The Best Approach: Combining Firecrawl with an LLM

This is the pattern I've been running in production since patching that pipeline at 2 AM. It combines Firecrawl for crawling with Claude for content processing:

import os
import anthropic
from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def research_topic(url: str, question: str) -> str:
    """
    Crawl a URL and use Claude to answer a question from its content.
    """
    # Step 1: Fetch clean content from the web
    scrape_result = firecrawl.scrape_url(url, formats=["markdown"])
    content = scrape_result.get("markdown", "")

    if not content:
        return "Unable to retrieve content from this URL."

    # Step 2: Use Claude to answer based on the crawled content
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"""Based on the following web content, please answer the question.

Content:
{content[:8000]}

Question: {question}"""
            }
        ]
    )

    return response.content[0].text

# Example usage
answer = research_topic(
    url="https://docs.python.org/3/library/asyncio-task.html",
    question="What is the difference between asyncio.create_task() and asyncio.ensure_future()?"
)
print(answer)
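
One caveat with the function above: it truncates the page at 8,000 characters, so anything past that cut-off is invisible to Claude. For longer pages I chunk the Markdown first and query each piece separately. Here is a minimal, model-agnostic sketch; the chunk size and overlap are arbitrary starting points, not tuned values.

# Naive chunking helper - splits long Markdown into overlapping pieces
from typing import List

def chunk_markdown(text: str, chunk_size: int = 8000, overlap: int = 500) -> List[str]:
    """Split long Markdown into overlapping chunks so nothing is silently dropped."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk can be sent through a research_topic-style prompt separately
# and the partial answers merged afterwards.
for i, chunk in enumerate(chunk_markdown("x" * 20000)):
    print(f"chunk {i}: {len(chunk)} chars")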

Building a Simple RAG Indexer

Need to store data for repeated queries? Here’s the skeleton of a minimal indexer:

import os
from firecrawl import FirecrawlApp
from typing import List, Dict

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

class SimpleRAGIndexer:
    def __init__(self):
        self.documents: List[Dict] = []

    def index_website(self, base_url: str, max_pages: int = 20):
        """Crawl and index an entire website."""
        print(f"Crawling {base_url}...")

        result = firecrawl.crawl_url(
            base_url,
            limit=max_pages,
            scrape_options={"formats": ["markdown"]}
        )

        for page in result.get("data", []):
            if page.get("markdown"):
                self.documents.append({
                    "url": page["metadata"]["sourceURL"],
                    "content": page["markdown"],
                    "title": page["metadata"].get("title", "")
                })

        print(f"Indexed {len(self.documents)} pages.")
        return self.documents

    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Simple keyword search — use a vector DB in production."""
        query_lower = query.lower()
        results = [
            {**doc, "score": doc["content"].lower().count(query_lower)}
            for doc in self.documents
            if doc["content"].lower().count(query_lower) > 0
        ]
        return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]

# Usage
indexer = SimpleRAGIndexer()
indexer.index_website("https://docs.python.org/3/library/", max_pages=30)

relevant_docs = indexer.search("async await coroutine")
for doc in relevant_docs:
    print(f"URL: {doc['url']} | Score: {doc['score']}")

Practical Notes from Real-World Use

I’ve been using it in production for about 3 months. Overall it works well — but there are a few things worth knowing upfront:

  • Rate limits: Firecrawl’s free tier has a monthly request limit. Estimate your needs before choosing a plan — don’t get cut off mid-crawl on production data.
  • Self-hosted option: Firecrawl is open-source (mendableai/firecrawl on GitHub). Deploy it on your own VPS if you need data privacy or want long-term cost control.
  • Cache your results: Crawling the same URL repeatedly burns through your quota. Cache with a sensible TTL: 24 hours for documentation, 1 hour for news feeds. A minimal caching sketch follows this list.
  • robots.txt: Firecrawl respects robots.txt by default. Overriding it requires explicit configuration — and always check whether you’re actually authorized to crawl that site.
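
Here is the caching wrapper I mentioned above, roughly as I run it. It's a minimal sketch: on-disk JSON keyed by a URL hash with a configurable TTL, and it assumes the dict-shaped scrape responses used throughout this post. Swap in Redis or whatever cache you already operate.

# Minimal on-disk cache so repeated scrapes don't burn quota
import hashlib
import json
import os
import time
from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
CACHE_DIR = ".firecrawl_cache"

def cached_scrape(url: str, ttl_seconds: int = 24 * 3600) -> dict:
    """Scrape a URL, reusing an on-disk copy if it's younger than the TTL."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".json")

    # Serve from cache while the entry is still fresh
    if os.path.exists(cache_path) and time.time() - os.path.getmtime(cache_path) < ttl_seconds:
        with open(cache_path) as f:
            return json.load(f)

    # Cache miss: spend a Firecrawl request and store the result
    result = firecrawl.scrape_url(url, formats=["markdown"])
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result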

Quick Comparison of Solutions

Tool             | JS Render | Clean Output | Easy to Use | Cost
-----------------|-----------|--------------|-------------|------------------
BeautifulSoup    |     ❌    |      ❌      |     ✅      | Free
Selenium         |     ✅    |      ❌      |     ❌      | Free (slow)
Playwright       |     ✅    |      ❌      |   Medium    | Free (needs infra)
Firecrawl API    |     ✅    |      ✅      |     ✅      | Paid / Self-host
Firecrawl (self) |     ✅    |      ✅      |   Medium    | Infra only

For AI pipelines that need clean, consistent data, Firecrawl solves the problem without another 500 lines of edge-case handling code. Not because it's "the best" on every metric; BeautifulSoup is still perfectly fine for static sites. My RAG pipeline now crawls around 200–300 pages per day with an error rate under 2%, and most importantly, there are no more late-night scraper debugging sessions.
