aiohttp: The Secret to 10x Faster Data Crawling and API Calls with Python – ITFROMZERO

Why You Should Stop Using Requests for Large Tasks

If you work with Python, requests is likely your best friend when it comes to calling APIs. It’s simple and extremely stable. However, trouble arises when you need to handle massive workloads, such as scraping data for 5,000 products on an e-commerce platform or checking 10,000 links for errors simultaneously.

The problem is that requests is synchronous. When you send a request, the calling thread blocks (the program “freezes”) until the server responds. Suppose each request takes 0.5 seconds: processing 1,000 web pages one after another takes 1,000 × 0.5 = 500 seconds, or over 8 minutes. In a production environment, this is an unacceptable waste of resources.

After optimizing a news collection system for a real-world project, I switched completely to aiohttp. The results were astounding: execution time dropped from 15 minutes to less than 40 seconds. By leveraging the power of asyncio, aiohttp allows you to send a batch of requests without waiting for the previous one to finish before starting the next.
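
To make the difference concrete without touching the network, here is a minimal sketch that simulates requests with asyncio.sleep. The names fake_request, run_sequential and run_concurrent, and the 0.05-second delay, are assumptions made up for this illustration; real speedups depend on the server and your connection.

```python
import asyncio
import time

async def fake_request(delay: float) -> str:
    # asyncio.sleep stands in for network latency; no real HTTP call is made.
    await asyncio.sleep(delay)
    return "ok"

async def run_sequential(n: int, delay: float) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await fake_request(delay)  # each call waits for the previous one
    return time.perf_counter() - start

async def run_concurrent(n: int, delay: float) -> float:
    start = time.perf_counter()
    # All n simulated requests wait at the same time.
    await asyncio.gather(*(fake_request(delay) for _ in range(n)))
    return time.perf_counter() - start

async def compare(n: int = 20, delay: float = 0.05):
    return await run_sequential(n, delay), await run_concurrent(n, delay)

seq_time, conc_time = asyncio.run(compare())
print(f"Sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

The sequential run takes roughly n times the delay, while the concurrent run takes roughly one delay, because the waiting overlaps.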

Quick Start in 5 Minutes

To get started, install the library along with aiodns, an asynchronous DNS resolver:

pip install aiohttp aiodns

Here is how to implement a basic data fetching function in an asynchronous style:

import aiohttp
import asyncio

async def fetch_status(url):
    # A throwaway session is fine for a single request; reuse one session for batches
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            status = response.status
            print(f"Status: {status} from {url}")
            return await response.text()

if __name__ == "__main__":
    # Run the async function in the event loop
    asyncio.run(fetch_status('https://google.com'))

A small note: You must use the async and await keywords. This is the signal telling Python: “While waiting for the network response, go do something else, don’t just sit there twiddling your thumbs!”.

The Secret Lies in ClientSession

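Many beginners make the mistake of creating a new ClientSession for every function call. Each session owns its own connection pool, so this throws away connection reuse. If the sessions are never closed, it also leaks sockets and memory and triggers “Unclosed client session” warnings.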

Think of ClientSession as a continuous water pipe. You should open it once and share it across all requests to take advantage of connection pooling. Reusing established connections avoids repeated TCP and TLS handshakes, which can save up to 30% of processing time.
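
As a rough sketch of that pattern, the session is created once by the caller and passed into the function that does the fetching. The helper names fetch_text and crawl are hypothetical, and the requests deliberately run one after another here to keep the focus on session reuse.

```python
import asyncio
import aiohttp

async def fetch_text(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch one page using a session owned by the caller."""
    async with session.get(url) as response:
        return await response.text()

async def crawl(urls: list[str]) -> list[str]:
    # One session, and therefore one connection pool, for the whole batch.
    async with aiohttp.ClientSession() as session:
        pages = []
        for url in urls:
            pages.append(await fetch_text(session, url))
        return pages
```

You would call it with something like asyncio.run(crawl(["https://example.com"] * 3)), where example.com is only a placeholder target. Requests to the same host can then reuse a kept-alive connection instead of opening a new one each time.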

How to Handle Multiple APIs Simultaneously

This is where aiohttp truly shows its strength. We will collect the request coroutines into a list and run them concurrently using asyncio.gather.

import aiohttp
import asyncio
import time

async def get_data(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def main():
    urls = ['https://jsonplaceholder.typicode.com/posts/1'] * 50
    
    async with aiohttp.ClientSession() as session:
        tasks = [get_data(session, url) for url in urls]
        # Trigger 50 requests at once
        results = await asyncio.gather(*tasks)
        print(f"Processed {len(results)} results")

start = time.perf_counter()
asyncio.run(main())
print(f"Finished in: {time.perf_counter() - start:.2f} seconds")
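
One caveat with asyncio.gather: by default, the first exception raised by any task propagates out of gather and you lose the other results. A minimal sketch of one way around this, using return_exceptions=True, follows. The flaky_job function is a made-up stand-in for a request that sometimes fails, not part of aiohttp.

```python
import asyncio

async def flaky_job(i: int) -> int:
    # Stand-in for a request; every third job fails on purpose.
    await asyncio.sleep(0.01)
    if i % 3 == 0:
        raise ValueError(f"job {i} failed")
    return i

async def run_batch(n: int):
    # return_exceptions=True places exceptions in the result list
    # instead of raising the first one out of gather().
    results = await asyncio.gather(
        *(flaky_job(i) for i in range(n)), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed

ok, failed = asyncio.run(run_batch(10))
print(f"{len(ok)} succeeded, {len(failed)} failed")
```

Results come back in the same order as the input coroutines, so you can pair each failure with its URL and retry only those.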

Rate Limiting to Avoid IP Bans

Sending too many requests in a single second can make a server treat you as a DDoS attacker. To avoid getting your IP flagged, I usually use asyncio.Semaphore to limit the number of concurrent requests.

A Semaphore acts like a traffic officer, only allowing a certain number of vehicles to pass through a tunnel at a time.

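import asyncio

# Allow a maximum of 10 concurrent requests.
# On Python versions before 3.10, create the semaphore inside the running
# coroutine instead, because it binds to an event loop at construction time.
limit = asyncio.Semaphore(10)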

async def safe_fetch(session, url):
    async with limit:
        async with session.get(url) as response:
            return await response.read()
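
If you want to convince yourself that the semaphore really caps concurrency, here is a small simulation, assuming asyncio.sleep as a stand-in for the HTTP round trip. The names limited_job and run_jobs and the stats dictionary are hypothetical helpers for this sketch.

```python
import asyncio

async def limited_job(limit: asyncio.Semaphore, stats: dict) -> None:
    async with limit:
        # Track how many jobs are inside the semaphore at the same time.
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
        stats["in_flight"] -= 1

async def run_jobs(total: int, max_concurrent: int) -> int:
    # Creating the semaphore inside the coroutine ties it to the running loop.
    limit = asyncio.Semaphore(max_concurrent)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(limited_job(limit, stats) for _ in range(total)))
    return stats["peak"]

peak = asyncio.run(run_jobs(total=50, max_concurrent=10))
print(f"Peak concurrency: {peak}")
```

All 50 jobs are scheduled at once, but the recorded peak never goes above the semaphore's limit.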

During the data extraction process, if you struggle with complex Regex strings, try using the Regex tester tool at toolcraft.app. It helps you quickly verify patterns directly in the browser, avoiding the time-consuming process of modifying code and re-running scripts repeatedly.

Professional Error and Timeout Handling

Unstable networks are a common occurrence. aiohttp's default total timeout is 5 minutes, so without a tighter limit a single stalled server can hold up your script for a long time.

import aiohttp
import asyncio

async def fetch_json(url):
    # Configure a total timeout of 10 seconds
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(url) as resp:
                return await resp.json()
        except asyncio.TimeoutError:
            print("Error: Server took too long to respond!")
        except aiohttp.ClientError as e:
            print(f"Network connection error: {e}")
    # Reached only when the request failed
    return None
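
Timeouts can also be tuned more finely than a single total. The sketch below assumes illustrative values and a hypothetical helper named fetch_or_none. ClientTimeout also accepts connect and sock_read limits, and a per-request timeout can override the session default. The sketch also calls raise_for_status so that 4xx and 5xx responses are treated as errors rather than parsed as data.

```python
import asyncio
import aiohttp

# Hypothetical values chosen for illustration; tune them for your targets.
SESSION_TIMEOUT = aiohttp.ClientTimeout(total=10, connect=3, sock_read=5)
SLOW_ENDPOINT_TIMEOUT = aiohttp.ClientTimeout(total=30)

async def fetch_or_none(session, url, timeout=None):
    """Return the body as bytes, or None if the request fails or times out."""
    # Only pass timeout when one was given, so the session default applies otherwise.
    kwargs = {"timeout": timeout} if timeout is not None else {}
    try:
        async with session.get(url, **kwargs) as resp:
            resp.raise_for_status()  # 4xx/5xx raise ClientResponseError
            return await resp.read()
    except (asyncio.TimeoutError, aiohttp.ClientError):
        return None
```

A session created with aiohttp.ClientSession(timeout=SESSION_TIMEOUT) would use the 10-second budget everywhere, while fetch_or_none(session, url, timeout=SLOW_ENDPOINT_TIMEOUT) relaxes it for one known-slow endpoint.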

A Few Small but Valuable Tips

After years of operating large-scale scraping systems, I’ve gathered four key insights:

Always Use Context Managers: Use async with so that sessions and responses are released even when errors occur.
Upgrade to ujson: If you are parsing heavy JSON payloads, install ujson and pass it to aiohttp explicitly, for example resp.json(loads=ujson.loads); installing it alone changes nothing. It can be about 3-5 times faster than the default json module, significantly reducing CPU load.
Leverage DNS Cache: aiohttp caches DNS lookups by default, and installing aiodns lets it resolve domains asynchronously, which is extremely useful when calling hundreds of different domains.
Know When to Stop: If your script only calls 1-2 simple APIs, just use requests to keep things simple. Don’t overcomplicate the issue if you don’t actually need high performance.

At first, getting used to asynchronous thinking might be a bit confusing. However, once you master aiohttp, you will have an incredibly powerful tool for high-performance, large-scale data processing.