Scraping ‘Stubborn’ Data with Playwright Python: The Ultimate Solution for JavaScript-Heavy Websites

Python tutorial - IT technology blog

Background & Why Choose Playwright?

Have you ever eagerly used BeautifulSoup or Requests to fetch data, only to receive a blank HTML file or nothing but <script> tags? This is a common pain point when dealing with modern websites like Shopee, Facebook, or stock charts. These sites use Single Page Application (SPA) architecture, where data is only rendered after JavaScript executes.

In these cases, static HTML parsing libraries are completely powerless. I used to rely on Selenium. However, Selenium is quite heavy and often causes driver version conflicts on Linux. One of my automation projects once bloated from 200 lines to 2000 lines of Selenium code, making driver management a nightmare. Since switching to Playwright, execution speed has significantly increased, especially with its seamless asynchronous (async) handling.

Playwright isn’t just fast; it has a secret weapon: Auto-wait. It automatically waits for elements to appear before performing actions, making your scripts much more stable than using time.sleep() indiscriminately.

Setup in Seconds

Forget about hunting for the ChromeDriver version that matches your browser. Playwright handles it all for you. First, install the library:

pip install playwright

Then, run the command to download the browser engines (Chromium, Firefox, WebKit):

playwright install chromium

My tip: if you're only scraping standard websites, installing Chromium alone is enough, saving about 300MB of disk space.

Real-world Configuration

Don’t just call goto and expect the data to be there, or you’ll get an empty array. For JavaScript-heavy sites, configuration is survival.

1. Leverage Async API for Speed

If you need to scrape 100 pages simultaneously, don’t use Sync. Use asyncio instead. In a real project, I reduced the scraping time for 500 products from 15 minutes to under 3 minutes by switching to Async.

import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        # Use headless=True when running on a server to save RAM
        browser = await p.chromium.launch(headless=True) 
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
        )
        page = await context.new_page()
        
        await page.goto("https://tiki.vn/search?q=laptop")
        
        # Automatically wait until the product list is displayed
        await page.wait_for_selector(".product-item")
        
        titles = await page.locator(".product-name").all_text_contents()
        print(f"Found {len(titles)} products!")

        await browser.close()

asyncio.run(run())
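The run() coroutine above handles a single page. To actually scrape many pages concurrently, as claimed earlier, you can fan out with asyncio.gather while a semaphore keeps the number of open pages bounded. This is a generic sketch: scrape_one stands in for a coroutine like run() adapted to accept a URL, and the limit of 5 is an assumption you should tune to your machine's RAM.

```python
import asyncio

async def scrape_many(urls, scrape_one, max_concurrency=5):
    # Limit simultaneous pages so the browser doesn't exhaust RAM
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await scrape_one(url)

    # gather preserves input order, so results line up with urls
    return await asyncio.gather(*(bounded(u) for u in urls))
```

With this helper, scraping 500 product pages becomes a single asyncio.run(scrape_many(urls, scrape_one)) call instead of a sequential loop.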

2. Handling Infinite Scroll

Sites like Facebook or TikTok only load more data when you scroll down. Instead of sleeping for a fixed 5 seconds, use a script that simulates real user scrolling; it triggers the site’s lazy-loading events the way a genuine browser session would.

async def scroll_page(page):
    await page.evaluate("""
        async () => {
            let totalHeight = 0;
            let distance = 200;
            // scrollHeight grows as new content loads, so the loop
            // keeps running until the page stops getting longer
            while (totalHeight < document.body.scrollHeight) {
                window.scrollBy(0, distance);
                totalHeight += distance;
                await new Promise(r => setTimeout(r, 100));
            }
        }
    """)

3. Applying Page Object Model (POM)

When code exceeds 1000 lines, don’t cram everything into a single function. Break it down into Classes representing each page. This helps you fix bugs faster and makes script maintenance easier when you return to it 3-6 months later.
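A minimal sketch of what such a Page Object might look like, reusing the selectors from the earlier search example. The class name, URL, and selectors here are illustrative assumptions, not a fixed API; the point is that scraping logic lives in named methods instead of one giant function.

```python
class SearchPage:
    """Page Object wrapping the product search page."""

    def __init__(self, page):
        # page is a Playwright async-API Page created by the caller
        self.page = page

    async def search(self, keyword):
        # Hypothetical URL and selector: adjust to the real site
        await self.page.goto(f"https://tiki.vn/search?q={keyword}")
        await self.page.wait_for_selector(".product-item")

    async def product_names(self):
        return await self.page.locator(".product-name").all_text_contents()
```

When the site changes its layout, you fix one selector in one class instead of hunting through a 1000-line script.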

Monitoring and Debugging: The Trace Viewer Pro Tip

Running scripts on a server (headless) and getting errors? Where do you even start? Don’t just stare at dry text logs.

1. Trace Viewer Black Box

This is Playwright’s “unrivaled” feature. It acts like an airplane’s black box, recording every millisecond: videos, clicks, network requests, and even console errors.

await context.tracing.start(screenshots=True, snapshots=True)
# Scraping logic...
await context.tracing.stop(path="debug_trace.zip")

Just drag and drop the debug_trace.zip file into trace.playwright.dev to see the entire sequence. It has saved me hours of guessing why a script failed.

2. Screenshots on Failure

Always wrap your code in a try...except block. If an error occurs, take a screenshot immediately. This is the fastest way to detect if a website has changed its layout or if you’ve hit a Captcha/Bot detection wall.
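One way to sketch this pattern is a small wrapper that screenshots the page before re-raising. The selectors and file name are illustrative assumptions; page.screenshot with path and full_page is the real Playwright call.

```python
async def scrape_with_evidence(page, url, shot_path="failure.png"):
    """Run the scraping steps; capture a screenshot if anything fails."""
    try:
        await page.goto(url)
        await page.wait_for_selector(".product-item")
        return await page.locator(".product-name").all_text_contents()
    except Exception:
        # A picture of the page at failure time beats a bare traceback:
        # it exposes layout changes and Captcha walls instantly
        await page.screenshot(path=shot_path, full_page=True)
        raise
```

Re-raising after the screenshot keeps the original traceback intact, so your logs still tell you *what* failed while the image shows you *why*.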

Scraping JavaScript-heavy web data isn’t hard if you choose the right tools. Playwright lets you stop worrying about the browser and focus on data extraction logic. Start with small scripts, configure Trace Viewer from the get-go, and you’ll find your automation becomes much more professional.
