Python Automation #4: Web Scraping Part 2 — Dynamic Pages with Playwright

In part 3 we learned to fetch HTML and parse it with BeautifulSoup. But apply the same code to certain sites and something strange happens: the browser clearly shows the data, yet the HTML your code receives is an empty shell. In this post we cover how to handle these JavaScript-rendered pages with Playwright.

Why the HTML comes back empty

The practice site quotes.toscrape.com has a JavaScript-rendered version at the /js/ path. Let’s fetch it exactly the way part 3 did.

trying the part 3 approach
import requests
from bs4 import BeautifulSoup

html = requests.get("https://quotes.toscrape.com/js/").text
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select(".quote")))  # 0

Open it in a browser and you see ten quotes; the code says zero. This page's server sends only a skeleton HTML plus JavaScript code, and the actual data appears on screen only after the browser executes that JavaScript. requests can't execute JavaScript, so the skeleton is all it gets. The general fix is to drive a real browser, one that can run JavaScript, from code. The tool for that is Playwright.

Installing Playwright

Installation is two steps: install the Python package, then download the browser binary it will drive.

install
pip install playwright
playwright install chromium

If you’re on uv, run uv add playwright followed by uv run playwright install chromium. The playwright install chromium step downloads a Chromium build dedicated to Playwright. It’s managed separately from your everyday Chrome, so your existing browser settings are untouched.

First script: open a page and take a screenshot

first_browser.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless=True is the default
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.screenshot(path="page.png")
    browser.close()

Run it and nothing appears on screen, yet page.png captures the page with all ten quotes rendered. That’s because the default is headless mode — a browser running in the background with no window. When you want to watch and debug, switch to p.chromium.launch(headless=False) and an actual browser window opens, showing the code driving it step by step.

Selectors and waiting: wait_for_selector

The core concept in dynamic page scraping is waiting. Even after the page opens, the data appears only once the JavaScript finishes running, so code that reads too early gets an empty result all over again. Playwright’s answer is wait_for_selector.

waiting until an element appears
page.goto("https://quotes.toscrape.com/js/")
page.wait_for_selector(".quote")       # wait until .quote appears
print(page.locator(".quote").count())  # 10

The difference from a fixed wait like time.sleep(5) matters. sleep waits the full five seconds even when the data shows up in one second, and fails outright when it doesn’t show up within five. wait_for_selector moves to the next line the instant the element appears, and keeps waiting up to a default of 30 seconds. Fast when things are fast, patient when they’re slow. Note that when you click or extract text through a locator, Playwright waits automatically, so explicit waits belong where you need an anchor point — like right after entering a page.
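To make the contrast with a fixed sleep concrete, here is a minimal pure-Python sketch of the poll-until-ready idea. It is an illustration only, not Playwright's actual implementation; wait_until, ready_after, and the 0.1-second polling interval are hypothetical names and values chosen for this example.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns a truthy value.

    Returns the number of seconds waited. Raises TimeoutError if the
    condition is still false once `timeout` seconds have passed.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


def ready_after(delay):
    """Build a predicate that turns true `delay` seconds from now (a stand-in for slow data)."""
    ready_at = time.monotonic() + delay
    return lambda: time.monotonic() >= ready_at
```

Calling wait_until(ready_after(0.3)) returns after roughly 0.3 seconds instead of a fixed five, while wait_until(ready_after(5), timeout=0.5) raises TimeoutError rather than carrying on with no data.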

Clicking, typing, logging in

Pages behind a login work the same way: fill to type, click to click.

login automation
import os

USER = os.environ["QUOTES_USER"]
PASSWORD = os.environ["QUOTES_PASSWORD"]
page.goto("https://quotes.toscrape.com/login")
page.fill("#username", USER)
page.fill("#password", PASSWORD)
page.click("input[type=submit]")
page.wait_for_selector("a[href='/logout']")  # confirm login succeeded

One security note: never write the password directly in the code. Code ends up in git and on shared screens. Split the credentials out into environment variables as above and set both in the terminal before running (export QUOTES_USER=yourname and export QUOTES_PASSWORD=yourpassword); that's the baseline. Also note the wait_for_selector on the last line. The logout link appearing means the login actually succeeded, so it doubles as a checkpoint before moving on.
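If either variable is unset, os.environ["..."] stops the script with a bare KeyError. A small sketch of a friendlier check follows; load_credentials is a hypothetical helper, not part of Playwright, and it assumes the same two variable names used in the login snippet.

```python
import os


def load_credentials(env=os.environ):
    """Return (user, password) from the environment, or exit with a clear message."""
    names = ("QUOTES_USER", "QUOTES_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # SystemExit prints the message and stops the script without a traceback
        raise SystemExit("Set these environment variables first: " + ", ".join(missing))
    return env["QUOTES_USER"], env["QUOTES_PASSWORD"]
```

In the login script this would replace the two os.environ lookups: USER, PASSWORD = load_credentials().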

Infinite scroll and “load more”

Pages that load more data as you scroll are everywhere. The practice site’s /scroll path is exactly this structure. The trick: scroll down and repeat until the item count stops growing.

reading infinite scroll to the end
page.goto("https://quotes.toscrape.com/scroll")
page.wait_for_selector(".quote")
prev = 0
while True:
    page.mouse.wheel(0, 5000)    # scroll down
    page.wait_for_timeout(1000)  # 1 second for loading
    count = page.locator(".quote").count()
    if count == prev:            # stop when it stops growing (100 items)
        break
    prev = count
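One weakness of the loop above is that a single slow response (more than one second) looks the same as reaching the end. A hedged variant is sketched below: it stops only after several rounds in a row with no growth and caps the total number of rounds. scroll_to_end and its patience and max_rounds parameters are hypothetical names for this example; page is assumed to be a Playwright sync Page.

```python
def scroll_to_end(page, item_selector=".quote", pause_ms=1000, patience=2, max_rounds=100):
    """Scroll until the item count stops growing for `patience` rounds in a row.

    Returns the final item count.
    """
    items = page.locator(item_selector)
    count = items.count()
    stalled = 0
    for _ in range(max_rounds):
        page.mouse.wheel(0, 5000)        # scroll down
        page.wait_for_timeout(pause_ms)  # give the next batch time to load
        new_count = items.count()
        if new_count > count:
            count, stalled = new_count, 0  # progress: reset the stall counter
        else:
            stalled += 1
            if stalled >= patience:        # no growth several times in a row
                break
    return count
```

With patience=1 this behaves like the original loop; higher values trade a few extra seconds at the end for fewer premature stops.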

For pages where you press a “load more” button for the next batch, the structure is even simpler: keep calling page.click("text=Load more") for as long as the button exists.
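The "load more" loop described above can be sketched as follows. The practice site has no such button, so the "text=Load more" selector is a placeholder, and click_until_gone, pause_ms, and max_clicks are hypothetical names; page is assumed to be a Playwright sync Page.

```python
def click_until_gone(page, button_selector="text=Load more", pause_ms=500, max_clicks=100):
    """Click the button repeatedly until it no longer exists; return the click count."""
    button = page.locator(button_selector)
    clicks = 0
    while clicks < max_clicks and button.count() > 0:
        button.first.click()             # .first avoids an error if several elements match
        page.wait_for_timeout(pause_ms)  # let the next batch render
        clicks += 1
    return clicks
```

The max_clicks cap is a safety net so that a button that never disappears cannot keep the script running forever.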

Extracting data and saving to CSV

The finish is the same as part 3. On the fully rendered page, pull data with locator and inner_text() and save it to CSV. It maps directly onto BeautifulSoup’s select and get_text, so the structure should look familiar.

scrape_quotes.py
import csv
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector(".quote")
    rows = []
    for q in page.locator(".quote").all():
        text = q.locator(".text").inner_text()
        author = q.locator(".author").inner_text()
        rows.append([text, author])
    browser.close()

with open("quotes.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])
    writer.writerows(rows)
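The script above passes encoding="utf-8-sig" without comment. That codec prefixes the file with a three-byte UTF-8 byte order mark, which Excel uses to recognize the file as UTF-8 so that non-ASCII characters display correctly. Here is a small sketch that makes the marker visible; save_rows, first_bytes, and the sample row are hypothetical.

```python
import csv
import os
import tempfile


def save_rows(path, header, rows, encoding="utf-8-sig"):
    """Write a header plus rows to a CSV file in the given encoding."""
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def first_bytes(path, n=3):
    """Return the first n raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read(n)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    save_rows(path, ["quote", "author"], [["“Sample quote.”", "Sample Author"]])
    bom = first_bytes(path)  # the byte order mark written by utf-8-sig
    with open(path, newline="", encoding="utf-8-sig") as f:
        rows_back = list(csv.reader(f))  # reading with utf-8-sig strips the mark again
```

When reading such a file back in Python, open it with utf-8-sig as well; plain utf-8 would leave the marker attached to the first header cell.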

Which one to use

Playwright being more powerful is no reason to use it everywhere — it launches an entire browser, which makes it slow and heavy.

                           requests + BeautifulSoup (part 3)     Playwright (this post)
Speed and resources        Fast and light (one HTML response)    Slow and heavy (runs a browser)
JavaScript rendering       No                                    Yes
Login, clicks, scrolling   Limited                               Yes

The test is simple too. Open the page source in your browser (Ctrl+U); if the data you want is there as ordinary HTML elements, not just inside a script block, the page is static and the part 3 approach is lighter and faster. If it's missing from the source but visible on screen, the page is dynamic and you need Playwright, as covered in this post.
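The same check can be run from code instead of by eye. The sketch below counts elements with a given class in the raw HTML. It uses the standard library's html.parser rather than BeautifulSoup only so that it runs without extra packages; ClassCounter and count_in_source are hypothetical names for this example.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_in_source(html, class_name):
    """Return how many elements with `class_name` exist in the raw HTML."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count
```

Pass it the text returned by requests.get(url).text. A count of 0 for an element you can see in the browser suggests the page is rendered by JavaScript; a positive count suggests the part 3 approach is enough.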

Wrap-up

What this post covered:

  • Why HTML comes back empty, and the decision rule: when JavaScript draws the data, requests gets only a skeleton — reach for Playwright only when the data isn’t in the page source
  • Playwright fundamentals: the two-step install, goto and screenshots with a headless browser
  • The core concept: replace fixed sleep waits with wait_for_selector, waiting until the element appears
  • Practical patterns: login automation (passwords in environment variables), infinite scroll, saving to CSV

In the next post (#5 Email and notifications), we cover delivering your scripts’ results to people — sending scraped data by email and firing notifications into messengers like Slack.
