Python Automation #3: Web Scraping Part 1 — Static Pages with httpx and BeautifulSoup

Saturday, May 16, 2026

There’s probably at least one page you open in a browser every day. A product price, an announcement board, a stock indicator — the content differs each time, but the checking motion is identical: open the page, look at the same spot, compare against yesterday. Repetition with a fixed routine like this is exactly what code can take over. In this post we cover static page scraping: fetching HTML with httpx and picking out just the parts we want with BeautifulSoup.

#1 First scripts
#2 Excel automation
#3 Web scraping Part 1: collecting static pages ← this post
#4 Web scraping Part 2: dynamic pages
#5 Email and notifications
#6 Scheduling
#7 Packaging as a CLI tool

HTML is the data #

Behind every screen a browser renders, there’s HTML. Price tables, notice lists, stock badges — all of it is ultimately tags and text. Scraping is the job of finding the tags you want in that HTML and pulling out the text.

So the first step isn’t code — it’s observation. Open the page you want to collect from, right-click the value you’re after, and choose “Inspect”; the developer tools’ Elements tab jumps to that tag. Two things to check:

Which tag the value lives in (<p>, <span>, <a>, etc.)
What distinguishes that tag from the others (class, id, parent structure)

Our practice target for this post is books.toscrape.com, a fictional bookstore published specifically for scraping practice, so you can experiment freely. Each book is one article.product_pod tag, with the title and price inside.

Setup: two packages #

Add httpx and beautifulsoup4 to the project.

add dependencies

uv add httpx beautifulsoup4

httpx is a library for sending HTTP requests. Its API closely mirrors that of the long-established requests library, but it applies a default timeout, supports async, and can speak HTTP/2 (with the optional httpx[http2] extra installed), so it’s a reasonable default if you’re starting fresh. beautifulsoup4 parses the fetched HTML and lets you navigate it tag by tag.

Requesting a page with httpx #

fetch.py

import httpx

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; book-scraper/1.0)"
}

resp = httpx.get(
    "https://books.toscrape.com/",
    headers=headers,
    timeout=10.0,
)
resp.raise_for_status()

print(resp.status_code)   # 200
print(resp.text[:300])    # start of the HTML

Three things worth noting:

User-Agent: the header that tells the server who’s asking. Left at the default, it goes out as python-httpx/0.x, which some sites block. It’s better to set a string that identifies you.
timeout: caps how long each network operation (connecting, reading, and so on) may stall before httpx gives up. httpx already defaults to 5 seconds, but stating the value explicitly makes the intent clear.
raise_for_status: raises httpx.HTTPStatusError for unsuccessful responses such as 404 or 500. It prevents the accident of unknowingly parsing an error page as if it were the data.

Picking things out with BeautifulSoup #

Hand the fetched HTML string to BeautifulSoup and you get an object you can navigate tag by tag. There are several ways to search it, but select with CSS selectors covers almost everything.

parse.py

from bs4 import BeautifulSoup

# resp is the httpx response obtained in fetch.py
soup = BeautifulSoup(resp.text, "html.parser")

books = soup.select("article.product_pod")   # one book = one article
print(len(books))   # 20

first = books[0]
title = first.select_one("h3 a")["title"]            # pull from an attribute
price = first.select_one("p.price_color").text       # pull as text
print(title, price)

select returns every matching tag as a list; select_one returns just the first match, or None if nothing matches. This handful of CSS selectors covers most day-to-day needs:

article.product_pod: an article tag with class product_pod
#content: the tag with id content
h3 a: an a tag anywhere inside an h3
ul > li: an li that is a direct child of a ul
a[href]: an a tag that has an href attribute

Translate the structure you saw in the developer tools into a selector, then check that the select count matches what you counted on screen. That loop is the real substance of scraping.
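
To try that loop without hitting any site, the selectors listed above can be run against a small inline snippet. This is a sketch under stated assumptions: SAMPLE_HTML is simplified stand-in markup written for this example, not the real source of books.toscrape.com.

```python
from bs4 import BeautifulSoup

# Simplified stand-in markup, not the real page source.
SAMPLE_HTML = """
<div id="content">
  <ul>
    <li>
      <article class="product_pod">
        <h3><a href="book-a.html" title="Book A">Book A</a></h3>
      </article>
    </li>
    <li>
      <article class="product_pod">
        <h3><a title="Book B">Book B</a></h3>
      </article>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

selectors = ["article.product_pod", "#content", "h3 a", "ul > li", "a[href]"]

# Count the matches for each selector so they can be compared with
# what you counted by eye in the developer tools.
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(f"{selector!r}: {count} match(es)")
```

Note that a[href] matches only one of the two links here, because the second a tag in the sample has no href attribute.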

Complete example: titles and prices to CSV #

Let’s split fetching, parsing, and saving into functions and put them in one file.

scrape.py

import csv

import httpx
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; book-scraper/1.0)"}


def fetch(url: str) -> str:
    resp = httpx.get(url, headers=HEADERS, timeout=10.0)
    resp.raise_for_status()
    return resp.text


def parse(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.select("article.product_pod"):
        rows.append({
            "title": book.select_one("h3 a")["title"],
            "price": book.select_one("p.price_color").text.lstrip("£"),
        })
    return rows


def save(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    rows = parse(fetch("https://books.toscrape.com/"))
    save(rows, "books.csv")
    print(f"saved {len(rows)} books")

run

uv run scrape.py
# saved 20 books

Open books.csv and the titles and prices are laid out as a table. Chain this with the Excel automation from post #2 of this series and you have a pipeline that runs from collection to a finished report in one pass.
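
The parse function above assumes every book has both a title and a price. On real pages that assumption sometimes breaks, so here is a hedged variant: parse_safe is a hypothetical name introduced for this sketch, SAMPLE_HTML is made-up markup, and converting the price to Decimal is one possible choice rather than something the script above requires.

```python
from decimal import Decimal, InvalidOperation

from bs4 import BeautifulSoup


def parse_safe(html: str) -> list[dict]:
    """Like parse(), but skips incomplete entries and returns numeric prices."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.select("article.product_pod"):
        link = book.select_one("h3 a")
        price_tag = book.select_one("p.price_color")
        # select_one returns None when nothing matches, so check before use.
        if link is None or price_tag is None or not link.get("title"):
            continue
        try:
            price = Decimal(price_tag.text.strip().lstrip("£"))
        except InvalidOperation:
            continue  # the price text was not a plain number
        rows.append({"title": link["title"], "price": price})
    return rows


# Made-up markup: the second book is missing its price tag.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a title="Book A">Book A</a></h3>
  <p class="price_color">£12.50</p>
</article>
<article class="product_pod">
  <h3><a title="Book B">Book B</a></h3>
</article>
"""

rows = parse_safe(SAMPLE_HTML)
print(rows)
```

Silently skipping rows hides markup changes, so in a script you rely on it may be worth counting or logging the skipped entries as well.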

Walking through pagination #

When a listing spans multiple pages, find the URL pattern and loop over it. On books.toscrape.com, pages continue in the form catalogue/page-2.html.

collecting multiple pages

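import time

# Assumes fetch(), parse(), and save() from scrape.py are in scope,
# e.g. via `from scrape import fetch, parse, save`.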

BASE = "https://books.toscrape.com/catalogue/page-{}.html"

all_rows = []
for page in range(1, 6):              # pages 1–5 for now
    html = fetch(BASE.format(page))
    all_rows.extend(parse(html))
    print(f"page {page} collected, {len(all_rows)} books total")
    time.sleep(1)                     # wait 1 second between requests

save(all_rows, "books_all.csv")

The key line is time.sleep(1). A human can’t flip through even one page per second, but code that never pauses fires off dozens of requests per second — from the server’s point of view, indistinguishable from an attack. Spacing out your requests isn’t optional; it’s table stakes.

The lines you don’t cross #

What’s technically possible and what’s acceptable are two different things. Before running scraping code, check the following:

robots.txt: append /robots.txt to the site’s address and you’ll see the site’s stated policy on automated collection. Don’t collect paths marked Disallow, and honor any Crawl-delay.
Terms of service: some services explicitly forbid automated collection. Check the terms before you start, especially for pages behind a login.
Request rate: high-volume requests with no spacing put real load on the other side’s server. Add sleep, and fetch only the pages you need.
Personal automation vs. running a service: a script that does your once-a-day check for you is a completely different matter from redistributing collected data or offering it as a service. The latter requires copyright and legal review and is beyond this series’ scope.
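
The robots.txt check in the list above can also be done in code with the standard library's urllib.robotparser. This is a minimal sketch: ROBOTS_TXT is an invented policy and the example.test URLs are placeholders, used so the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Invented policy for illustration; a real one comes from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "book-scraper"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "https://example.test/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.test/private/report.html")
delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay is given

print(allowed, blocked, delay)
```

Against a live site you would call parser.set_url with the robots.txt address and then parser.read() instead of parse, and use the reported delay (when there is one) as the floor for time.sleep.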

When the fetched HTML comes back empty #

Some pages defeat this approach: the response is 200, yet the data you want isn’t in the HTML.

a dynamic page's response

import httpx

resp = httpx.get("https://example-spa.com/products")  # placeholder URL
print(resp.text)
# only <div id="root"></div> — no product data

On pages like this, the server sends only a hollow HTML shell, and the data is filled in after the browser runs the JavaScript. httpx doesn’t execute JavaScript, so the shell is all it gets. If the value shows up in the developer tools’ Elements tab but not in resp.text, this is what’s happening. The fix is to drive an actual browser from code, and that’s the subject of the next post.
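
A quick way to tell the two cases apart is to test whether the selector you found in the developer tools matches anything in the raw response. The sketch below is illustrative only: has_data is a hypothetical helper, and both HTML strings are made up to represent a static page and an empty shell.

```python
from bs4 import BeautifulSoup


def has_data(html: str, selector: str) -> bool:
    """True if the raw HTML already contains a match for the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is not None


# Made-up responses: one with the data in the HTML, one hollow shell.
STATIC_HTML = (
    '<article class="product_pod">'
    '<h3><a title="Book A">Book A</a></h3></article>'
)
SHELL_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

for name, html in [("static", STATIC_HTML), ("shell", SHELL_HTML)]:
    if has_data(html, "article.product_pod"):
        print(f"{name}: data is in the HTML, httpx is enough")
    else:
        print(f"{name}: no match, the page is probably filled in by JavaScript")
```

A missing match can also mean the selector is simply wrong, so it is worth checking the selector against the Elements tab before concluding the page is dynamic.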

Wrap-up #

The flow this post built:

Observe the target value’s tag and class with the developer tools
Set User-Agent and timeout explicitly on httpx.get, and catch failed responses with raise_for_status
Extract just the tags you want with BeautifulSoup’s select and CSS selectors
Pull titles and prices into a CSV with one complete script
Collect multiple pages with a URL pattern loop plus time.sleep
Respect the lines: robots.txt, terms of service, request rate

In the next post (#4 Web scraping Part 2: dynamic pages), we handle pages where JavaScript fills in the data — launching a real browser with Playwright and pulling data from the fully rendered screen.