Python Automation #3: Web Scraping Part 1 — Static Pages with httpx and BeautifulSoup
There’s probably at least one page you open in a browser every day. A product price, an announcement board, a stock indicator — the content differs each time, but the checking motion is identical: open the page, look at the same spot, compare against yesterday. Repetition with a fixed routine like this is exactly what code can take over. In this post we cover static page scraping: fetching HTML with httpx and picking out just the parts we want with BeautifulSoup.
- #1 First scripts
- #2 Excel automation
- #3 Web scraping Part 1: collecting static pages ← this post
- #4 Web scraping Part 2: dynamic pages
- #5 Email and notifications
- #6 Scheduling
- #7 Packaging as a CLI tool
HTML is the data
Behind every screen a browser renders, there’s HTML. Price tables, notice lists, stock badges — all of it is ultimately tags and text. Scraping is the job of finding the tags you want in that HTML and pulling out the text.
So the first step isn’t code — it’s observation. Open the page you want to collect from, right-click the value you’re after, and choose “Inspect”; the developer tools’ Elements tab jumps to that tag. Two things to check:
- Which tag the value lives in (<p>, <span>, <a>, etc.)
- What distinguishes that tag from the others (class, id, parent structure)
Our practice target for this post is books.toscrape.com, a fictional bookstore published specifically for scraping practice, so you can experiment freely. Each book is one article.product_pod tag, with the title and price inside.
Setup: two packages
Add httpx and beautifulsoup4 to the project.

    uv add httpx beautifulsoup4

httpx is a library for sending HTTP requests. Its API is nearly identical to the long-established requests, but it applies a default timeout and supports async and, with the optional httpx[http2] extra, HTTP/2. If you’re starting fresh, httpx is the recommendation. beautifulsoup4 parses the fetched HTML and lets you navigate it tag by tag.
Requesting a page with httpx

    import httpx

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; book-scraper/1.0)"
    }
    resp = httpx.get(
        "https://books.toscrape.com/",
        headers=headers,
        timeout=10.0,
    )
    resp.raise_for_status()
    print(resp.status_code)  # 200
    print(resp.text[:300])   # start of the HTML

Three things worth noting:
- User-Agent: the header that tells the server who’s asking. Left at the default, it goes out as python-httpx/0.x, which some sites block. It’s better to set a string that identifies you.
- timeout: caps how long you’ll wait when there’s no response. httpx defaults to 5 seconds, but stating it explicitly makes the intent clear.
- raise_for_status: raises httpx.HTTPStatusError on 4xx and 5xx responses such as 404 or 500. It prevents the accident of parsing an error page as if it were data.
Picking things out with BeautifulSoup
Hand the fetched HTML string to BeautifulSoup and you get an object you can navigate tag by tag. There are several ways to search it, but select with CSS selectors covers almost everything.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.select("article.product_pod")  # one book = one article
    print(len(books))  # 20

    first = books[0]
    title = first.select_one("h3 a")["title"]       # pull from an attribute
    price = first.select_one("p.price_color").text  # pull as text
    print(title, price)

select returns every matching tag as a list; select_one returns just the first match, or None if nothing matches. This handful of CSS selectors covers most day-to-day needs:
- article.product_pod: an article tag with class product_pod
- #content: the tag with id content
- h3 a: an a tag anywhere inside an h3
- ul > li: an li that is a direct child of a ul
- a[href]: an a tag that has an href attribute
Translate the structure you saw in the developer tools into a selector, then check that the select count matches what you counted on screen. That loop is the real substance of scraping.
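The selector patterns from this section can be tried without any network access by feeding BeautifulSoup a small HTML string. The markup below is a hand-written, simplified imitation of a listing, not a copy of the real page, and the helper name text_or_default is hypothetical. The sketch also guards against select_one finding nothing, which would otherwise fail on the next attribute access.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the listing structure (an assumption,
# not the real page source). The second book deliberately has no price.
SAMPLE_HTML = """
<div id="content">
  <ul>
    <li><article class="product_pod">
      <h3><a href="book-1.html" title="First Book">First B...</a></h3>
      <p class="price_color">£10.00</p>
    </article></li>
    <li><article class="product_pod">
      <h3><a href="book-2.html" title="Second Book">Second ...</a></h3>
    </article></li>
  </ul>
</div>
"""


def text_or_default(tag, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or default if none."""
    found = tag.select_one(selector)
    return found.get_text(strip=True) if found is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = soup.select("article.product_pod")
titles = [book.select_one("h3 a")["title"] for book in books]
prices = [text_or_default(book, "p.price_color", default="n/a") for book in books]
direct_items = soup.select("#content ul > li")       # direct-child selector
links = [a["href"] for a in soup.select("a[href]")]  # attribute selector

print(titles)
print(prices)
```

Checking len(books) against the number of items you counted on screen is the same verification loop described above, just on a sample small enough to count by eye.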
Complete example: titles and prices to CSV
Let’s split fetching, parsing, and saving into functions and put them in one file.

    import csv

    import httpx
    from bs4 import BeautifulSoup

    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; book-scraper/1.0)"}


    def fetch(url: str) -> str:
        # Download one page; raises on timeouts and on 4xx/5xx responses
        resp = httpx.get(url, headers=HEADERS, timeout=10.0)
        resp.raise_for_status()
        return resp.text


    def parse(html: str) -> list[dict]:
        # One dict per book: title from the link's attribute, price as text
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        for book in soup.select("article.product_pod"):
            rows.append({
                "title": book.select_one("h3 a")["title"],
                "price": book.select_one("p.price_color").text.lstrip("£"),
            })
        return rows


    def save(rows: list[dict], path: str) -> None:
        # newline="" keeps the csv module from writing blank rows on Windows
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["title", "price"])
            writer.writeheader()
            writer.writerows(rows)


    if __name__ == "__main__":
        rows = parse(fetch("https://books.toscrape.com/"))
        save(rows, "books.csv")
        print(f"saved {len(rows)} books")

Save the file as scrape.py and run it:

    uv run scrape.py
    # saved 20 books

Open books.csv and the titles and prices are laid out as a table. Chain this with the Excel automation from post #2 and you have a pipeline that flows from collection to a finished report in one pass.
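The CSV keeps each price as text. If a later step needs to compare or sum prices, converting them to numbers is a natural follow-up. Here is a minimal sketch, assuming a price is a "£" sign followed by digits; parse_price is a hypothetical helper, the sample strings are made up, and Decimal is chosen over float so money values stay exact.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Turn a price string such as '£51.77' into a Decimal, or None if malformed."""
    cleaned = raw.strip().lstrip("£").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Anything that is not a plain number (e.g. "sold out") ends up here
        return None


# Made-up inputs covering the normal case, stray whitespace, and bad data
prices = [parse_price(p) for p in ["£51.77", " £1,053.74 ", "sold out"]]
print(prices)
```

Returning None for malformed input keeps one odd row from crashing a whole run; whether to skip or log such rows is a choice for the calling code.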
Walking through pagination
When a listing spans multiple pages, find the URL pattern and loop over it. On books.toscrape.com, pages continue in the form catalogue/page-2.html.

    import time

    # fetch, parse, and save are the functions defined in scrape.py
    BASE = "https://books.toscrape.com/catalogue/page-{}.html"

    all_rows = []
    for page in range(1, 6):  # pages 1–5 for now
        html = fetch(BASE.format(page))
        all_rows.extend(parse(html))
        print(f"page {page} collected, {len(all_rows)} books total")
        time.sleep(1)  # wait 1 second between requests

    save(all_rows, "books_all.csv")

The key line is time.sleep(1). A human can’t flip through even one page per second, but code that never pauses can fire off dozens of requests per second. From the server’s point of view, that is indistinguishable from an attack. Spacing out your requests isn’t optional; it’s table stakes.
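A fixed range(1, 6) works when you already know how many pages you want. Another common approach is to follow the page’s own "next" link until it disappears. The sketch below assumes that link can be found with the selector li.next a, which is an assumption about the markup that you should confirm in the developer tools first. The function next_page_url and the inline sample HTML are hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a[href]")  # assumed selector
    if link is None:
        return None
    # href values are often relative, so resolve them against the current URL
    return urljoin(current_url, link["href"])


# Hand-written samples standing in for real responses
MIDDLE_PAGE = '<ul class="pager"><li class="next"><a href="page-3.html">next</a></li></ul>'
LAST_PAGE = '<ul class="pager"><li class="current">Page 5 of 5</li></ul>'

current = "https://books.toscrape.com/catalogue/page-2.html"
following = next_page_url(MIDDLE_PAGE, current)
after_last = next_page_url(LAST_PAGE, current)
print(following)
print(after_last)
```

In a real loop you would fetch the returned URL, keep the time.sleep(1) between requests, and stop once the function returns None.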
The lines you don’t cross
What’s technically possible and what’s acceptable are two different things. Before running scraping code, check the following:
- robots.txt: append /robots.txt to the site’s address and you’ll see the site’s stated policy on automated collection. Don’t collect paths marked Disallow, and honor any Crawl-delay.
- Terms of service: some services explicitly forbid automated collection. This matters most for pages behind a login, so check the terms there before anything else.
- Request rate: high-volume requests with no spacing put real load on the other side’s server. Add sleep, and fetch only the pages you need.
- Personal automation vs. running a service: a script that does your once-a-day check for you is a completely different matter from redistributing collected data or offering it as a service. The latter requires copyright and legal review and is beyond this series’ scope.
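The robots.txt check can itself be automated with the standard library’s urllib.robotparser. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the example.test URLs, and the user agent name are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the demo; a real check would download
# the file from the target site instead.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "book-scraper"

rules = RobotFileParser()
rules.parse(SAMPLE_ROBOTS.splitlines())

allowed = rules.can_fetch(USER_AGENT, "https://example.test/catalogue/page-1.html")
blocked = rules.can_fetch(USER_AGENT, "https://example.test/private/report.html")
delay = rules.crawl_delay(USER_AGENT)  # None when no Crawl-delay is set

print(allowed, blocked, delay)
```

For a live site, calling set_url with the robots.txt address and then read() fetches and parses the file in one step. The returned delay, when present, is a sensible value to pass to time.sleep.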
When the fetched HTML comes back empty
Some pages defeat this approach: the response is 200, yet the data you want isn’t in the HTML.

    resp = httpx.get("https://example-spa.com/products")
    print(resp.text)
    # only <div id="root"></div>, no product data

On pages like this, the server sends only a hollow HTML shell, and the data is filled in after the browser runs the JavaScript. httpx doesn’t execute JavaScript, so the shell is all it gets. If you can see the value in the developer tools’ Elements tab but not in resp.text, this is what’s happening. The fix is to drive an actual browser from code, and that’s the subject of the next post.
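A quick way to notice this situation early is to check whether your selector matches anything in the raw HTML before parsing further. This is a rough heuristic sketch rather than a reliable detector: zero matches can also mean the selector is simply wrong. The function looks_js_rendered and both sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def looks_js_rendered(html: str, selector: str) -> bool:
    """Heuristic: True if the selector matches nothing in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector)) == 0


# Hand-written samples: an empty app shell and a server-rendered listing
SHELL_HTML = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
STATIC_HTML = '<html><body><article class="product_pod"><h3>Book</h3></article></body></html>'

shell_result = looks_js_rendered(SHELL_HTML, "article.product_pod")
static_result = looks_js_rendered(STATIC_HTML, "article.product_pod")
print(shell_result, static_result)
```

If the check fires, compare resp.text with what the Elements tab shows before deciding whether the page needs a real browser.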
Wrap-up
The flow this post built:
- Observing the target value’s tag and class with the developer tools
- Setting User-Agent and timeout explicitly on httpx.get and blocking failures with raise_for_status
- Extracting just the tags you want with BeautifulSoup’s select and CSS selectors
- A complete script that pulls titles and prices into a CSV
- Multi-page collection with a URL pattern loop plus time.sleep
- The lines to respect: robots.txt, terms of service, request rate
In the next post (#4 Web scraping Part 2: dynamic pages), we handle pages where JavaScript fills in the data — launching a real browser with Playwright and pulling data from the fully rendered screen.