Python Automation #2: Excel Automation — Ending Repetitive Reports with openpyxl

6 min read

Every Friday, five branch offices send in Excel files built on the same template. Opening each one, copying the data, pasting it into the head-office template, and fixing the formatting again takes an hour. The work is identical every week, and doing it by hand invites mistakes — a skipped row here, the wrong sheet copied there. In this post we automate that entire merge job with the openpyxl library. The pathlib skills from #1 First Scripts come right back into play.

  • #1 First scripts
  • #2 Excel automation ← this post
  • #3 Web scraping (static pages)
  • #4 Web scraping (dynamic pages)
  • #5 Email and notifications
  • #6 Scheduling
  • #7 Packaging as a CLI tool

Installing openpyxl #

openpyxl is a pure-Python library for reading and writing .xlsx files. It works without Excel installed, so it runs just as well on a server. Installation is one line — with pip, the equivalent is pip install openpyxl.

install
uv add openpyxl

Reading a file: load_workbook #

Let’s open a single file first and look at the structure. An Excel file is a three-level hierarchy: a Workbook contains Worksheets, and a Worksheet contains Cells.

read_one.py
from openpyxl import load_workbook

wb = load_workbook("branch_a.xlsx")

print(wb.sheetnames)        # ['July W1', 'Sheet2']
ws = wb["July W1"]          # pick a sheet by name
ws = wb.active              # or the active sheet

print(ws["B2"].value)                  # access a cell by coordinate
print(ws.cell(row=2, column=2).value)  # by row/column number (same cell)

# Iterate row by row. min_row=2: skip the header (row 1)
for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)  # ('Seoul', 'Keyboard', 12, 348000)

Coordinate strings are easy for humans to read, while row/column numbers are handy inside loops. To walk an entire table row by row, use iter_rows. Passing values_only=True yields tuples of values instead of cell objects, which keeps read-only code clean.

Complete example: merging per-branch Excel files into one sheet #

Now for the real thing. Assume a reports/ folder holds per-branch files like branch_a.xlsx and branch_b.xlsx, and the first sheet of every file follows the same template (branch, item, quantity, sales). We grab the file list with pathlib’s glob from part 1, then collect each file’s data rows into one new workbook.

merge_reports.py
from pathlib import Path

from openpyxl import Workbook, load_workbook

REPORT_DIR = Path("reports")
OUT_FILE = Path("merged.xlsx")

merged = Workbook()
out = merged.active
out.title = "Merged"
out.append(["Branch", "Item", "Qty", "Sales"])  # header only once

for file in sorted(REPORT_DIR.glob("*.xlsx")):
    wb = load_workbook(file, data_only=True)
    ws = wb.active
    for row in ws.iter_rows(min_row=2, values_only=True):
        if row[0] is None:  # skip empty rows
            continue
        out.append(row)
    print(f"{file.name}: merged {ws.max_row - 1} rows")

merged.save(OUT_FILE)
print(f"Done: {OUT_FILE}")

Three things carry the whole script. glob("*.xlsx") grabs every target file in the folder, iter_rows(min_row=2) skips each file’s header, and out.append(row) tacks rows onto the bottom of the new sheet. Whether there are 5 files or 50, the code stays the same. The sorted() wrapper is there so the file order doesn’t shuffle between runs.

Writing and styling: making it look like a report #

Merging alone produces a bare table. Let’s bold the header, size the columns to fit the content, and add thousands separators to the sales column so it looks like a proper report.

applying formatting
from openpyxl.styles import Font

# 1. Bold header
for cell in out[1]:  # out[1] is the entire first row
    cell.font = Font(bold=True)

# 2. Column widths (roughly in characters)
widths = {"A": 12, "B": 18, "C": 8, "D": 14}
for col, width in widths.items():
    out.column_dimensions[col].width = width

# 3. Number format for sales (column D): 1234567 -> 1,234,567
for row in out.iter_rows(min_row=2, min_col=4, max_col=4):
    for cell in row:
        cell.number_format = "#,##0"

# 4. Keep the header (row 1) visible while scrolling
out.freeze_panes = "A2"  # freezes everything above A2
merged.save(OUT_FILE)

number_format takes Excel’s own cell format codes verbatim. For dates use "yyyy-mm-dd", for percentages "0.0%" — whatever code you’d use inside Excel works here unchanged.

Handling formulas: openpyxl does not calculate #

This is the pitfall people hit most often. openpyxl is not a formula engine. When cell D10 contains =SUM(D2:D9), what you see depends on how you open the file.

the data_only difference
wb = load_workbook("branch_a.xlsx")
print(wb.active["D10"].value)   # '=SUM(D2:D9)' (the formula string)
wb = load_workbook("branch_a.xlsx", data_only=True)
print(wb.active["D10"].value)   # 858000 (the value computed at last save)

The value data_only=True returns isn’t something openpyxl calculated — it’s the value Excel cached inside the file the last time it was saved. That leads to two caveats:

  • A file created purely in Python and never opened in Excel has no cache, so you get None.
  • If you open a file with formulas in openpyxl and save it back, only the values from the mode you opened it in (formula strings or cached values) survive. The safe habit is to never overwrite the original — write results to a new file.

This is exactly why the merge example above passes data_only=True: even if a branch file’s sales column is formulas, we want numbers, not formula strings. If you need totals, either sum them in Python, or write a =SUM(...) formula into the result sheet and let it calculate when the recipient opens the file in Excel.

Where CSV and pandas fit in #

If your data doesn’t need formatting, the standard library’s csv module is often enough instead of .xlsx. Conversely, if the goal after merging is analysis — group aggregations, pivots — then pandas is the right tool. pandas gets its own treatment in a data analysis series. The focus of this series is automation that creates and styles Excel files themselves, and for that job openpyxl is the most direct fit.

Two common pitfalls #

Merged cells: in a merged range, the value lives only in the top-left cell; the rest read as None. If branch files contain merged cells, the merged output ends up with holes, so you’ll need to inspect the merge ranges via ws.merged_cells.ranges and propagate the top-left value into the remaining rows. On the writing side, you can’t put values into any cell of a merged range other than the top-left one.

Large-file performance: the default mode loads the entire file into memory. For files beyond hundreds of thousands of rows, switch to streaming — load_workbook(file, read_only=True) for reading and Workbook(write_only=True) for writing. The trade-off: read_only mode restricts random cell access, and write_only mode only lets you add rows via append.

Wrap-up #

The flow this post built:

  • Reading Excel row by row with load_workbook + iter_rows(values_only=True)
  • Merging a folder of per-branch files into one sheet with pathlib’s glob + append
  • Report formatting with Font(bold=True), column_dimensions, number_format, and freeze_panes
  • The formula pitfalls (openpyxl doesn’t calculate; data_only=True returns cached values), None from merged cells, and read_only/write_only modes for large files

A merge that used to eat an hour every week now ends with one script run. In the next post (#3 Web scraping (static pages)), we handle the case where the data lives not in Excel files but on web pages — pulling the tables and lists you want out of HTML with requests-style clients and BeautifulSoup.

X