Django Advanced #3: Query optimization — N+1, select_related, EXPLAIN, indexes

9 min read

Building on the queries we sharpened in Intermediate #2 ORM advanced, we now layer proper optimization. In production, 99% of performance problems come from ORM usage patterns. Not CPU, not memory — the number and shape of queries.

Today’s toolbox:

  • N+1 diagnosis and fixes (select_related, prefetch_related, Prefetch)
  • Field selection (only, defer, values, values_list)
  • Measurement (debug-toolbar, connection.queries, .explain())
  • Indexes (Meta.indexes, db_index, multi-column)
  • Bulk operations (bulk_create, bulk_update, update_or_create, iterator)
  • Concurrency (select_for_update)
  • Escape hatch (raw SQL)

N+1 — the most common trap #

Reproducing it #

models.py
class Author(models.Model):
    name = models.CharField(max_length=100)

class Post(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE, related_name="posts")
🚫 N+1 happens
posts = Post.objects.all()[:50]
for post in posts:
    print(post.title, post.author.name)   # ← extra query each time

Query trace:

SELECT * FROM post LIMIT 50;                      -- 1 query
SELECT * FROM author WHERE id = ?;                -- 50 queries
-- 51 total

A 50-post page fires 51 queries. With 1000 posts, 1001 queries. That’s N+1.

select_related — forward FK / OneToOne #

Solved in a single JOIN.

✅ select_related
posts = Post.objects.select_related("author").all()[:50]
for post in posts:
    print(post.title, post.author.name)

Query:

SELECT post.*, author.*
FROM post LEFT OUTER JOIN author ON post.author_id = author.id
LIMIT 50;
-- 1 query

You can chain it.

Multi-step
Comment.objects.select_related("post__author__profile").all()

Even comment.post.author.profile is resolved in one query. But more JOINs widen the row, so be careful past 3-4 hops.

select_related works for:

  • ForeignKey (the side that owns the FK)
  • OneToOneField (both sides)

It doesn’t work for:

  • Reverse ForeignKey (author.posts)
  • ManyToMany

Those two are prefetch_related’s territory.

prefetch_related — reverse FK / M2M #

Separate queries + Python-side join.

✅ prefetch_related
authors = Author.objects.prefetch_related("posts").all()[:20]
for author in authors:
    print(author.name, [p.title for p in author.posts.all()])

Query trace:

SELECT * FROM author LIMIT 20;                                    -- 1 query
SELECT * FROM post WHERE author_id IN (1, 2, ..., 20);            -- 1 query
-- 2 total

Django gathers IDs from the first query and fires the second with an IN clause, then maps in Python. The reverse collection is filled with no N+1.

select_relatedprefetch_related
MechanismSQL JOINSeparate query + Python matching
Query count+0+1 (per relation)
FitsForward FK, 1:1Reverse FK, M2M
Result widthWider rowsSame

Chaining works too.

Mixing both
Comment.objects.select_related("author").prefetch_related("post__tags")

Prefetch object — customize #

When you want extra conditions or ordering on the prefetched query.

Prefetch
from django.db.models import Prefetch

authors = Author.objects.prefetch_related(
    Prefetch(
        "posts",
        queryset=Post.objects.filter(published=True).order_by("-created_at"),
        to_attr="recent_posts",   # access via author.recent_posts
    )
).all()

for author in authors:
    for post in author.recent_posts:
        ...

to_attr lets you fill a separate attribute, decoupled from the original manager. author.posts.all() still returns all posts; author.recent_posts returns the filtered ones.

Field selection — only, defer, values, values_list #

If a model has 30 columns and you only use 5, you’re wasting 70% of the data.

only — load only the listed fields immediately #

only
posts = Post.objects.only("id", "title").all()
for p in posts:
    print(p.title)        # OK; accessing a non-loaded field triggers an extra query
    print(p.body)         # ← extra query (deferred field)

Big effect when you have a heavy text column like body that’s not used on a list page.

defer — defer only the listed fields #

defer — opposite of only
posts = Post.objects.defer("body", "raw_html").all()

values, values_list — skip ORM object construction #

Skip even the cost of instantiating ORM objects, returning dicts/tuples.

values / values_list
# list of dicts
Post.objects.values("id", "title")
# [{"id": 1, "title": "..."}, ...]

# list of tuples
Post.objects.values_list("id", "title")
# [(1, "..."), ...]

# single column flattened
ids = list(Post.objects.values_list("id", flat=True))
# [1, 2, 3, ...]

Decisive when you only need IDs, when serializing straight to JSON, or for CSV export.

Practical — pulling IDs for an IN clause
recent_ids = Post.objects.filter(
    created_at__gte=cutoff
).values_list("id", flat=True)

Comment.objects.filter(post_id__in=recent_ids).delete()

Django automatically pushes recent_ids down as a subquery and merges into a single SQL.

Measurement #

No optimization without measurement. Django comes with the right tooling.

django-debug-toolbar #

The standard for development environments.

Install
pip install django-debug-toolbar
settings.py
INSTALLED_APPS = [..., "debug_toolbar"]
MIDDLEWARE = [..., "debug_toolbar.middleware.DebugToolbarMiddleware"]
INTERNAL_IPS = ["127.0.0.1"]
urls.py
from django.conf import settings
from django.urls import include, path

urlpatterns = [...]

if settings.DEBUG:
    urlpatterns += [path("__debug__/", include("debug_toolbar.urls"))]

The right-hand panel shows query count, each query, EXPLAIN, and timings. If the query count scales with page item count, suspect N+1.

connection.queries #

When you want to inspect from inside the code.

Only when DEBUG=True
from django.db import connection, reset_queries

reset_queries()
posts = list(Post.objects.select_related("author")[:50])
for p in posts:
    _ = p.author.name
print(len(connection.queries), "queries")
for q in connection.queries:
    print(q["time"], q["sql"][:100])

It’s only populated when DEBUG=True or CONN_MAX_AGE is active. In tests, assertNumQueries is more standard.

Lock it in with a test
from django.test import TestCase

class PostListQueryCountTest(TestCase):
    def test_no_n_plus_one(self):
        for i in range(10):
            author = Author.objects.create(name=f"A{i}")
            Post.objects.create(title=f"P{i}", author=author)

        with self.assertNumQueries(2):   # author + posts
            list(self.client.get("/posts/").context["posts"])

A regression test on the query count is the most effective protection against N+1 reappearing.

.explain() #

See the actual execution plan the DB will use.

explain
qs = Post.objects.filter(published=True, author__name__startswith="K")
print(qs.explain())

PostgreSQL output (excerpt):

Hash Join  (cost=15.50..120.30 rows=200 width=64)
  Hash Cond: (post.author_id = author.id)
  ->  Seq Scan on post  (cost=0.00..80.00 rows=2000 width=40)
        Filter: published
  ->  Hash  (cost=12.00..12.00 rows=200 width=24)
        ->  Seq Scan on author  (cost=0.00..12.00 rows=200 width=24)
              Filter: ((name)::text ~~ 'K%'::text)

Seeing Seq Scan on a big table is a signal of a missing index. After adding an index, re-explain to confirm Index Scan.

PostgreSQL-only options:

ANALYZE too
qs.explain(analyze=True, verbose=True, buffers=True)

Actually runs the query to show row counts, time, and IO. Only meaningful on data similar to production.

Indexes #

db_index=True — single column #

Per column
class Post(models.Model):
    slug = models.SlugField(max_length=200, unique=True)   # unique → auto index
    status = models.CharField(max_length=20, db_index=True)
    created_at = models.DateTimeField(db_index=True)

unique=True, primary_key=True, and ForeignKey get indexes automatically.

Meta.indexes — multi-column, function, partial indexes #

Meta.indexes
from django.db.models import Index, F, Q

class Post(models.Model):
    status = models.CharField(max_length=20)
    created_at = models.DateTimeField()
    author = models.ForeignKey("Author", on_delete=models.CASCADE)
    title = models.CharField(max_length=200)

    class Meta:
        indexes = [
            # multi-column (status, created_at) — order matters
            Index(fields=["status", "-created_at"], name="post_status_recent_idx"),

            # function index (PostgreSQL)
            Index(F("title").lower(), name="post_title_lower_idx"),

            # partial index (conditional)
            Index(
                fields=["created_at"],
                condition=Q(status="published"),
                name="post_published_recent_idx",
            ),
        ]

Rules:

  • For multi-column indexes, order columns to match the WHERE clause for best effect
  • Adding indexes increases write cost — don’t slap an index on every column
  • A partial index indexes only frequently queried subsets — small and fast

Where indexes don’t help #

Wrapping a column in a function disables the regular index.

🚫 No index can help
WHERE LOWER(title) = 'hello'

Fix: create a function index (Index(F("title").lower(), ...)), or normalize on the data side (store lowercase).

LIKE 'hello%' can use the index, but LIKE '%hello' can’t (leading wildcard). For full text, use PostgreSQL’s SearchVector / GIN index.

Bulk operations #

bulk_create — INSERT in one shot #

🚫 1000 INSERTs
for data in items:
    Post.objects.create(**data)
✅ 1 INSERT
posts = [Post(**data) for data in items]
Post.objects.bulk_create(posts, batch_size=500)

Hundreds of times faster. Caveats:

  • Auto PK fill is only guaranteed on PostgreSQL (other DBs may leave None)
  • pre_save / post_save signals are not fired (#5)
  • M2M needs separate handling

bulk_update — UPDATE in one shot #

bulk_update
posts = list(Post.objects.filter(status="draft"))
for p in posts:
    p.status = "published"
Post.objects.bulk_update(posts, ["status"], batch_size=500)

You list the fields to update. Signals from save() don’t fire.

update_or_create — idempotent upsert #

upsert pattern
DailyStat.objects.update_or_create(
    date=target,
    defaults={"order_count": n, "total_amount": total},
)

The same one we saw in #2’s stats command. Safe to re-run.

PostgreSQL’s true INSERT … ON CONFLICT is supported via bulk_create(..., update_conflicts=True) since 5.x:

bulk_create's upsert
Post.objects.bulk_create(
    posts,
    update_conflicts=True,
    update_fields=["title", "body"],
    unique_fields=["slug"],
)

iterator — save memory #

🚫 Million rows in memory
for post in Post.objects.all():
    process(post)
✅ chunked
for post in Post.objects.iterator(chunk_size=2000):
    process(post)

iterator doesn’t cache results and streams via a DB cursor. The standard for iterating big tables.

Concurrency — select_for_update #

The scenario where multiple workers update the same row at once — balance deductions, stock decrements, and so on.

🚫 race condition
def withdraw(account_id, amount):
    account = Account.objects.get(pk=account_id)
    if account.balance < amount:
        raise ValueError("insufficient")
    account.balance -= amount
    account.save()

If two requests arrive simultaneously, both pass the balance check at get time, both decrement — and the balance goes negative.

✅ select_for_update + atomic
from django.db import transaction

def withdraw(account_id, amount):
    with transaction.atomic():
        account = Account.objects.select_for_update().get(pk=account_id)
        if account.balance < amount:
            raise ValueError("insufficient")
        account.balance -= amount
        account.save()

A row-level lock is held until the transaction ends. The second request waits until the first commits/rolls back.

nowait, skip_locked #

Skip to another row instead of waiting
Job.objects.select_for_update(skip_locked=True).filter(status="pending").first()

Common in job queue patterns — skip rows another worker has locked and grab the next job.

Fail fast instead of waiting
account = Account.objects.select_for_update(nowait=True).get(pk=id)

Avoiding deadlocks #

A deadlock is when two transactions wait on each other’s locks. Avoidance rules:

  • Always acquire locks in the same order (e.g., always smaller ID first)
  • Keep transactions short — don’t put external API calls inside a transaction
  • Lock only the rows you need — keep select_for_update().filter(...) narrow

Raw SQL — escape hatch #

For cases where expressing something through the ORM is too complex or you need DB-specific features.

Manager.raw
posts = Post.objects.raw(
    "SELECT * FROM post WHERE to_tsvector('simple', title) @@ to_tsquery('simple', %s)",
    ["python | django"],
)
for p in posts:
    print(p.title)

raw maps results back into ORM objects, so model methods and properties stay alive.

A lower layer:

connection.cursor
from django.db import connection

with connection.cursor() as cur:
    cur.execute(
        "UPDATE post SET status = %s WHERE created_at < %s RETURNING id",
        ["archived", cutoff],
    )
    archived_ids = [row[0] for row in cur.fetchall()]

Decoupled from the ORM. Always use parameter binding (%s) to prevent SQL injection — never build SQL with Python f-strings.

Common pitfalls #

1) len(qs) vs qs.count() #

🚫 fetch all just to count
n = len(Post.objects.all())
✅ COUNT(*)
n = Post.objects.count()

len loads the entire result into memory and counts. If you only need the count, use count().

That said, if the queryset is already cached, len works without an extra query — it depends on the situation.

2) exists() vs count() > 0 #

🚫 count everything
if Post.objects.filter(slug=s).count() > 0: ...
✅ check one row
if Post.objects.filter(slug=s).exists(): ...

exists ends with LIMIT 1.

3) NULL check with .first() #

✅ clean
post = Post.objects.filter(slug=s).first()
if post is None:
    return Http404

Shorter than get + try/except DoesNotExist, and intent is clearer.

4) M2M queries #

Without prefetch_related, running for tag in post.tags.all() over N posts gives you N+1.

5) count() is not cached #

Even within the same transaction, each count() call is a fresh SQL. If you need the same count multiple times, store it in a variable.

Wrap-up #

What you covered this time:

  • N+1 — forward use select_related (JOIN); reverse/M2M use prefetch_related (separate query)
  • Prefetch(queryset=..., to_attr=...) to customize the prefetched result
  • only, defer for column selection; values/values_list(flat=True) to avoid ORM instantiation
  • Measurement: debug-toolbar, assertNumQueries, connection.queries, .explain()
  • Indexes: db_index, Meta.indexes (multi-column / function / partial); column order matters in multi-column
  • Bulk: bulk_create, bulk_update, update_or_create, iterator(chunk_size=...)
  • Concurrency: select_for_update (+ atomic), skip_locked (job queue), avoid deadlocks
  • Raw escape hatch: Manager.raw, connection.cursor — parameter binding required
  • Pitfalls: len vs count, count > 0 vs exists, M2M, count not cached

In the next post (#4 Caching) we layer caching on top. Per-view, template fragment, low-level, Redis backend, invalidation strategies, stampede — another axis for cutting queries.

X