Breakups and Reconciliations: Solving N+1 Across Database Boundaries
In a single-database setup, when a Django model references another table, we avoid the flood of per-row queries (the N+1 problem) with prefetch_related() for many-to-many relationships and select_related() for one-to-one or many-to-one relationships. But what happens
when your data lives in two different databases?
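For reference, the single-database version of the fix is a one-liner. A minimal sketch, assuming a hypothetical Order model with a customer foreign key and a tags many-to-many:

# One JOIN instead of one query per order
orders = Order.objects.select_related('customer').all()[:100]

# Two queries total for the many-to-many side, regardless of result size
orders = Order.objects.prefetch_related('tags').all()[:100]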
Imagine you’re building a monitoring system. Your time-series events live in a specialized database optimized for write-heavy workloads (think TimescaleDB, InfluxDB, or even a separate PostgreSQL instance). Meanwhile, your reference data—users, devices, locations—lives in your primary application database, and you need to display those events with human-readable context.
For example, you have two databases:
Database A (Time-Series DB): Stores order events
from django.db import models


# Using Database A (router sends this to timeseries DB)
class OrderEvent(models.Model):
    timestamp = models.DateTimeField(primary_key=True)
    customer_id = models.IntegerField()  # Just an ID, not a foreign key!
    event_type = models.CharField(max_length=50)
    amount = models.DecimalField(max_digits=10, decimal_places=2)

    class Meta:
        db_table = 'order_events'
        managed = False
Database B (Application DB): Stores customer details
# Using Database B (your default database)
class Customer(models.Model):
    name = models.CharField(max_length=200)
    email = models.EmailField()
    tier = models.CharField(max_length=20)
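For completeness, the "router sends this to timeseries DB" comment above assumes a database router along these lines. This is a minimal sketch; the app label and alias names are placeholders, not something defined elsewhere in this article:

# settings.py would register it: DATABASE_ROUTERS = ['path.to.TimeseriesRouter']
class TimeseriesRouter:
    """Send models in the (hypothetical) 'events' app to the time-series database."""

    route_app_labels = {'events'}

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return 'timeseries_db'
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return 'timeseries_db'
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label in self.route_app_labels:
            return db == 'timeseries_db'
        return None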
Now you need to build an API that shows recent order events with customer names.
The Naive Approach
So you accept that you need to do the lookup in application code. Here’s the straightforward approach:
from rest_framework import serializers


class OrderEventSerializer(serializers.ModelSerializer):
    customer_name = serializers.SerializerMethodField()
    customer_tier = serializers.SerializerMethodField()

    class Meta:
        model = OrderEvent
        fields = ['timestamp', 'customer_id', 'customer_name',
                  'customer_tier', 'event_type', 'amount']

    def get_customer_name(self, obj):
        try:
            customer = Customer.objects.using('app_db').get(id=obj.customer_id)
            return customer.name
        except Customer.DoesNotExist:
            return 'Unknown Customer'

    def get_customer_tier(self, obj):
        try:
            customer = Customer.objects.using('app_db').get(id=obj.customer_id)
            return customer.tier
        except Customer.DoesNotExist:
            return None
This works, but look at what happens when you serialize 100 order events:
# In your viewset
queryset = OrderEvent.objects.using('timeseries_db').all()[:100]
serializer = OrderEventSerializer(queryset, many=True)
data = serializer.data
Query count:
- 1 query to fetch 100 events from Database A
- 200 queries to Database B (2 per event: one for name, one for tier)
- Total: 201 queries
This is the cross-database N+1 problem, and it’s worse than the traditional version:
- Different connection pools: Each query hits a different database connection
- Network latency: Potentially different servers, different network paths
- No foreign keys: The database can't enforce referential integrity
- No ORM magic: select_related() and prefetch_related() only work within a single database
Django’s ORM query optimization tools assume a single database. When you break that assumption, you’re on your own. Let’s fix this.
Expanding The Serializer Context
The information required to construct the serialized output lives in the serializer context, and we can override the method that builds it so it also carries the fields that come from the other database.
Step 1: Override get_serializer_context() in Your Viewset
from rest_framework.viewsets import ReadOnlyModelViewSet


class OrderEventViewSet(ReadOnlyModelViewSet):
    queryset = OrderEvent.objects.using('timeseries_db').all()
    serializer_class = OrderEventSerializer

    def get_serializer_context(self):
        """
        Pre-fetch all customer data and pass it via context.
        This prevents N+1 queries to the application database.
        """
        context = super().get_serializer_context()

        # Single query to Database B - fetch all customers we might need
        customers = Customer.objects.using('app_db').all()

        # Build a lookup dictionary: customer_id -> customer data
        customer_context = {
            customer.id: {
                'name': customer.name,
                'email': customer.email,
                'tier': customer.tier,
            }
            for customer in customers
        }

        context['customer_context'] = customer_context
        return context
Step 2: Use the Added Context in Your Serializer
class OrderEventSerializer(serializers.ModelSerializer):
    customer_name = serializers.SerializerMethodField()
    customer_tier = serializers.SerializerMethodField()

    class Meta:
        model = OrderEvent
        fields = ['timestamp', 'customer_id', 'customer_name',
                  'customer_tier', 'event_type', 'amount']

    def get_customer_name(self, obj):
        customer_context = self.context.get('customer_context', {})
        customer_data = customer_context.get(obj.customer_id)
        return customer_data['name'] if customer_data else 'Unknown Customer'

    def get_customer_tier(self, obj):
        customer_context = self.context.get('customer_context', {})
        customer_data = customer_context.get(obj.customer_id)
        return customer_data['tier'] if customer_data else None
The Result
Query count:
- 1 query to fetch 100 events from Database A
- 1 query to fetch all customers from Database B
- Total: 2 queries
Performance improvement: From 201 queries to 2 queries. That’s a 99% reduction.
The lookup is now O(1) dictionary access instead of O(n) database queries. Each serialization touches the view context instead of hitting the database.
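To keep it that way, a query-count test helps. A sketch, assuming both aliases have working test databases and a hypothetical /api/order-events/ route (the unmanaged order_events table needs special handling in tests):

from django.test import TestCase
from rest_framework.test import APIClient


class OrderEventQueryCountTest(TestCase):
    databases = {'default', 'app_db', 'timeseries_db'}

    def test_listing_events_stays_at_one_query_per_database(self):
        client = APIClient()
        # One query per database, regardless of how many events exist.
        # With pagination enabled, expect one extra COUNT query on the time-series DB.
        with self.assertNumQueries(1, using='timeseries_db'):
            with self.assertNumQueries(1, using='app_db'):
                response = client.get('/api/order-events/')
        self.assertEqual(response.status_code, 200)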
Key Design Decisions
When implementing this pattern, you need to make several design choices:
1. Context Key Structure
The simplest approach is using the foreign key value directly:
customer_context = {customer.id: customer_data for customer in customers}
But sometimes you need composite keys. Choose a key structure that matches how your events reference the data.
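For example, if events identified customers by an external ID plus a region (hypothetical fields, not part of the models above), a tuple key keeps the lookup unambiguous:

# Hypothetical composite key: events reference (external_id, region), not a PK
customer_context = {
    (customer.external_id, customer.region): {
        'name': customer.name,
        'tier': customer.tier,
    }
    for customer in customers
}

# In the serializer, build the same tuple from the event before the lookup
customer_data = customer_context.get((obj.customer_external_id, obj.region))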
2. Pre-filtering the Reference Data
Don’t store everything if you don’t need to:
# Bad: Cache ALL customers (could be millions)
customers = Customer.objects.using('app_db').all()
# Better: Cache only active customers
customers = Customer.objects.using('app_db').filter(is_active=True)
# Even better: Cache only customers referenced by the current events
# (list() forces evaluation, since the IDs come from a different database)
event_customer_ids = list(queryset.values_list('customer_id', flat=True).distinct())
customers = Customer.objects.using('app_db').filter(id__in=event_customer_ids)
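Wiring that last option into the viewset might look like the following sketch. It restricts the lookup to customers referenced by the filtered events, though it still scans all matching events rather than only the current page:

def get_serializer_context(self):
    context = super().get_serializer_context()

    # Only load customers referenced by events this request can return.
    # list() forces evaluation because the IDs come from another database.
    event_customer_ids = list(
        self.filter_queryset(self.get_queryset())
        .values_list('customer_id', flat=True)
        .distinct()
    )
    customers = Customer.objects.using('app_db').filter(id__in=event_customer_ids)

    context['customer_context'] = {
        c.id: {'name': c.name, 'tier': c.tier} for c in customers
    }
    return context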
3. What Data to Add to the Context
Add only the fields you need:
customer_context = {
    customer.id: {
        'name': customer.name,
        'tier': customer.tier,
    }
    for customer in customers
}
Some may say that memory is cheap, but it is better to be explicit about what is needed and to update the serializer when new fields are required.
4. Graceful Degradation
Always handle missing keys:
def get_customer_name(self, obj):
    customer_context = self.context.get('customer_context', {})
    customer_data = customer_context.get(obj.customer_id)
    if customer_data:
        return customer_data['name']
    # Graceful fallback
    return f'Unknown Customer (ID: {obj.customer_id})'
Don’t let a missing reference break your entire API response.
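It can also help to log the miss, so stale references show up in monitoring instead of degrading silently. A small sketch using the standard logging module:

import logging

logger = logging.getLogger(__name__)


def get_customer_name(self, obj):
    customer_data = self.context.get('customer_context', {}).get(obj.customer_id)
    if customer_data:
        return customer_data['name']
    # Surface data drift without breaking the response
    logger.warning("Customer %s referenced by an event but not found", obj.customer_id)
    return f'Unknown Customer (ID: {obj.customer_id})'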
5. Add a Cache Layer if needed
Instead of rebuilding the lookup from the application database on every request, we can store it in Redis (or any other cache backend) and load it when building the context. The pattern stays the same; only the source of the data changes.
from django.core.cache import cache


def get_serializer_context(self):
    context = super().get_serializer_context()

    customer_cache = cache.get('customer_cache_v1')
    if not customer_cache:
        customers = Customer.objects.using('app_db').all()
        customer_cache = {c.id: {...} for c in customers}
        cache.set('customer_cache_v1', customer_cache, timeout=300)

    context['customer_context'] = customer_cache
    return context
Now you’re trading freshness for performance. Choose based on:
- How often reference data changes (users rarely, prices frequently)
- How stale you can tolerate (5 minutes? 1 hour?)
- Request volume (100 req/sec makes caching compelling)
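If staleness is the main worry, you can also evict the cached lookup whenever a customer changes. A sketch using Django signals and the cache key from the snippet above:

from django.core.cache import cache
from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver


@receiver(post_save, sender=Customer)
@receiver(post_delete, sender=Customer)
def invalidate_customer_cache(sender, **kwargs):
    # The next request rebuilds the lookup from the application database
    cache.delete('customer_cache_v1')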
Nothing is Free
Let’s examine the trade-offs carefully.
Memory vs. Network I/O
The Trade: You’re loading all reference data into memory for each request.
When it works:
- Reference dataset is small (hundreds to low thousands of records)
- Records are lightweight (few fields, no large text/binary data)
- Request frequency is high (many requests per second)
When it breaks:
- 100,000+ customers → hundreds of megabytes (or more) of memory per request
- Large fields (e.g., full customer profiles with images)
- Low request frequency (memory waste for infrequent requests)
Rule of thumb: If your reference data in JSON form is < 10MB, you’re probably fine. Beyond that, consider alternatives.
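If you are unsure which side of that line you are on, a rough way to check is to measure the JSON size of the lookup dictionary. This helper is a quick sketch, a proxy rather than an exact accounting of Python memory usage:

import json


def approx_context_size_mb(customer_context):
    # JSON length approximates the data you carry per request
    return len(json.dumps(customer_context, default=str)) / (1024 * 1024)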
Complexity vs. Explicitness
The Trade: This pattern adds cognitive load. Future developers need to understand:
- Where the context is being built (get_serializer_context)
- How it's used in serializers
- What happens when data is missing
Mitigation strategies:
Document heavily:
def get_serializer_context(self):
    """
    PERFORMANCE OPTIMIZATION: Pre-fetch customer data.

    This prevents N+1 queries when serializing order events.
    Each event references a customer_id that lives in a separate database.
    The customer_context maps customer_id -> customer data dict.

    See OrderEventSerializer.get_customer_name() for usage.
    """
Name things clearly:
context['customer_context'] = ...
Fail loudly in development:
from django.conf import settings


def get_customer_name(self, obj):
    customer_context = self.context.get('customer_context')
    if customer_context is None and settings.DEBUG:
        raise ValueError(
            "customer_context not found in context. "
            "Did you override get_serializer_context()?"
        )
    # ... continue with lookup
Pagination Accuracy
If you filter data at the application layer, pagination counts might be wrong:
def list(self, request, *args, **kwargs):
    response = super().list(request, *args, **kwargs)

    # Post-serialization filtering
    if isinstance(response.data, list):
        response.data = [
            event for event in response.data
            if event.get('customer_name') != 'Unknown Customer'
        ]
        # Count is now wrong! It counted before filtering.

    return response
This might be acceptable for your use case, or you might need to handle it:
# Option 1: Filter in queryset (before serialization)
# Only possible if you can express the filter in SQL
# Option 2: Custom pagination that accounts for filtering
# Complex, usually not worth it
# Option 3: Accept the inaccuracy
# Often the pragmatic choice if filtered items are rare
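As an illustration of the first option, the cross-database filter can sometimes be pushed into get_queryset() so pagination counts stay correct. A sketch; note the list() call, which forces evaluation because the IDs come from the other database:

def get_queryset(self):
    # Drop events whose customer no longer exists, before pagination
    known_ids = list(
        Customer.objects.using('app_db').values_list('id', flat=True)
    )
    return OrderEvent.objects.using('timeseries_db').filter(customer_id__in=known_ids)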
When to Use This Pattern
If you’re dealing with cross-database queries that need to be solved at the application level and facing N+1 problems, this pattern is your solution. The main constraint: your reference data needs to fit in memory (< 10MB is a good rule of thumb).
Other Solutions Worth Mentioning
Denormalization: Copy the customer data directly into the events table (sketched after this list). Fastest option, but requires keeping data in sync and accepting some staleness.
Materialized Views: Use PostgreSQL foreign data wrappers to create a view that joins across databases. Requires DBA access and periodic refreshes.
GraphQL + DataLoader: If you’re already using GraphQL, DataLoader handles batching automatically.
Separate API calls from the frontend: fetch events and customers separately and join them client-side.
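As a rough illustration of the denormalization option, the event row itself would carry the customer fields. A sketch:

class OrderEvent(models.Model):
    timestamp = models.DateTimeField(primary_key=True)
    customer_id = models.IntegerField()
    # Denormalized copies, written when the event is recorded and never joined again
    customer_name = models.CharField(max_length=200)
    customer_tier = models.CharField(max_length=20)
    event_type = models.CharField(max_length=50)
    amount = models.DecimalField(max_digits=10, decimal_places=2)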
Living in the Unknown
Django provides get_serializer_context() as an extension point. It’s
documented, it’s designed for passing data to serializers, and using it for
caching or extending the information needed for the serializer is exactly
what it’s meant for. This isn’t fighting the framework—it’s using it as
intended. Frameworks give you tools, not rules.
Performance isn’t optional. Users notice when your app is slow. Infrastructure costs scale with inefficiency. A 99% reduction in queries isn’t premature optimization—it’s fundamental engineering.
Keep the problem in mind, document your code, and measure your improvements.
Tags: python, Django, Optimization