Why is my spatial join ignoring the GiST index?

The planner avoids the index when table statistics are stale or selectivity is too low. Run ANALYZE on both tables, then re-check EXPLAIN output. If the index is still skipped, use SET enable_seqscan = off in a test session to confirm the index path is viable, then investigate cost parameters (random_page_cost, effective_cache_size).

What causes the 'Operation on mixed SRID geometries' error?

PostGIS rejects spatial predicates when the two geometry columns carry different SRIDs. Reproject one side with ST_Transform(geom, target_srid) inside the query, or fix the column SRID permanently with ALTER TABLE … ALTER COLUMN geom TYPE geometry(Geometry, 4326) USING ST_SetSRID(geom, 4326).

How do I avoid OOM when a spatial join returns millions of rows in Python?

Replace fetchall() with a named server-side cursor (psycopg2) or yield_per() (SQLAlchemy). Set itersize on the cursor to control how many rows are fetched per round-trip. For extremely large results, write directly to an output table using INSERT … SELECT inside the database.

Spatial Joins in PostGIS and Python

Spatial joins are a core technique within Mastering Core Spatial Query Patterns: they let you correlate records across two tables using geometric relationships — intersection, containment, proximity — instead of shared primary keys. The result is a join whose ON clause is a topological predicate evaluated against coordinate data rather than an equality check on an integer column.

This page walks through the full workflow: validating preconditions, constructing the SQL join, streaming results safely in Python, and interpreting the execution plan to confirm that the query is index-driven rather than brute-force.

Prerequisites and Infrastructure Validation

Before executing a spatial join, confirm that the environment meets three hard requirements.

PostGIS extension enabled

sql

CREATE EXTENSION IF NOT EXISTS postgis;
SELECT PostGIS_Full_Version();

Python packages

psycopg2-binary>=2.9
sqlalchemy>=2.0
geoalchemy2>=0.14
shapely>=2.0

SRID alignment

Both geometry columns must carry the same SRID. Query the catalog to check:

sql

SELECT f_table_name, f_geometry_column, srid, type
FROM geometry_columns
WHERE f_table_name IN ('parcels', 'zoning');

If the SRIDs differ, reproject inline with ST_Transform(geom, 4326) or fix the column permanently:

sql

-- One-time reprojection: reproject zoning to EPSG 4326 in place
ALTER TABLE zoning
  ALTER COLUMN geom TYPE geometry(MultiPolygon, 4326)
  USING ST_Transform(geom, 4326);

-- Update the constraint so PostGIS enforces the SRID going forward
UPDATE geometry_columns
SET srid = 4326
WHERE f_table_name = 'zoning' AND f_geometry_column = 'geom';

GiST index existence check

sql

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('parcels', 'zoning')
  AND indexdef ILIKE '%gist%';

If the query returns no rows, create the indexes before running any spatial join:

sql

CREATE INDEX CONCURRENTLY parcels_geom_gidx ON parcels USING GIST (geom);
CREATE INDEX CONCURRENTLY zoning_geom_gidx  ON zoning  USING GIST (geom);

Without these indexes, PostGIS falls back to sequential scans: every row in parcels is tested against every row in zoning, producing O(n × m) geometry evaluations and catastrophic runtimes at scale.

Core Execution Workflow

Step 1 — Explicit Bounding-Box Pre-Filter

PostGIS spatial predicates trigger a two-phase evaluation: a fast bounding-box (&&) sweep using the GiST index, followed by the exact geometry check. The query planner performs this automatically, but explicitly combining && with your topological predicate ensures the optimizer never skips the index — even when statistics are stale:

sql

-- Explicit two-phase pattern: bounding-box prune + exact check
SELECT a.parcel_id, b.zone_code, ST_Area(a.geom) AS parcel_area_m2
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom          -- phase 1: GiST bounding-box sweep
 AND ST_Intersects(a.geom, b.geom)  -- phase 2: exact geometry test
WHERE b.zone_type = 'RESIDENTIAL'
  AND a.status = 'ACTIVE';

See bounding-box filtering for a detailed treatment of the && operator and how the planner uses it to prune the candidate set before running the more expensive predicate.

Step 2 — Choosing the Right Topological Predicate

The predicate you choose changes both the semantic result and the query cost:

Predicate	Returns rows where…	Index-accelerated?
`ST_Intersects(a, b)`	geometries share any point	Yes (via `&&`)
`ST_Contains(b, a)`	`b` fully contains `a`	Yes
`ST_Within(a, b)`	`a` falls entirely inside `b`	Yes
`ST_Covers(b, a)`	`b` covers `a`, including boundary	Yes
`ST_Touches(a, b)`	share only a boundary point	Yes, but low selectivity
`ST_DWithin(a, b, dist)`	within distance `dist` in CRS units	Yes — see ST_DWithin radius searches

Avoid ST_Distance(a.geom, b.geom) = 0 as an intersection substitute: it computes the full Hausdorff distance for every candidate pair and bypasses all index acceleration.

For joins where strict containment is required (points inside polygons, parcels inside administrative boundaries), ST_Contains or ST_Within is more precise than ST_Intersects and may be faster because polygon-boundary intersections are excluded from the result set.

Step 3 — Python Integration with Server-Side Cursors

Naive Python integration calls fetchall() on the connection cursor. For spatial joins returning thousands or millions of rows this causes out-of-memory errors because the entire result set is buffered in the Python process.

Use a named server-side cursor with psycopg2 instead. The cursor streams rows from PostgreSQL in batches controlled by itersize:

python

import psycopg2
from psycopg2.extras import RealDictCursor

def stream_spatial_join(dsn: str, chunk_size: int = 5000):
    """Yield rows from a spatial join using a server-side cursor."""
    with psycopg2.connect(dsn) as conn:
        # Named cursor → server-side; rows fetched in batches of chunk_size
        with conn.cursor(
            name="spatial_join_cursor",
            cursor_factory=RealDictCursor,
            withhold=False,
        ) as cur:
            cur.itersize = chunk_size
            cur.execute("""
                SELECT
                    a.parcel_id,
                    b.zone_code,
                    ST_Area(ST_Transform(a.geom, 3857)) AS area_m2,
                    ST_AsGeoJSON(a.geom)::json         AS geometry
                FROM parcels a
                JOIN zoning b
                  ON a.geom && b.geom
                 AND ST_Contains(b.geom, a.geom)
                WHERE a.status = 'ACTIVE'
            """)
            for row in cur:
                yield row

If you use GeoAlchemy2 model mapping with SQLAlchemy, swap to yield_per() on the query object:

python

from sqlalchemy import select
from sqlalchemy.orm import Session
from myapp.models import Parcel, Zoning  # GeoAlchemy2 Geometry columns

def stream_with_sqlalchemy(session: Session, batch: int = 2000):
    stmt = (
        select(Parcel.parcel_id, Zoning.zone_code)
        .join(Zoning, Parcel.geom.ST_Contains(Zoning.geom))
        .where(Parcel.status == "ACTIVE")
    )
    yield from session.execute(stmt).yield_per(batch)

For very large result sets (tens of millions of rows), skip Python altogether and write the join output directly into a staging table:

sql

-- Write the join result to a staging table inside the database
INSERT INTO staging_parcel_zones (parcel_id, zone_code, area_m2)
SELECT
    a.parcel_id,
    b.zone_code,
    ST_Area(ST_Transform(a.geom, 3857))
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Contains(b.geom, a.geom)
WHERE a.status = 'ACTIVE';

See batch processing spatial joins in Python for memory-safe pagination, cursor-management patterns, and transaction scoping for multi-batch writes.

Step 4 — Result Handling and SRID-Aware Serialisation

When returning spatial join results through an API, always serialize geometries at the database level rather than loading raw WKB into Python and converting there:

python

import psycopg2
import json

def fetch_join_as_geojson(dsn: str) -> list[dict]:
    """Return join results as Python dicts with GeoJSON geometry."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT
                    a.parcel_id,
                    b.zone_code,
                    ST_AsGeoJSON(ST_Transform(a.geom, 4326))::json AS geom_geojson
                FROM parcels a
                JOIN zoning b
                  ON a.geom && b.geom
                 AND ST_Intersects(a.geom, b.geom)
                LIMIT 1000
            """)
            cols = [d[0] for d in cur.description]
            return [dict(zip(cols, row)) for row in cur.fetchall()]

ST_AsGeoJSON converts geometry to RFC 7946 JSON inside the database; the Python side receives a plain dict. This avoids importing shapely or geoalchemy2 runtime dependencies in API workers that only need to forward the result.

Performance Considerations

Reading the EXPLAIN Plan

Always validate a new spatial join with EXPLAIN (ANALYZE, BUFFERS):

sql

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT a.parcel_id, b.zone_code
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Intersects(a.geom, b.geom);

A healthy plan looks like this:

Hash Join  (cost=12.50..890.20 rows=240 width=44)
           (actual time=1.8..18.4 rows=312 loops=1)
  Hash Cond: (a.id = b.id)
  Buffers: shared hit=84 read=12
  ->  Index Scan using parcels_geom_gidx on parcels a
        (cost=0.28..420.10 rows=9800 width=22)
        (actual time=0.05..6.2 rows=9800 loops=1)
        Index Cond: (geom && b.geom)
        Filter: ST_Intersects(geom, b.geom)
        Rows Removed by Filter: 340
  ->  Index Scan using zoning_geom_gidx on zoning b
        ...

Key signals to look for:

Index Scan using …_gidx — the GiST index is being used. Good.
Seq Scan on parcels — the planner chose a sequential scan. Investigate.
Rows Removed by Filter — rows that passed the bounding-box check but failed the exact predicate. A very high ratio means the bounding boxes are large relative to the actual polygons; consider partial GiST indexes to narrow the candidate set.
Buffers: shared read — pages read from disk rather than cache. High values after the first run indicate the working set exceeds effective_cache_size.

For a deeper treatment of plan interpretation, see reading EXPLAIN ANALYZE output for spatial joins.

Relevant GUC Settings

Parameter	Recommended value	Effect on spatial joins
`work_mem`	`32MB`–`256MB` for batch jobs	Allows larger in-memory hash builds
`random_page_cost`	`1.1` on SSD	Encourages index paths over sequential scans
`effective_cache_size`	50–75 % of RAM	Tells planner how much is in OS file cache
`enable_seqscan`	`off` (test sessions only)	Forces index path to confirm index viability

Refresh planner statistics after bulk loads:

sql

ANALYZE parcels;
ANALYZE zoning;

Common Failure Modes and Fixes

SRID Mismatch

Symptom: ERROR: Operation on mixed SRID geometries (4326 != 32632)

Diagnosis:

sql

SELECT ST_SRID(geom) AS srid, count(*)
FROM parcels
GROUP BY 1;

SELECT ST_SRID(geom) AS srid, count(*)
FROM zoning
GROUP BY 1;

Fix: Cast on the fly in the join predicate:

sql

ON ST_Intersects(
    ST_Transform(a.geom, 4326),
    ST_Transform(b.geom, 4326)
)

For repeated joins, permanently reproject the offending table rather than paying the transform cost on every query.

Missing or Invisible GiST Index

Symptom: EXPLAIN shows Seq Scan despite an index existing.

Diagnosis:

sql

-- Confirm the index exists and is valid (not broken)
SELECT indexname, indisvalid
FROM pg_indexes
JOIN pg_index ON pg_index.indexrelid = (
    SELECT oid FROM pg_class WHERE relname = 'parcels'
)
WHERE tablename = 'parcels'
  AND indexdef ILIKE '%gist%';

Fix: If indisvalid is false, the index build failed or was interrupted. Drop and recreate:

sql

DROP INDEX IF EXISTS parcels_geom_gidx;
CREATE INDEX CONCURRENTLY parcels_geom_gidx ON parcels USING GIST (geom);

If the index is valid but the planner ignores it, the statistics may indicate the table is small enough for a sequential scan to be cheaper. Update them:

sql

ANALYZE parcels;

If that still doesn’t help, lower random_page_cost or temporarily set enable_seqscan = off in your session to verify whether the index path produces a faster actual runtime before adjusting cost parameters permanently.

Planner Chooses Sequential Scan on a Large Table

Symptom: Slow spatial join on a table with millions of rows; EXPLAIN shows Seq Scan.

Cause: The planner estimates low selectivity — perhaps because the geometries cover a large area and most bounding boxes overlap anyway, or because statistics are stale after a bulk insert.

Fix sequence:

Run ANALYZE on both tables.
Check pg_statistic target: ALTER TABLE parcels ALTER COLUMN geom SET STATISTICS 500; — a higher target gives the planner more histogram buckets for the geometry column.
Increase default_statistics_target globally if many spatial tables are affected: SET default_statistics_target = 200;
If the table is large and the join is over a known geographic sub-region, add a WHERE clause that restricts to a bounding box first — this increases effective selectivity and makes the index attractive.

OOM in Python When Fetching Large Result Sets

Symptom: Python process killed or raises MemoryError during cursor.fetchall().

Diagnosis: The spatial join returns more rows than fit in the process’s heap. A simple check:

sql

SELECT count(*)
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Intersects(a.geom, b.geom);

If the count exceeds roughly 100,000 rows with wide geometry columns, fetchall() will be problematic.

Fix: Switch to a named server-side cursor (see Step 3 above). Set cur.itersize to a value that matches your memory budget: at 1 KB per row, itersize=5000 uses roughly 5 MB per batch.

Invalid Geometry Causing Predicate Errors

Symptom: ERROR: input geometry for lwgeom_intersection is not simple or silent NULL returns from ST_Intersects.

Diagnosis:

sql

SELECT id, ST_IsValidReason(geom)
FROM parcels
WHERE NOT ST_IsValid(geom);

Fix: Repair geometries before joining:

sql

-- Non-destructive: repair on the fly in the join
SELECT a.parcel_id, b.zone_code
FROM (
    SELECT parcel_id, ST_MakeValid(geom) AS geom
    FROM parcels
) a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Intersects(a.geom, b.geom);

-- Destructive: repair in place (use in ETL pipelines, not ad hoc queries)
UPDATE parcels
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);

Verification

After implementing the join, confirm correctness and performance with these checks.

Row count plausibility:

sql

-- Baseline: total parcels
SELECT count(*) FROM parcels WHERE status = 'ACTIVE';

-- After join: each active parcel should have at least one zone
SELECT count(DISTINCT a.parcel_id)
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Intersects(a.geom, b.geom)
WHERE a.status = 'ACTIVE';

If the second number is much lower than the first, geometries may not be aligned spatially (wrong region, inverted coordinates, or SRID mismatch that wasn’t caught).

Timing comparison (index vs. sequential scan):

sql

\timing on

-- Forced sequential scan (baseline)
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM parcels a JOIN zoning b ON ST_Intersects(a.geom, b.geom);

-- Reset and run with index
RESET enable_indexscan;
RESET enable_bitmapscan;
SELECT count(*) FROM parcels a JOIN zoning b ON a.geom && b.geom AND ST_Intersects(a.geom, b.geom);

A well-indexed join should be 10–100× faster than the sequential baseline on datasets of 100,000+ rows.

EXPLAIN confirmation:

sql

EXPLAIN (ANALYZE, BUFFERS)
SELECT a.parcel_id, b.zone_code
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom
 AND ST_Intersects(a.geom, b.geom)
WHERE a.status = 'ACTIVE'
LIMIT 100;

Confirm Index Scan using parcels_geom_gidx or Bitmap Index Scan appears in the plan. If Seq Scan persists after running ANALYZE, investigate the cost parameters listed in the GUC table above.

Python cursor integrity check:

python

import psycopg2

def verify_join(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor(name="verify_cursor") as cur:
            cur.itersize = 500
            cur.execute("""
                SELECT a.parcel_id, b.zone_code
                FROM parcels a
                JOIN zoning b
                  ON a.geom && b.geom
                 AND ST_Intersects(a.geom, b.geom)
                WHERE a.status = 'ACTIVE'
            """)
            row_count = sum(1 for _ in cur)
    print(f"Verified: {row_count} rows returned via server-side cursor")

verify_join("postgresql://user:pass@localhost/gisdb")

Mastering Core Spatial Query Patterns — parent section covering all spatial query strategies
Bounding Box Filtering — the && operator that accelerates the first phase of every spatial join
ST_DWithin Radius Searches — proximity-based joins using distance predicates instead of topological containment
KNN Nearest-Neighbor Queries — when you need the closest geometry rather than all geometries within a region
Batch Processing Spatial Joins in Python — memory-safe pagination and cursor management for large join outputs
Advanced GiST Indexing Optimization — index maintenance, bloat detection, and partial indexes that directly affect join performance