Batch Processing Spatial Joins in Python

Q: Why does a named psycopg2 cursor require an open transaction?

PostgreSQL's DECLARE CURSOR is transaction-scoped. psycopg2 issues DECLARE on the first execute() call of a named cursor, and the cursor's scroll position is held in server memory until the transaction ends. With autocommit=True there is no transaction to bind to, so PostgreSQL raises 'cannot DECLARE a cursor outside a transaction'.

Q: Can I commit mid-stream without closing the server-side cursor?

No. A COMMIT closes every open cursor in the transaction. If you need incremental commits to mark rows as processed, switch to primary-key pagination (WHERE id > last_id ORDER BY id LIMIT batch_size) rather than a server-side cursor.

Q: Do I need to call ST_Transform if my tables use different SRIDs?

Yes. ST_Intersects(a.geom, b.geom) silently returns false — not an error — when the SRIDs differ. Wrap one side in ST_Transform(b.geom, 4326) and ensure the index column matches the SRID you query in, or PostGIS will fall back to a sequential scan.

The exact scenario this page addresses: you have two large PostGIS geometry tables — say, delivery zones and order locations — and a naive fetchall() join crashes your container with an OOM kill before the first row reaches your application. This page is a companion to the Spatial Joins topic, which covers the broader join patterns available in PostGIS. Here you will get a single, copy-paste-ready solution: server-side cursor streaming with explicit SRID handling, GiST index validation, and a verification step you can run immediately after.

Why the Naive Approach Fails

The typical first attempt calls cursor.fetchall() after a ST_Intersects join:

python

# DO NOT do this in production
import psycopg2

conn = psycopg2.connect(DSN)
cur = conn.cursor()
cur.execute("""
    SELECT s.id, s.geom, t.attr_a
    FROM source_table s
    JOIN target_table t ON ST_Intersects(s.geom, t.geom)
""")
rows = cur.fetchall()   # materialises the entire result set in Python heap

This fails for two related reasons. First, fetchall() instructs psycopg2 to load every matching row into the Python process’s heap before your for loop starts. A 500 k-row join across two geometry tables — even with WKB bytes rather than parsed GEOS objects — can consume 3–8 GB of RAM. Second, if you then pass each row into shapely.wkb.loads(), you instantiate a full GEOS object for every row simultaneously. On a containerised workload with a 512 MB memory limit, the process is killed before it processes a single record.

The fix is not to reduce the batch size on the Python side after fetchall() — the damage is already done by the time Python sees the data. The fix is to keep the result set on the PostgreSQL server and pull it through a named cursor in bounded chunks.

Production-Ready Implementation

The complete solution below uses psycopg2 server-side cursors, SET LOCAL work_mem, explicit SRID validation, and WKB streaming. Replace DSN, source_table, and target_table with your connection string and table names.

python

import psycopg2
import psycopg2.extras
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

DSN = "dbname=gis_db user=app_user password=secret host=10.0.0.5 port=5432"
BATCH_SIZE = 5_000   # rows per fetchmany() call; tune between 2000–10000


def _check_indexes(cur: psycopg2.extensions.cursor) -> None:
    """Abort early if GiST indexes are missing — sequential scans will OOM."""
    cur.execute("""
        SELECT tablename, indexname
        FROM pg_indexes
        WHERE indexdef ILIKE '%using gist%'
          AND tablename IN ('source_table', 'target_table')
          AND schemaname = 'public'
        ORDER BY tablename
    """)
    indexed_tables = {row[0] for row in cur.fetchall()}
    missing = {"source_table", "target_table"} - indexed_tables
    if missing:
        raise RuntimeError(
            f"GiST index missing on: {missing}. "
            "Run: CREATE INDEX CONCURRENTLY ON <table> USING gist (geom);"
        )


def batch_spatial_join(sink_callback) -> int:
    """
    Stream ST_Intersects(source_table.geom, target_table.geom) results to
    sink_callback(rows) in BATCH_SIZE chunks.  Both geometry columns must be
    SRID 4326; add ST_Transform() if they differ.

    Returns total row count processed.
    """
    conn = psycopg2.connect(DSN)
    conn.autocommit = False   # server-side cursors require a live transaction

    try:
        with conn.cursor() as setup:
            # Allow PostGIS hash/sort operations to run in RAM, not temp files
            setup.execute("SET LOCAL work_mem = '512MB'")
            _check_indexes(setup)

        # Passing name= triggers DECLARE CURSOR on first execute()
        cur = conn.cursor(
            name="spatial_join_cursor",
            cursor_factory=psycopg2.extras.RealDictCursor,
            withhold=False,   # cursor is closed automatically on COMMIT
        )
        cur.itersize = BATCH_SIZE   # hint psycopg2's internal buffer

        cur.execute("""
            SELECT
                s.id                          AS source_id,
                s.geom                        AS source_geom,   -- hex WKB, SRID 4326
                t.id                          AS target_id,
                t.attr_a,
                t.attr_b
            FROM source_table  s
            JOIN target_table  t
              ON ST_Intersects(
                     ST_SetSRID(s.geom, 4326),
                     ST_SetSRID(t.geom, 4326)
                 )
            WHERE s.processed = FALSE
            ORDER BY s.id     -- stable order; remove if not needed
        """)

        processed = 0
        while True:
            rows = cur.fetchmany(BATCH_SIZE)
            if not rows:
                break

            # Pass raw dicts to caller; geometries are hex-WKB strings.
            # Only call shapely.wkb.loads(bytes.fromhex(row['source_geom']))
            # when topology operations are genuinely required.
            sink_callback(rows)

            processed += len(rows)
            logging.info("Processed %d rows total", processed)

        conn.commit()
        cur.close()
        return processed

    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


if __name__ == "__main__":
    total = batch_spatial_join(lambda rows: None)   # replace lambda with real sink
    logging.info("Done. Total rows: %d", total)

Key points in the implementation:

conn.autocommit = False is required. DECLARE CURSOR is transaction-scoped in PostgreSQL; without an open transaction psycopg2 raises can't use a named cursor outside of transactions.
cur.itersize = BATCH_SIZE sets the number of rows psycopg2 requests from the server in each internal network round-trip, matching fetchmany(BATCH_SIZE) to avoid under-fetching.
ST_SetSRID(..., 4326) guards against silent predicate failures when rows lack an SRID. If your geometries are already consistently SRID 4326 you may omit this, but the guard costs nothing and prevents a common failure mode.
withhold=False means the cursor is discarded at COMMIT, which is the correct lifecycle for a read-only streaming job. Do not set withhold=True unless you need the cursor to survive beyond the transaction — it keeps server resources allocated.

Configuration and Tuning Knobs

GUC / setting	Recommended value	Effect
`work_mem` (SET LOCAL)	256 MB – 512 MB	Allows PostGIS to sort and hash intermediate geometry candidates in RAM rather than spilling to `pg_temp`. Use `SET LOCAL` to scope it to this transaction only.
`BATCH_SIZE` / `itersize`	2 000 – 10 000 rows	Controls Python heap footprint. Larger batches reduce network round-trips; smaller batches give finer progress granularity. 5 000 is a safe default.
`random_page_cost`	1.1 (SSD) / 4.0 (HDD)	If the planner chooses a sequential scan instead of the GiST index, lower `random_page_cost` so index seeks appear cheaper. Set at session level with `SET random_page_cost = 1.1`.
`effective_cache_size`	75% of total RAM	Tells the planner how much OS page cache is available. Too-low values make index scans appear more expensive than they are.
`statement_timeout`	0 (disabled for batch)	Batch jobs can run for minutes or hours; ensure `statement_timeout` is not set at the role level or the cursor declaration will time out mid-stream.

For GiST index tuning specific to high-volume spatial joins, see the advanced GiST indexing optimisation section, particularly composite indexes for geometry and timestamp columns when your join also filters on a time range.

Verification Steps

After running the batch job, confirm correctness with two checks.

Row-count assertion — compare the batch total against a direct COUNT(*) of the join:

python

import psycopg2

def verify_join_count(expected: int) -> None:
    conn = psycopg2.connect(DSN)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*)
            FROM source_table  s
            JOIN target_table  t
              ON ST_Intersects(
                     ST_SetSRID(s.geom, 4326),
                     ST_SetSRID(t.geom, 4326)
                 )
            WHERE s.processed = FALSE
        """)
        actual = cur.fetchone()[0]
    conn.close()
    assert actual == expected, f"Row count mismatch: expected {expected}, got {actual}"
    print(f"Verification passed: {actual} rows")

EXPLAIN check — run this before the batch job to confirm the planner uses an Index Scan, not a sequential scan:

sql

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT s.id, t.attr_a
FROM source_table  s
JOIN target_table  t
  ON ST_Intersects(ST_SetSRID(s.geom, 4326), ST_SetSRID(t.geom, 4326))
WHERE s.processed = FALSE
LIMIT 100;

Look for Index Scan using <index_name> on target_table in the output. If you see Seq Scan instead, the GiST index is either missing or the planner’s cost estimates are stale — run ANALYZE source_table; ANALYZE target_table; and re-check. For a detailed guide to reading spatial EXPLAIN output, see reading EXPLAIN ANALYZE output for spatial joins.

Gotchas Checklist

Autocommit blocks the cursor. Setting conn.autocommit = True after connecting is the most common reason DECLARE CURSOR fails. Always set conn.autocommit = False before the first execute() on a named cursor. There is no way to convert an existing autocommit connection mid-stream without reconnecting.
SRID mismatches return zero rows, not an error. If source_table.geom is SRID 4326 and target_table.geom is SRID 3857, ST_Intersects evaluates coordinates in different units and returns false for every pair. Verify with SELECT DISTINCT ST_SRID(geom) FROM source_table before running the batch. Use ST_Transform to reproject one side, and ensure the target index covers the transformed SRID.
Mid-stream COMMIT destroys the cursor. PostgreSQL closes all WITHOUT HOLD cursors at COMMIT. If you need to mark rows as processed after each batch (e.g., UPDATE source_table SET processed = TRUE WHERE id = ANY(%s)), do not use a server-side cursor for the outer loop. Switch to primary-key pagination: WHERE id > %s ORDER BY id LIMIT %s.
Large work_mem multiplied by concurrency. SET LOCAL work_mem = '512MB' is per-sort-operation, and a single query can run multiple sorts. If you run four batch workers in parallel, peak RAM from work_mem alone can reach several gigabytes. Start with 128 MB and increase only after profiling pg_stat_activity for temp_files.
Index bloat after bulk inserts. GiST indexes on geometry columns accumulate dead pages during high-throughput inserts. If your batch runs after a bulk load, run VACUUM ANALYZE source_table first. Without fresh statistics the planner may underestimate selectivity and choose a sequential scan even with a healthy index.

FAQ: Server-side cursors and spatial joins

Why does a named psycopg2 cursor require an open transaction?

PostgreSQL’s DECLARE CURSOR is transaction-scoped. psycopg2 issues DECLARE on the first execute() call of a named cursor, and the cursor’s scroll position is held in server memory until the transaction ends. With autocommit=True there is no transaction to bind to, so PostgreSQL raises cannot DECLARE a cursor outside a transaction.

Can I commit mid-stream without closing the server-side cursor?

No. A COMMIT closes every open cursor in the transaction (unless declared WITH HOLD). If you need incremental commits to mark rows as processed, switch to primary-key pagination (WHERE id > last_id ORDER BY id LIMIT batch_size) rather than a server-side cursor.

Do I need to call ST_Transform if my tables use different SRIDs?

Yes. ST_Intersects(a.geom, b.geom) silently returns false — not an error — when the SRIDs differ. Wrap one side in ST_Transform(b.geom, 4326) and ensure the index column matches the SRID you query against, or PostGIS will fall back to a sequential scan because the index stores a different SRID than the function receives.

Spatial Joins — parent topic covering join strategies, predicate selection, and index interaction
ST_DWithin Radius Searches — pair with batch cursor streaming for proximity-filtered joins
Advanced GiST Indexing Optimisation — index creation, bloat detection, and VACUUM ANALYZE cadence for geometry-heavy tables
Reading EXPLAIN ANALYZE Output for Spatial Joins — interpret planner output to confirm index usage
Tuning ST_DWithin for High-Traffic APIs — related streaming and connection-pool patterns for spatial queries under load

Batch Processing Spatial Joins in Python

Why the Naive Approach Fails

Production-Ready Implementation

Configuration and Tuning Knobs

Verification Steps

Gotchas Checklist

Related Topics