Spatial indexes are an absolutely crucial part of any spatial database and, as I tend to say quite often, only a fool would try to query spatial data without building spatial indexes beforehand.
Spatial indexes are based on bounding box comparisons, which are generally very fast. Yet, there are situations when spatial indexes don’t help much (or they don’t help as much as they could, if you wish).
Bounding box comparisons are effective with lots of small bounding boxes rather than a few large ones. Why? See the picture above. The curved line (imagine it's a pipeline, for example) clearly demonstrates when the spatial index/bounding box comparison might fall short of what you'd expect.
Once the bounding box gets really big, it intersects so many other geometries’ bounding boxes that the whole comparison starts to slow down.
Luckily, PostGIS 2.2 introduced the ST_Subdivide function, which can lend a helping hand here.
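A minimal sketch of how it is used (the pipelines table and its columns are made up for illustration):

```sql
-- split each long geometry into pieces of at most 50 vertices each;
-- small pieces mean small bounding boxes and a happy spatial index
SELECT id, ST_Subdivide(geom, 50) AS geom
FROM pipelines;
```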
Until today, we delivered the parcel geometries into our real estate acquisition process system with the following query, which takes all the geometries from the req_geom table (pipelines, remember?) and intersects them with cadastral parcels. The second part of the query adds those parcels that haven't been digitized and were created manually by one of my workmates.
INSERT INTO requested_parcels (uid, par_id)
SELECT reqs.uid, b.id par_id
FROM running_requests reqs
JOIN req_geom a ON (reqs.uid = a.uid)
JOIN pargeo b ON (ST_Intersects(a.geom, b.geom))
UNION
SELECT reqs.uid, a.idpar::numeric
FROM running_requests reqs
JOIN req_man a ON (reqs.uid = a.uid);
It’s a perfectly standard query that intersects several request geometries with ~20M parcels, nothing really fancy. Except that it takes 25 minutes to finish. Why? Pipelines, remember?
Yet, the query below takes only 30 seconds to finish (a huge time saver, considering that the whole process used to take ~40 minutes)! Why? Because ST_Subdivide effectively shrinks the req_geom geometries until they have at most 50 vertices each. Such small geometries are perfect input for the bounding box comparison. Remember to use DISTINCT with ST_Subdivide; you'd probably get duplicate parcel ids otherwise.
I also replaced the UNION with a WHERE NOT EXISTS expression, as it's reasonable to assume that a numeric id comparison will be faster.
INSERT INTO requested_parcels (uid, par_id)
SELECT DISTINCT reqs.uid, b.id par_id
FROM running_requests reqs
JOIN (
    SELECT uid, ST_Subdivide(geom, 50) geom
    FROM req_geom
) a ON (reqs.uid = a.uid)
JOIN pargeo b ON (ST_Intersects(a.geom, b.geom));

INSERT INTO requested_parcels (uid, par_id)
SELECT reqs.uid, a.idpar::numeric
FROM running_requests reqs
JOIN req_man a ON (reqs.uid = a.uid)
WHERE NOT EXISTS (
    SELECT 1
    FROM requested_parcels rp
    WHERE rp.par_id = a.idpar
);
I needed to count occurrences of the / character today and found out no built-in function exists in PostgreSQL, so here's my shot at it. Pretty simple, yet useful.
CREATE OR REPLACE FUNCTION how_many(IN text, IN varchar, OUT integer)
RETURNS integer AS
$how_many$
    SELECT length($1) - length(replace($1, $2, ''));
$how_many$
LANGUAGE SQL SECURITY DEFINER;

-- SELECT how_many('test', 't'); -- returns number 2
It's pretty easy to accomplish this with the custom PL/pgSQL procedure below, which takes a line geometry, the number of sections, and the radius of the inner circle as parameters.
CREATE OR REPLACE FUNCTION ST_WindRose(
    line geometry,
    directions int,
    radius numeric
)
RETURNS TABLE (
    id integer,
    geom geometry(LINESTRING)
)
AS $ST_WindRose$
BEGIN
    IF directions % 2 <> 0 THEN
        RAISE EXCEPTION 'Odd number of directions found, please provide even number of directions instead.';
    END IF;

    IF radius > ST_Length(line) THEN
        RAISE EXCEPTION 'Inner circle radius is bigger than the wind rose diameter, please make it smaller.';
    END IF;

    RETURN QUERY
    WITH rose AS (
        SELECT ST_Rotate(_line, radians(360) / directions * dirs.id, ST_Centroid(_line)) _line
        FROM (
            SELECT line _line
        ) a
        CROSS JOIN (
            SELECT generate_series(1, directions / 2) id
        ) dirs
    )
    SELECT
        row_number() OVER ()::integer id,
        _line geom
    FROM (
        SELECT _line FROM rose
        UNION ALL
        SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), radius, 30)) -- inner circle
        UNION ALL
        SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), ST_Length(line) / 2, 30)) -- outer circle
    ) a;
END
$ST_WindRose$
LANGUAGE PLPGSQL;
A wind rose created with this function might look like the one below.
Run it as follows; the line parameter should be a simple straight line made of just two vertices.
SELECT * FROM ST_WindRose(ST_MakeLine(ST_MakePoint(0,0), ST_MakePoint(0,1)), 12, 0.01);
I needed to create a polygon from a point, defining its size on both axes, so here's a little syntactic sugar to make life easier.
CREATE OR REPLACE FUNCTION ST_PolygonFromCentroid(centroid geometry, xsize numeric, ysize numeric)
RETURNS geometry AS
$ST_PolygonFromCentroid$
    SELECT ST_MakeEnvelope(
        ST_X(ST_Translate($1, -$2, -$3)),
        ST_Y(ST_Translate($1, -$2, -$3)),
        ST_X(ST_Translate($1, $2, $3)),
        ST_Y(ST_Translate($1, $2, $3))
    );
$ST_PolygonFromCentroid$
LANGUAGE SQL;
Run it as:
SELECT ST_PolygonFromCentroid(ST_SetSRID(ST_MakePoint(13.912,50.633),4326), 1, 1);
Doing overlays (ST_Intersection()) in PostGIS based on spatial relationships (ST_Contains(), …) is so easy it is not something you get particularly excited about.
Today I faced a bit more interesting task: given two polygon layers, get me all the polygons from layer A such that they lie across the polygons from layer B and… a picture is worth a thousand words, right?
I hope you got the idea; it is fairly simple:
WITH overlays AS (
    /* nothing fancy here */
    SELECT
        A.ogc_fid a_id,
        B.ogc_fid b_id,
        ST_Intersection(A.geom, B.geom) geom,
        ST_Area(ST_Intersection(A.geom, B.geom)) area_shared
    FROM A
    JOIN B ON (ST_Intersects(A.geom, B.geom))
),
diffs AS (
    /* note this is a 1:1 relationship in ST_Difference */
    /* a little hack is needed to prevent PostGIS from returning its usual difference mess */
    SELECT
        o.a_id,
        o.b_id,
        (ST_Dump(ST_Difference(ST_Buffer(A.geom, -0.0001), o.geom))).geom, -- ugly hack
        o.area_shared
    FROM overlays o
    JOIN A ON (o.a_id = A.ogc_fid)
),
merged AS (
    /* put those two result sets together */
    SELECT * FROM overlays
    UNION ALL
    SELECT * FROM diffs
),
merged_reduced AS (
    /* get only those A polygons that consist of at least three parts for each intersection with a B polygon */
    SELECT m.*
    FROM merged m
    JOIN (
        SELECT a_id, b_id
        FROM merged
        GROUP BY a_id, b_id
        HAVING COUNT(1) > 2
    ) a ON (a.a_id = m.a_id AND a.b_id = m.b_id)
)
/* do as you wish with the result */
SELECT * FROM merged_reduced;
In my case, centerlines of layer B were also included, and their length inside each intersection was used to divide the area of the smallest part. It was fun, actually.
Since PostgreSQL 9.3 there has been a handy little keyword called LATERAL which, combined with JOIN, might rock your GIS world in a second. To keep it simple, a LATERAL JOIN enables a subquery in the FROM part of a query to reference columns from preceding expressions in the FROM list. What the heck?
Imagine that not-so-unusual request to generate random points in polygons (something I needed to do today), done automatically, without your favorite piece of desktop GIS software.
It is pretty easy using a LATERAL join:
SELECT a.id, b.*
FROM (
    VALUES (1, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'), 4326))
    UNION ALL
    VALUES (2, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'), 4326))
) a(id, geom)
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0, 200)
    ) tmp
) b;
What actually happened over there? If you want to put points inside polygons, you need… polygons. We will do just fine with two of them created inside this query:
VALUES (1, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'), 4326))
UNION ALL
VALUES (2, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'), 4326))
All the magic happens inside the LATERAL JOIN part of the query:
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0, 200)
    ) tmp
) b;
The inner SELECT calculates random points based on the extent of the polygon. Note that it directly references a.geom, a value that comes from the preceding FROM item. The LATERAL subquery is thus run for every row of the outer FROM expression, which means it returns 201 points for each of the two polygons (run the query inside QGIS to see the result).
Note that all the points fall inside the polygons only by accident, because the polygons are squares. Otherwise ST_Within should be used in the outermost WHERE clause to filter out the outliers. This part could use some tweaking.
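A sketch of such a tweak might look like the following; the polygons table with id and geom columns is my assumption, and the only change against the query above is the ST_Within filter:

```sql
SELECT a.id, b.geom
FROM polygons a
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0, 200)
    ) tmp
) b
WHERE ST_Within(b.geom, a.geom); -- keep only the points that really fall inside
```

Note that with non-rectangular polygons, fewer than 201 points per polygon would survive the filter.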
Imagine you run two database machines hosting structurally identical databases on two separate servers and you need to transfer data from one to the other, not very often, let's say once a month. Your tables aren't small, nor huge: millions of rows in general.
You're going to use pg_dump and pipe it to psql, but the indexes on your tables will slow you down a lot. That's why you'll want to drop all the indexes and constraints first:
SELECT 'ALTER TABLE ' || tc.table_schema || '.' || tc.table_name ||
       ' DROP CONSTRAINT ' || tc.constraint_name || ';'
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu
    ON (tc.constraint_catalog = ccu.constraint_catalog
        AND tc.constraint_schema = ccu.constraint_schema
        AND tc.constraint_name = ccu.constraint_name)
WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
UNION ALL
SELECT 'DROP INDEX IF EXISTS ' || schemaname || '.' || indexname || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ',')));
Then you will transfer the data:
pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target
And restore the dropped indexes and constraints once the data is in:
WITH constraints AS (
    SELECT 'ALTER TABLE ' || tc.table_schema || '.' || tc.table_name ||
           ' ADD CONSTRAINT ' || tc.constraint_name || ' ' || tc.constraint_type ||
           '(' || string_agg(ccu.column_name::text, ', ') || ');' def,
           -- column order should be taken into account here
           tc.table_schema,
           tc.table_name,
           tc.constraint_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.constraint_column_usage ccu
        ON (tc.constraint_catalog = ccu.constraint_catalog
            AND tc.constraint_schema = ccu.constraint_schema
            AND tc.constraint_name = ccu.constraint_name)
    WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
        AND tc.constraint_type = 'PRIMARY KEY'
    GROUP BY tc.table_schema, tc.table_name, tc.constraint_name, tc.constraint_type
)
SELECT def FROM constraints
UNION ALL
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ',')))
    AND NOT EXISTS (
        SELECT 1
        FROM constraints c
        WHERE pg_indexes.schemaname = c.table_schema
            AND pg_indexes.tablename = c.table_name
            AND pg_indexes.indexname = c.constraint_name
    );
Notice the :schemas variable. Variable assignment is one of the psql features I really like.
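For illustration (the script name is made up), a value passed on the command line can be referenced inside the script:

```sql
-- invoked as: psql -d target -v schemas='schema1','schema2' -f expand_schemas.sql
SELECT unnest(string_to_array(:'schemas', ',')) AS schema_name;
```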
Notice the DROP INDEX IF EXISTS and the CTE used in the restore code; that's due to the fact that dropping a constraint drops its underlying index as well, and you don't want to drop something that doesn't exist or create something that already exists.
The bash script proposal might look as follows:
# store index and constraint definitions
psql -qAt -d target -v schemas='schema1','schema2' -f create_indices_constraints.sql > create.sql

# drop indices and constraints
psql -qAt -d target -v schemas='schema1','schema2' -f drop_indices_constraints.sql | psql -d target

# load data
pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target

# renew indices and constraints
psql -qAt -d target -f create.sql
PostgreSQL foreign data wrappers are used to connect a PostgreSQL database to different data sources, e.g. other SQL databases, CSV files, XLS spreadsheets, etc.
The one I’ve been interested in for several months is Paul Ramsey’s OGR FDW - it gives you access to OGR supported spatial formats directly from your database. No more shapefiles lying around?
Each foreign data wrapper should have three basic components:
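Before defining any of them, the extension itself has to be installed in the database (assuming the ogr_fdw binaries are already present on the server):

```sql
CREATE EXTENSION ogr_fdw;
```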
First define the foreign server object:
CREATE SERVER dibavod
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
    datasource '/downloads/dibavod',
    format 'ESRI Shapefile',
    config_options 'SHAPE_ENCODING=CP1250'
);
Note the OGR-specific driver configuration options are available inside config_options. In the case of ESRI Shapefiles, the datasource is the directory your files reside in.
Let's create the PostgreSQL tables (use ogrinfo or Paul's ogr_fdw_info to list the columns):
CREATE FOREIGN TABLE rivers (
    fid integer,
    utokj_id numeric,
    utokjn_id numeric,
    utokjn_f numeric,
    prprop_z integer,
    ex_jh integer,
    pozn text,
    shape_leng numeric,
    naz_tok text,
    idvt integer,
    tok_id numeric,
    shape_len numeric,
    geom geometry(LINESTRING, 5514)
)
SERVER dibavod
OPTIONS (layer 'A02_Vodni_tok_JU');

CREATE FOREIGN TABLE dams (
    fid integer,
    objectid integer,
    naz_na text,
    nadr_gid numeric,
    kota_hladi numeric,
    hloubka numeric,
    zatop_ploc numeric,
    objem numeric,
    kota_hraz numeric,
    kota_preli numeric,
    kota_vypus numeric,
    plocha_m2 numeric,
    shape_area numeric,
    shape_len numeric,
    geom geometry(MULTIPOLYGON, 5514)
)
SERVER dibavod
OPTIONS (layer 'A05_Vodni_nadrze');
Things to remember:
- The fid column is required for write access to the underlying datasource.
- You can copy a foreign table into an ordinary one with CREATE UNLOGGED TABLE dams2 AS SELECT * FROM dams (not sure what to use it for, though).
All my GIS life I've been using a simple JOIN clause to find a row with an id = previous_id + 1. In other words, imagine a simple table with no indexes:
CREATE TABLE test (id integer);

INSERT INTO test
SELECT i FROM generate_series(1, 10000000) i;
Let's retrieve the next row for each row in that table:
SELECT a.id, b.id
FROM test a
LEFT JOIN test b ON (a.id + 1 = b.id); -- note the LEFT JOIN is needed to get the last row as well
Execution plan looks like this:
Hash Join  (cost=311087.17..953199.41 rows=10088363 width=8) (actual time=25440.770..79591.869 rows=10000000 loops=1)
  Hash Cond: ((a.id + 1) = b.id)
  ->  Seq Scan on test a  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.588..10801.584 rows=10000001 loops=1)
  ->  Hash  (cost=145574.63..145574.63 rows=10088363 width=4) (actual time=25415.282..25415.282 rows=10000001 loops=1)
        Buckets: 16384  Batches: 128  Memory Usage: 2778kB
        ->  Seq Scan on test b  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.422..11356.108 rows=10000001 loops=1)
Planning time: 0.155 ms
Execution time: 90134.248 ms
If we add an index with CREATE INDEX ON test (id), the plan changes:
Merge Join  (cost=0.87..669369.85 rows=9999844 width=8) (actual time=0.035..56219.294 rows=10000001 loops=1)
  Merge Cond: (a.id = b.id)
  ->  Index Only Scan using test_id_idx on test a  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.015..11101.937 rows=10000001 loops=1)
        Heap Fetches: 0
  ->  Index Only Scan using test_id_idx on test b  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.012..11827.895 rows=10000001 loops=1)
        Heap Fetches: 0
Planning time: 0.244 ms
Execution time: 65973.421 ms
Window functions are real fun. They’re great if you’re doing counts, sums or ranks by groups. And, to my surprise, they’re great in finding next rows as well.
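Just for illustration, a ranks-by-groups flavor of the same machinery (the cities table and its columns are made up):

```sql
-- rank cities by population within each country
SELECT
    country,
    city,
    population,
    rank() OVER (PARTITION BY country ORDER BY population DESC) pop_rank
FROM cities;
```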
With the same test table, we retrieve the next row for each row with the following query:

SELECT id, lead(id) OVER (ORDER BY id)
FROM test;
How does that score without an index? Better than the JOIN:

WindowAgg  (cost=1581246.90..1756294.50 rows=10002720 width=4) (actual time=28785.388..63819.071 rows=10000001 loops=1)
  ->  Sort  (cost=1581246.90..1606253.70 rows=10002720 width=4) (actual time=28785.354..40117.899 rows=10000001 loops=1)
        Sort Key: id
        Sort Method: external merge  Disk: 136848kB
        ->  Seq Scan on test  (cost=0.00..144718.20 rows=10002720 width=4) (actual time=0.020..10797.961 rows=10000001 loops=1)
Planning time: 0.242 ms
Execution time: 73391.024 ms
And it works even better if indexed. It's actually ~1.5× faster than the indexed JOIN:

WindowAgg  (cost=0.43..409770.03 rows=10002720 width=4) (actual time=0.087..35647.815 rows=10000001 loops=1)
  ->  Index Only Scan using test_id_idx on test  (cost=0.43..259729.23 rows=10002720 width=4) (actual time=0.059..11310.879 rows=10000001 loops=1)
        Heap Fetches: 0
Planning time: 0.247 ms
Execution time: 45388.202 ms
It reads well and the purpose of such a query is pretty obvious.
Filtering really big (millions of rows) point datasets by distance can get tricky when solved with the wrong PostGIS functions. Without spatial indexes, PostGIS would take ages to finish such a task.
Someone over at StackExchange asked why the query below had been running for 15 hours (!) with no result:
SELECT
    a.gid,
    b.gid,
    ST_Distance(a.geom, b.geom)
FROM shp1 a, shp2 b
WHERE ST_Intersects(
        ST_Difference(
            ST_Buffer(a.geom, 2000),
            ST_Buffer(a.geom, 500)
        ),
        b.geom
    )
    AND abs(a.value - b.value) > 400;
The answer is quite simple: it was using the wrong functions. Let's see:
- ST_Distance() does not use a spatial index; it's a plain mathematical calculation, it's expensive, and for point datasets it can be replaced by a spatial operator.
- ST_Buffer() will definitely take time to build polygons around points. And it's being run twice!
- ST_Difference() needs more time to compare the buffers and return just the portion of space they don't share. Isn't it a huge waste to create buffers and then throw them away?
- ST_Intersects() finally checks whether the point should be included in the result.
That was a great challenge to test my knowledge of how PostGIS works and here’s my shot at it:
SELECT *
FROM (
    SELECT
        a.gid,
        b.gid,
        a.geom <-> b.geom distance
    FROM shp1 a, shp2 b
    WHERE abs(a.value - b.value) > 400
        AND ST_DWithin(a.geom, b.geom, 2000)
) sq
WHERE distance > 500;
- I use the <-> operator, a.k.a. geometry distance centroid, instead of ST_Distance(). It takes advantage of spatial indexes, thus it's fast. And it's perfectly fine to use it with a point dataset, because the centroid of a point's bounding box is? That point, exactly. Spatial indexes have to be built beforehand.
- This is exactly what ST_DWithin() was made for. With the inner WHERE clause I get all the points lying at most 2,000 meters away and having an absolute value difference bigger than 400. ST_DWithin() will make use of any spatial index available, unlike ST_Distance().
- The outer WHERE clause throws away all the points closer than 500 meters. Remember, we already kept only those not further than 2,000 meters in the previous step.
It took PostGIS 1,060,545.59 ms (~17 minutes) on my Quad-Core Intel® Core™ i5-4210M CPU @ 2.60GHz, 4 GB RAM, 500 GB SSD hard drive, PostgreSQL 9.3 and PostGIS 2.1 with two point datasets having 4M and 300K rows, respectively.