Michal Zimmermann Pieces of knowledge from the world of GIS.

Articles tagged with sql tag

Subdivide and Conquer: Effective Spatial Indexes in PostGIS

Written on Jan 10, 2017 and marked as sql, postgresql, postgis | SQL

Spatial indexes are absolutely crucial part of any spatial database and - as I tend to say quite often - only a fool would try to query spatial data without building spatial indexes beforehand.

Spatial indexes are based on bounding box comparisons, which are generally very fast. Yet, there are situations when spatial indexes don’t help much (or they don’t help as much as they could, if you wish).

Bounding box comparisons are effective with lots of small bounding boxes rather then few large ones. Why? See the picture above. The curved line (imagine it’s a pipeline for example) clearly demonstrates when the spatial index/bounding box comparison might fall short of what you’d expect.

Once the bounding box gets really big, it intersects so many other geometries’ bounding boxes that the whole comparison starts to slow down.

Luckily, PostGIS 2.2 introduced a ST_Subdivide function that can lend a helping hand in here.

Until today, we delivered the parcel geometries into our real estate acquisition process system with the following query, that takes all the geometries from the req_geom table (pipelines, remember?) and intersects them with cadastral parcels. The second part of the query adds those parcels that haven’t been digitalized and were created manually by one of my workmates.

INSERT INTO requested_parcels (uid, par_id)
SELECT
    reqs.uid,
    b.id par_id
 FROM
    running_requests reqs
 JOIN
    req_geom a ON (reqs.uid = a.uid)
 JOIN
    pargeo b ON (ST_Intersects(a.geom, b.geom))
 UNION
 SELECT
    reqs.uid,
    a.idpar::numeric
 FROM
    running_requests reqs
 JOIN
     req_man a ON (reqs.uid = a.uid);

It’s a perfectly standard query that intersects several request geometries with ~20M parcels, nothing really fancy. Except that it takes 25 minutes to finish. Why? Pipelines, remember?

Yet, the query below takes only 30 seconds to finish (that’s a huge time saver considering that the whole process used to take ~40 minutes)! Why? Because the ST_Subdivide effectively shrinks the req_geom geometries until they have 50 vertices each at most. Such small geometries are perfect input for the bounding box comparison. Remember to call DISTINCT when using ST_Subdivide, you’d probably get duplicate parcel ids otherwise.

I also replaced the UNION with the WHERE NOT EXISTS expression, as it’s reasonable to assume that numeric ids comparison will be faster.

INSERT INTO requested_parcels (uid, par_id)
SELECT DISTINCT
    reqs.uid,
    b.id par_id
 FROM
    running_requests reqs
 JOIN
    (
        SELECT
            uid,
            ST_Subdivide(geom, 50) geom
        FROM
            req_geom
     ) a ON (reqs.uid = a.uid)
 JOIN
     pargeo b ON (ST_Intersects(a.geom, b.geom));

 INSERT INTO requested_parcels (uid, par_id)
 SELECT
     reqs.uid,
     a.idpar::numeric
 FROM
     running_requests reqs
 JOIN
     req_man a ON (reqs.uid = a.uid)
 WHERE NOT EXISTS (
     SELECT 1
     FROM pozadovane_parcely pp
     WHERE pp.par_id = a.idpar
  );

Counting substring occurrences in PostgreSQL

Written on Dec 19, 2016 and marked as sql, postgresql | SQL

I got to count occurrences of / character today and found out no built-in function exists in PostgreSQL, so here’s my shot at it. Pretty simple, yet useful.

CREATE OR REPLACE FUNCTION how_many(IN text, IN varchar, OUT integer)
RETURNS integer
AS
$how_many$
    SELECT length($1) - length(replace($1, $2, ''));
$how_many$
LANGUAGE SQL
SECURITY DEFINER;

-- SELECT how_many('test', 't'); -- returns number 2

PostGIS Custom Function to Create Wind Rose

Written on Sep 1, 2016 and marked as postgis, postgresql, sql | SQL

I’ve come across the beautiful GIS StackExchange question recently, asking how to draw a wind rose within PostGIS.

It’s pretty easy to accomplish this with a custom PLPGSQL procedure below, that takes line geometry, number of sections and radius of the inner circle as parameters.

CREATE OR REPLACE FUNCTION ST_WindRose(
    line geometry,
    directions int,
    radius numeric
)
RETURNS TABLE (
    id integer,
    geom geometry(LINESTRING)
)
AS $ST_WindRose$
BEGIN
    IF directions % 2 <> 0 THEN
        RAISE EXCEPTION 'Odd number of directions found, please provide even number of directions instead.';
    END IF;

IF radius > ST_Length(line) THEN
    RAISE EXCEPTION 'Inner circle radius is bigger than the wind rose diameter, please make it smaller.';
END IF;

RETURN QUERY
WITH rose AS (
    SELECT
        ST_Rotate(_line, radians(360) / directions * dirs.id, ST_Centroid(_line)) _line
    FROM (
        SELECT line _line
    ) a
    CROSS JOIN (
        SELECT generate_series(1, directions / 2) id
    ) dirs
)
SELECT
    row_number() OVER ()::integer id,
    _line geom
FROM (
    SELECT _line FROM rose
    UNION ALL
    SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), radius, 30)) -- inner circle
    UNION ALL
    SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), ST_Length(line)/2, 30)) -- outer circle
) a;
END
$ST_WindRose$
LANGUAGE PLPGSQL;

Wind rose created with this function might look like the one below.

Run it as follows. The line parameter should be a simple straight line made of just two vertices.

SELECT * FROM ST_WindRose(ST_MakeLine(ST_MakePoint(0,0), ST_MakePoint(0,1)), 12, 0.01);

PostGIS Custom Function to Create Polygon from Centroid

Written on Aug 28, 2016 and marked as postgis, postgresql, sql | SQL

Needed to create a polygon from a point defining its size in both axes, here’s a little syntax sugar to make life easier.

CREATE OR REPLACE FUNCTION ST_PolygonFromCentroid(centroid geometry, xsize numeric, ysize numeric)
RETURNS geometry
AS $ST_PolygonFromCentroid$
SELECT ST_MakeEnvelope(
    ST_X(ST_Translate($1, -$2, -$3)),
    ST_Y(ST_Translate($1, -$2, -$3)),
    ST_X(ST_Translate($1, $2, $3)),
    ST_Y(ST_Translate($1, $2, $3))
);
$ST_PolygonFromCentroid$
LANGUAGE SQL;

Run it as:

SELECT ST_PolygonFromCentroid(ST_SetSRID(ST_MakePoint(13.912,50.633),4326), 1, 1);

Finding Polygons Lying across Other Polygons with PostGIS

Written on Aug 5, 2016 and marked as postgis, postgresql, sql | SQL

Doing overlays (ST_Intersection()) in PostGIS based on spatial relationships (ST_Intersects(), ST_Contains(), …) is so easy it is not something you get particularly excited about.

Today I faced a bit more interesting task: given two polygon layers, get me all the polygons from layer A such that they lie across the polygons from layer B and… a picture worth a thousand words, right?

I hope you got the idea, it is fairly simple:

  1. Intersect A (red, blue) with B (green)
  2. Subtract the result of previous from layer A
  3. Combine results from steps 1 and 2
  4. Keep polygon only if its id occurs more than twice (that means it went straight through the layer B)
  5. Profit!
WITH overlays AS (
/* nothing fancy here */
    SELECT
        A.ogc_fid a_id,
        B.ogc_fid b_id,
        ST_Intersection(A.geom, B.geom) geom,
        ST_Area(ST_Intersection(A.geom, B.geom) area_shared
    FROM A
    JOIN B ON (ST_Intersects(A.geom, B.geom)
),
diffs AS (
/* note this is a 1:1 relationship in ST_Difference */
/* a little hack is needed to prevent PostGIS from returning its usual difference mess */
    SELECT
        o.a_id,
        o.b_id,
        (ST_Dump(ST_Difference(ST_Buffer(A.geom, -0.0001), o.geom))).geom, -- ugly hack
        o.area_shared
    FROM overlays o
    JOIN A ON (o.a_id = A.id)
),

merged AS (
/* put those two result sets together */
    SELECT * FROM overlays
    UNION ALL
    SELECT * FROM diffs
),

merged_reduced AS (
/* get only those A polygons that consist of three parts at least for each intersection with B polygon */
  SELECT
    m.*
  FROM merged m
  JOIN (
    SELECT
      a_id,
      b_id
    FROM merged
    GROUP BY a_id, b_id
    HAVING COUNT(1) > 2
  ) a ON (a.a_id = m.a_id AND a.b_id = m.b_id)
)
/* do as you wish with the result */
SELECT *
FROM merged_reduced;

In my case, centerlines of layer B were also included and their length inside each intersection was used to divide the area of the smallest part with. It was fun, actually.

Dead Simple Random Points in Polygons with PostGIS

Written on Aug 3, 2016 and marked as postgis, postgresql, sql | SQL

Since PostgreSQL 9.3 there has been a handy little keyword called LATERAL, which - combined with JOIN - might rock your GIS world in a second. To keep it simple, a LATERAL JOIN enables a subquery in the FROM part of a query to reference columns from preceding expressions in the FROM list. What the heck?

Imagine that not so unusual request to generate random points in polygons (something I needed to do today). Do it automatically without your favorite piece of desktop GIS software.

It is pretty easy using LATERAL JOIN:

SELECT
    a.id,
    b.*
FROM (
    VALUES(
        1,
        ST_SetSRID(
            ST_GeomFromText(
                'POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'
            ),
        4326)
    )
    UNION ALL
    VALUES(
        2,
        ST_SetSRID(
            ST_GeomFromText(
                'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'
            ),
        4326)
    )
) a(id, geom)
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0,200)
    ) tmp
) b;

What actually happened over there? If you want to put points inside polygons, you need… polygons. We will do just fine with two of them created inside this query:

VALUES(
    1,
    ST_SetSRID(
        ST_GeomFromText(
            'POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'
        ),
    4326)
)
UNION ALL
VALUES(
    2,
    ST_SetSRID(
        ST_GeomFromText(
            'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'
        ),
    4326)
)

All the magic happens inside the LATERAL JOIN part of the query:

CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0,200)
    ) tmp
) b;

The inner SELECT calculates random points based on the extent of the polygon. Note it directly calls a.geom, a value that comes from the previous SELECT! The LATERAL JOIN part is thus run for every row of the previous SELECT statement inside FROM part of the query. This means it will return 201 points for each of the two polygons (run the query inside QGIS to see the result).

Note all the points fall inside the polygons by accident, because they are square. Otherwise a ST_Contains or ST_Within should be used inside the outermost WHERE query to filter outliers. This part could use some tweaking.

Looking for the Next Row with PostgreSQL

Written on Jan 23, 2016 and marked as postgresql, sql | SQL

Using JOIN clause

All my GIS life I’ve been using a simple JOIN clause to find a row with an id = previous_id + 1. In other words, imagine a simple table with no indices:

CREATE TABLE test (id integer);
INSERT INTO test SELECT i FROM generate_series(1,10000000) i;

Let’s retrieve next row for each row in that table:

SELECT a.id, b.id
FROM test a
LEFT JOIN test b ON (a.id + 1 = b.id); -- note the LEFT JOIN is needed to get the last row as well

Execution plan looks like this:

Hash Join  (cost=311087.17..953199.41 rows=10088363 width=8) (actual time=25440.770..79591.869 rows=10000000 loops=1)
   Hash Cond: ((a.id + 1) = b.id)
   ->  Seq Scan on test a  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.588..10801.584 rows=10000001 loops=1)
   ->  Hash  (cost=145574.63..145574.63 rows=10088363 width=4) (actual time=25415.282..25415.282 rows=10000001 loops=1)
         Buckets: 16384  Batches: 128  Memory Usage: 2778kB
         ->  Seq Scan on test b  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.422..11356.108 rows=10000001 loops=1)
 Planning time: 0.155 ms
 Execution time: 90134.248 ms

If we add an index with CREATE INDEX ON test (id), the plan changes:

Merge Join  (cost=0.87..669369.85 rows=9999844 width=8) (actual time=0.035..56219.294 rows=10000001 loops=1)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using test_id_idx on test a  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.015..11101.937 rows=10000001 loops=1)
         Heap Fetches: 0
   ->  Index Only Scan using test_id_idx on test b  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.012..11827.895 rows=10000001 loops=1)
         Heap Fetches: 0
 Planning time: 0.244 ms
 Execution time: 65973.421 ms

Not bad.

Using window function

Window functions are real fun. They’re great if you’re doing counts, sums or ranks by groups. And, to my surprise, they’re great in finding next rows as well.

With the same test table, we retrieve next row for each row with the following query:

SELECT id, lead(id) OVER (ORDER BY id)
FROM test.test;

How does that score without an index? Better than the JOIN clause.

WindowAgg  (cost=1581246.90..1756294.50 rows=10002720 width=4) (actual time=28785.388..63819.071 rows=10000001 loops=1)
   ->  Sort  (cost=1581246.90..1606253.70 rows=10002720 width=4) (actual time=28785.354..40117.899 rows=10000001 loops=1)
         Sort Key: id
         Sort Method: external merge  Disk: 136848kB
         ->  Seq Scan on test  (cost=0.00..144718.20 rows=10002720 width=4) (actual time=0.020..10797.961 rows=10000001 loops=1)
 Planning time: 0.242 ms
 Execution time: 73391.024 ms

And it works even better if indexed. It’s actually ~1,5× faster than the JOIN way.

WindowAgg  (cost=0.43..409770.03 rows=10002720 width=4) (actual time=0.087..35647.815 rows=10000001 loops=1)
   ->  Index Only Scan using test_id_idx on test  (cost=0.43..259729.23 rows=10002720 width=4) (actual time=0.059..11310.879 rows=10000001 loops=1)
         Heap Fetches: 0
 Planning time: 0.247 ms
 Execution time: 45388.202 ms

It reads well and the purpose of such a query is pretty obvious.

PostgreSQL IN vs EXISTS

Written on Oct 9, 2015 and marked as sql | development

Until recently, SQL IN and EXISTS were almost exactly the same to me. There is a significant difference both in execution plans and time of execution though, as I found out after not being able to speed up my workmate’s query.

Assume two not-as-small-as-they-might-be tables:

BEGIN;

CREATE UNLOGGED TABLE test.small AS
SELECT * FROM generate_series(0, 500000) id;

CREATE UNLOGGED TABLE test.big AS
SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);

COMMIT;

To find out what rows from test.big is missing in test.small, you’ll use one of these queries:

SELECT id
FROM test.big
WHERE id NOT IN (SELECT id FROM test.small);

                            QUERY PLAN
-----------------------------------------------------------------------------------------
Seq Scan on big  (cost=8463.01..42313.02 rows=1000000 width=4) (actual time=177.061..864.043 rows=1500894 loops=1)
    Filter: (NOT (hashed SubPlan 1))
    Rows Removed by Filter: 499107
    SubPlan 1
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.045..34.727 rows=500001 loops=1)
    Total runtime: 904.413 ms
(6 rows)


SELECT id
FROM test.big
WHERE NOT EXISTS (
    SELECT 1
    FROM test.small
    WHERE test.big.id = test.small.id
);
                            QUERY PLAN
-----------------------------------------------------------------------------------------
Hash Anti Join  (cost=15417.02..82100.58 rows=955189 width=4) (actual time=100.257..1240.343 rows=1500894 loops=1)
    Hash Cond: (big.id = small.id)
    ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4) (actual time=0.016..125.024 rows=2000001 loops=1)
    ->  Hash  (cost=7213.01..7213.01 rows=500001 width=4) (actual time=100.068..100.068 rows=500001 loops=1)
        Buckets: 65536  Batches: 2  Memory Usage: 8800kB
        ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.011..35.543 rows=500001 loops=1)
Total runtime: 1280.609 ms

That’s not a significant difference in time execution, is it?

What if you want to find out what rows from test.small is missing in test.big?

SELECT id
FROM test.small
WHERE id NOT IN (SELECT id FROM test.big);

                                QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on small  (cost=0.00..12915788669.52 rows=250000 width=4)
    Filter: (NOT (SubPlan 1))
    SubPlan 1
    ->  Materialize  (cost=0.00..46663.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)


SELECT id
FROM test.small
WHERE NOT EXISTS (
    SELECT 1
    FROM test.big
    WHERE test.big.id = test.small.id
);

                               QUERY PLAN
-------------------------------------------------------------------------
Hash Anti Join  (cost=61663.02..180597.23 rows=1 width=4)
    Hash Cond: (small.id = big.id)
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4)
    ->  Hash  (cost=28850.01..28850.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)

It took me ~750 ms to get the result with EXISTS expression. I kept IN running whole night with no result. I’m not really sure why IN is so much slower, it might be caused by checks for NULL values. The speed is also related to the size of the subquery, thus the difference when tables were switched.

LEFT JOIN can be used to achieve the same result, I find its syntax less obvious though.

No indexes were built this time, I know they don’t help the IN performance at all from my previous tests. Tested with PostgreSQL 9.3.9.