Michal Zimmermann: Pieces of knowledge from the world of GIS.

PostGIS Custom Function to Create Polygon from Centroid

Written on Aug 28, 2016 and marked as postgis, postgresql, sql | SQL

I needed to create a polygon from a point by defining its size along both axes, so here’s a little syntactic sugar to make life easier.

CREATE OR REPLACE FUNCTION ST_PolygonFromCentroid(centroid geometry, xsize numeric, ysize numeric)
RETURNS geometry
AS $ST_PolygonFromCentroid$
SELECT ST_MakeEnvelope(
    ST_X(ST_Translate($1, -$2, -$3)),
    ST_Y(ST_Translate($1, -$2, -$3)),
    ST_X(ST_Translate($1, $2, $3)),
    ST_Y(ST_Translate($1, $2, $3)),
    ST_SRID($1) -- keep the SRID of the input point, otherwise the envelope ends up with SRID 0
);
$ST_PolygonFromCentroid$
LANGUAGE SQL;

Run it as:

SELECT ST_PolygonFromCentroid(ST_SetSRID(ST_MakePoint(13.912,50.633),4326), 1, 1);
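
A quick sanity check (just a sketch of what to expect): the sizes are applied in both directions from the centroid, so with xsize and ysize of 1 the result is a 2 by 2 degree rectangle spanning from (12.912, 49.633) to (14.912, 51.633).

SELECT ST_AsText(
    ST_PolygonFromCentroid(
        ST_SetSRID(ST_MakePoint(13.912, 50.633), 4326),
        1,
        1
    )
);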

Finding Polygons Lying across Other Polygons with PostGIS

Written on Aug 5, 2016 and marked as postgis, postgresql, sql | SQL

Doing overlays (ST_Intersection()) in PostGIS based on spatial relationships (ST_Intersects(), ST_Contains(), …) is so easy it is not something you get particularly excited about.

Today I faced a bit more interesting task: given two polygon layers, get me all the polygons from layer A that lie across the polygons from layer B and… well, a picture is worth a thousand words, right?

I hope you get the idea; it is fairly simple:

  1. Intersect A (red, blue) with B (green)
  2. Subtract the result of previous from layer A
  3. Combine results from steps 1 and 2
  4. Keep a polygon only if its id occurs more than twice (that means it went straight through layer B)
  5. Profit!

WITH overlays AS (
/* nothing fancy here */
    SELECT
        A.ogc_fid a_id,
        B.ogc_fid b_id,
        ST_Intersection(A.geom, B.geom) geom,
        ST_Area(ST_Intersection(A.geom, B.geom)) area_shared
    FROM A
    JOIN B ON (ST_Intersects(A.geom, B.geom))
),
diffs AS (
/* note this is a 1:1 relationship in ST_Difference */
/* a little hack is needed to prevent PostGIS from returning its usual difference mess */
    SELECT
        o.a_id,
        o.b_id,
        (ST_Dump(ST_Difference(ST_Buffer(A.geom, -0.0001), o.geom))).geom, -- ugly hack
        o.area_shared
    FROM overlays o
    JOIN A ON (o.a_id = A.ogc_fid)
),

merged AS (
/* put those two result sets together */
    SELECT * FROM overlays
    UNION ALL
    SELECT * FROM diffs
),

merged_reduced AS (
/* keep only those A polygons that consist of at least three parts for a given intersection with a B polygon */
  SELECT
    m.*
  FROM merged m
  JOIN (
    SELECT
      a_id,
      b_id
    FROM merged
    GROUP BY a_id, b_id
    HAVING COUNT(1) > 2
  ) a ON (a.a_id = m.a_id AND a.b_id = m.b_id)
)
/* do as you wish with the result */
SELECT *
FROM merged_reduced;

In my case, centerlines of layer B were also included, and their length inside each intersection was used to divide the area of the smallest part by. It was fun, actually.

Dead Simple Random Points in Polygons with PostGIS

Written on Aug 3, 2016 and marked as postgis, postgresql, sql | SQL

Since PostgreSQL 9.3 there has been a handy little keyword called LATERAL, which - combined with JOIN - might rock your GIS world in a second. Put simply, a LATERAL JOIN enables a subquery in the FROM part of a query to reference columns from preceding items in the FROM list. What the heck?

Imagine that not-so-unusual request to generate random points inside polygons (something I needed to do today), and to do it automatically, without your favorite piece of desktop GIS software.

It is pretty easy using LATERAL JOIN:

SELECT
    a.id,
    b.*
FROM (
    VALUES(
        1,
        ST_SetSRID(
            ST_GeomFromText(
                'POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'
            ),
        4326)
    )
    UNION ALL
    VALUES(
        2,
        ST_SetSRID(
            ST_GeomFromText(
                'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'
            ),
        4326)
    )
) a(id, geom)
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0,200)
    ) tmp
) b;

What actually happened over there? If you want to put points inside polygons, you need… polygons. We will do just fine with two of them created inside this query:

VALUES(
    1,
    ST_SetSRID(
        ST_GeomFromText(
            'POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'
        ),
    4326)
)
UNION ALL
VALUES(
    2,
    ST_SetSRID(
        ST_GeomFromText(
            'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'
        ),
    4326)
)

All the magic happens inside the LATERAL JOIN part of the query:

CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0,200)
    ) tmp
) b;

The inner SELECT calculates random points based on the extent of the polygon. Note it directly references a.geom, a value that comes from the preceding FROM item! The LATERAL part is thus run once for every row produced by the preceding part of the FROM clause. Since generate_series(0,200) yields 201 values, it will return 201 points for each of the two polygons (run the query inside QGIS to see the result).

Note that all the points fall inside the polygons only because the polygons are axis-aligned rectangles and thus coincide with their own bounding boxes. For arbitrary shapes, ST_Contains() or ST_Within() should be used in an outer WHERE clause to filter out the points that miss the polygon. This part could use some tweaking.
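
A minimal sketch of that filtering: the same query with a containment check added at the end. For these rectangular polygons nothing gets filtered out, but for arbitrary shapes some of the 201 points per polygon would be discarded.

SELECT
    a.id,
    b.*
FROM (
    VALUES
        (1, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, -1 0, -1 -1, 0 -1, 0 0))'), 4326)),
        (2, ST_SetSRID(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'), 4326))
) a(id, geom)
CROSS JOIN LATERAL (
    SELECT ST_SetSRID(ST_MakePoint(tmp.x, tmp.y), 4326) geom
    FROM (
        SELECT
            random() * (ST_XMax(a.geom) - ST_XMin(a.geom)) + ST_XMin(a.geom) x,
            random() * (ST_YMax(a.geom) - ST_YMin(a.geom)) + ST_YMin(a.geom) y
        FROM generate_series(0, 200)
    ) tmp
) b
WHERE ST_Contains(a.geom, b.geom); -- keep only points that actually fall inside the polygon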

Syncing Two PostgreSQL Databases Faster

Written on Jul 17, 2016 and marked as postgresql, bash | SQL

Imagine you run two database machines hosting structurally identical databases on two separate servers and you need to transfer data from one to the other. Not very often, let’s say once a month. Your tables are neither small nor huge, let’s say millions of rows in general.

You’re going to use pg_dump and pipe it to psql, but the indices on your tables will slow you down a lot.

That’s why you’ll want to drop all indices and constraints (drop_indices_constraints.sql):

SELECT DISTINCT 'ALTER TABLE ' || -- DISTINCT, because multi-column constraints appear once per column in constraint_column_usage
    tc.table_schema ||
    '.' ||
    tc.table_name ||
    ' DROP CONSTRAINT ' ||
    tc.constraint_name  ||
    ';'
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu ON (tc.constraint_catalog = ccu.constraint_catalog AND tc.constraint_schema = ccu.constraint_schema AND tc.constraint_name = ccu.constraint_name)
WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
UNION ALL
SELECT
    'DROP INDEX IF EXISTS ' || schemaname || '.' || indexname || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ',')));

Then you will transfer the data:

pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target

And restore the already dropped indices and constraints (create_indices_constraints.sql):

WITH constraints AS (
SELECT 'ALTER TABLE ' ||
    tc.table_schema ||
    '.' ||
    tc.table_name ||
    ' ADD CONSTRAINT ' ||
    tc.constraint_name ||
    ' ' ||
    tc.constraint_type ||
    '(' ||
    string_agg(ccu.column_name::text, ', ')|| -- column order should be taken into account here
    ');' def,
    tc.table_schema,
    tc.table_name,
    tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu ON (tc.constraint_catalog = ccu.constraint_catalog AND tc.constraint_schema = ccu.constraint_schema AND tc.constraint_name = ccu.constraint_name)
WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
    AND tc.constraint_type = 'PRIMARY KEY'
GROUP BY
    tc.table_schema,
    tc.table_name,
    tc.constraint_name,
    tc.constraint_type
)
SELECT def FROM constraints
UNION ALL
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ','))) 
AND NOT EXISTS (
    SELECT 1 FROM
    constraints c
    WHERE pg_indexes.schemaname = c.table_schema
        AND pg_indexes.tablename = c.table_name
        AND pg_indexes.indexname = c.constraint_name
);

A few sidenotes

  1. Generate the CREATE statements (create_indices_constraints.sql) before you drop anything - otherwise there is nothing left to read the definitions from. If you forget, you can still run it against the source database.
  2. Notice the :'schemas' variable. Variable assignment is one of the psql features I really like.
  3. Notice DROP INDEX IF EXISTS in the drop code and the CTE in the create code - dropping a constraint also drops its underlying index, and you don’t want to drop something that doesn’t exist or create something that already exists.

The proposed bash script might look as follows:

# store indices and constraint definitions
psql -qAt -d target -v schemas='schema1','schema2' -f create_indices_constraints.sql > create.sql

# drop indices and constraints
psql -qAt -d target -v schemas='schema1','schema2' -f drop_indices_constraints.sql | psql -d target

# load data
pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target

# renew indices and constraints
psql -qAt -d target -f create.sql

Testing PostgreSQL OGR FDW

Written on Jul 1, 2016 and marked as postgresql, gdal | SQL

PostgreSQL foreign data wrappers are used to connect a PostgreSQL database to different data sources, e.g. other SQL databases, CSV files, XLS spreadsheets and so on.

The one I’ve been interested in for several months is Paul Ramsey’s OGR FDW - it gives you access to OGR-supported spatial formats directly from your database. No more shapefiles lying around?

Each foreign data wrapper should have three basic components:

  • foreign server object
  • foreign user mapping - not necessary if you’re not connecting to another database
  • foreign table(s)

I got some data about rivers and dams from DIBAVOD open datasets to play with.

First define the foreign server object:

CREATE SERVER dibavod
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
    datasource '/downloads/dibavod',
    format 'ESRI Shapefile',
    config_options 'SHAPE_ENCODING=CP1250'
);

Note that OGR driver-specific configuration options are available inside config_options. In the case of ESRI Shapefiles, the datasource is the directory your files reside in.

Let’s create the foreign tables (use ogrinfo or Paul’s ogr_fdw_info to list the columns):

CREATE FOREIGN TABLE rivers (
    fid integer,
    utokj_id numeric,
    utokjn_id numeric,
    utokjn_f numeric,
    prprop_z integer,
    ex_jh integer,
    pozn text,
    shape_leng numeric,
    naz_tok text,
    idvt integer,
    tok_id numeric,
    shape_len numeric,
    geom geometry(LINESTRING, 5514)
)
SERVER dibavod
OPTIONS (layer 'A02_Vodni_tok_JU');

CREATE FOREIGN TABLE dams (
    fid integer,
    objectid integer,
    naz_na text,
    nadr_gid numeric,
    kota_hladi numeric,
    hloubka numeric,
    zatop_ploc numeric,
    objem numeric,
    kota_hraz numeric,
    kota_preli numeric,
    kota_vypus numeric,
    plocha_m2 numeric,
    shape_area numeric,
    shape_len numeric,
    geom geometry(MULTIPOLYGON, 5514)
)
SERVER dibavod
OPTIONS (layer 'A05_Vodni_nadrze');

Note the fid column - it is required for write access to the underlying datasource.
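
If your ogr_fdw build supports writing back to the datasource (which is what the fid column is there for), a write could look like the sketch below; the attribute value is made up.

-- a sketch only: update a single dam through the foreign table,
-- assuming the ogr_fdw version in use supports writes to Shapefiles
UPDATE dams
SET naz_na = 'Brno reservoir'
WHERE fid = 1;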

Things to remember:

  • foreign tables mean no constraints or indices
  • no indices means spatial queries are terribly slow compared to native PostGIS tables
  • I like the idea of CREATE UNLOGGED TABLE dams2 AS SELECT * FROM dams (sketched below), though I’m not sure what to use it for yet
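
A minimal sketch of that idea: copy the foreign table into a local unlogged table and give it the spatial index the foreign table cannot have, so spatial queries run at native PostGIS speed (the index name is arbitrary).

-- unlogged = no WAL, fast to fill, but not crash-safe
CREATE UNLOGGED TABLE dams2 AS
SELECT * FROM dams;

CREATE INDEX dams2_geom_idx ON dams2 USING gist (geom);

ANALYZE dams2;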

A Month of Commuting on My Own

Written on Apr 9, 2016 and marked as spatial, visualization | space

I’ve been sort of living in Brno for the last 7 years (college included). It’s quite a hilly city, with lots of cars, a very good public transportation system and ever-improving cycling infrastructure. All these years I was using trams, buses and trolleybuses to get myself from one place to another.

These are all great, because:

  • you can read while you ride
  • you are generally faster than a car during rush hour
  • you don’t waste your time trying to park your car

These all suck, because:

  • public transportation is quite expensive in Brno
  • people stink in summer
  • some people stink in winter as well
  • you usually have to change several lines to actually get where you want
  • they’re crowded

My period card expired on March 8 and I decided not to renew it. Why? See the list above. As I don’t have a car and I work at the far end of the city, I can either ride a bike or run to work. How has it been for the first month? Not bad at all.

Figure: March Strava log.

Figure: April Strava log.

What’s so great about commuting?

Not so long ago I considered commuting a waste of time. It took me 40-50 minutes to get to work and about the same to get back home. That’s 1-2 hours of not being productive, not doing anything at all actually, just changing places.

That’s a terrible mistake to make. It’s much better to see this time as an opportunity to do that little extra for yourself - walk, run, ride. Even though it takes me a bit longer than public transport (showering and dressing included), it leaves me with a totally different state of mind in the end - it just starts me up (hello, Rolling Stones).

You can go for a ride right from work. That’s priceless.

As a by-product I started to care more about what I eat and when I eat it. I actually spend time cooking so I get enough food during the day. Something I didn’t do before, because you can always buy something sweet before the bus comes, right?

Figure: Daily commute in Brno: bike in pink, run in green. See the full version.

What’s not so great about commuting?

The weather, especially in spring and autumn, often sucks. Sometimes I come to work soaking wet; nothing a hot shower wouldn’t fix though. Someone still needs to clean the bike…

Traffic sucks in the evening. I get up before six and leave home before half past six, thus avoiding the heavy traffic. Riding a bike home in the evening is sometimes threatening, and a bit of mutual respect between pedestrians, cyclists and drivers would help.

Cycling paths sometimes end right before a big crossroads. Drivers often use parts of the network as parking lanes, which suddenly puts you in danger.

Other cyclists, skaters, people walking their dogs and little kids usually don’t care about you at all. You’d better not get distracted if you want to get home safe and sound.

Books are hard to read on the bike.

Does it tell you something about your city?

I guess the city you see on foot or from atop a saddle is completely different from the one seen from a bus or a car.

Is it rather car or bike friendly? Do you feel at risk riding a bike or running? Is it faster to run/ride or drive? Does your city actually want you to leave your car at home at all, or has it been designed for cars?

A river seems to be a blessing when your city has one (unless a flood strikes, but that’s a different story). If done right, its shores might become one of the most beautiful parts of the city. Something where Brno needs to catch up with other cities.

I hope one day I’ll get up and see Brno changing in front of me. Just like Paris is right now. We all die in the end, so why not take a walk before we do?

Do You Really Need Gulp? Or Grunt? Or Bower? Or What?

Written on Mar 20, 2016 and marked as javascript | development

Disclaimer: I’m an enthusiastic developer, but I do not code for a living. I’m just an ordinary guy who keeps editing the wrong file, wondering why the heck the changes are not being applied.

TL;DR: I do think npm might be the answer.

Wonderful world of JavaScript DevOps

When I first started using JavaScript on the server side with node.js, I felt overwhelmed by the numerous options to automate tasks. There was npm taking care of backend dependencies. Then I built a frontend and found out about bower for handling frontend dependencies. Then it would be great to have some kind of minification/obfuscation/uglification/you-name-it task. And the build task. And the build:prod task. And how about an eslint task? And then I would end up spending hours doing nothing, just reading blogs about the tools being used by others who do code for a living.

Intermezzo: I think my coding is slow. Definitely slower than yours. I’m getting better though.

Using the force

Looking back I find it a bit stressful - how the heck do I choose the right tools? Where’s Yoda to help me out? Anyway, next to adopt after npm was bower. And I liked it, even though some packages were missing - but who cares as long as there is no better way, right? Except there is… I guess.

Automation was next in line to tackle. So I chose gulp without a moment of hesitation. It was the hype, bigger than grunt back then. I had even heard of yeoman, but to this day I still don’t know what it actually does. And I’m happy with that.

A short summary so far:

  • npm for backend dependencies
  • bower for frontend dependencies
  • gulp for running tasks

So far, so good.

Is Bower going to die?

Then I stumbled upon this tweet and started panicking. Or rather started to feel cheated. It took me time to set all this up and now it’s useless? Or what?

Seeing it now, I’m glad I read this. And I really don’t know what happened to Bower, if anything at all.

Keeping it simple

So Bower’s dying, what are you going to do about that? You’ll use npm instead! And you’ll have a single source of truth called package.json. You’ll resolve all the dependencies with a single npm install command and feel like a king. We’re down to two now - npm and gulp.

Gulp, Gulp everywhere!

When you get rid of Bower, the next thing you notice is that your gulpfile.js has got off the leash. It grew to ~160 lines of code and became a nightmare to manage.

So you split it into task files and a config file. What a relief. But you still realize that half of your package.json dependencies start with gulp-. And you hate it.

Webpack for the win

For me, a non-developer, setting up webpack wasn’t easy. I didn’t find the docs very helpful either. Reading the website for the first time, I didn’t even understand what it should be used for. I got it working eventually. And I got rid of gulp, gulp-connect, gulp-less, gulp-nodemon, gulp-rename, gulp-replace, gulp-task-listing and gutil. And the whole gulpfile.js. That was a big win for me.

But how do you run tasks?

Well…

npm run start-dev # which in turn calls the code below
npm run start-webpack & NODE_ENV=development nodemon server.js # where start-webpack does the following
node_modules/webpack-dev-server/bin/webpack-dev-server.js --quiet --inline --hot --watch

That’s it. If I need to build the code, I run npm run build, which calls some other tasks from the scripts section of package.json.

That’s pretty much it. I don’t think it’s a silver bullet, but I feel like I finally found peace of mind for my future JavaScript development. At least for a month or so before some other guy comes to town.

How to convert DGN to Tiff with GDAL

Written on Feb 21, 2016 and marked as gdal | automation

We have to deal with DGN drawings quite often at CleverMaps - heavily used for infrastructure projects (highways, roads, pipelines), they are a pure nightmare to the GIS person inside me. Right now, I’m only capable of converting them into raster files and serving them with GeoServer. The transformation from DGN to PDF to PNG to Tiff is not something that makes me utterly happy though.

All you need to do the same is GDAL, ImageMagick, some PDF documents created out of DGN files - something MicroStation can help you with - and their upper left and lower right corner coordinates.

# I recommend putting some limits on ImageMagick - it tends to eat up all the resources and quit
export MAGICK_MEMORY_LIMIT=1512
export MAGICK_MAP_LIMIT=512
export MAGICK_AREA_LIMIT=1024
export MAGICK_FILES_LIMIT=512
export MAGICK_TMPDIR=/partition/large/enough

# I expect two files on the input: the first is the PDF file with the drawing, the second is a simple text file with four coordinates on a single line in the following order: upper left x, upper left y, lower right x, lower right y
INPUT=${1:?"PDF file path"}
COORDS=${2:?"Bounding box file path"}
OUTPUTDIRNAME=$(dirname $INPUT)
OUTPUTFILENAME=$(basename $INPUT | cut -d. -f1).png
OUTPUTPATH=$OUTPUTDIRNAME/$OUTPUTFILENAME

# create PNG image - I actually don't remember why it didn't work directly to Tiff
# EPSG:5514 is the Czech local CRS; -a_ullr reads the corner coordinates from the file
gdal_translate \
    -co WORLDFILE=YES \
    -co ZLEVEL=5 \
    -of PNG \
    --config GDAL_CACHEMAX 500 \
    --config GDAL_PDF_DPI 300 \
    -a_srs EPSG:5514 \
    -a_ullr $(echo $(cat $COORDS)) \
    $INPUT \
    $OUTPUTPATH

# convert to Tiff image - drawings come with a white background, hence -transparent white
convert \
    -define tiff:tile-geometry=256x256 \
    -transparent white \
    $OUTPUTPATH \
    ${OUTPUTPATH/.png}_alpha.tif

# build overviews to speed things up
gdaladdo ${OUTPUTPATH/.png}_alpha.tif 2 4 8 16 32

And you’re done. A .wld file will be present for each resulting file. I rename it manually to match the name of the GeoTiff - that should probably be done automatically as well.

Looking for the Next Row with PostgreSQL

Written on Jan 23, 2016 and marked as postgresql, sql | SQL

Using JOIN clause

All my GIS life I’ve been using a simple JOIN clause to find a row with an id = previous_id + 1. In other words, imagine a simple table with no indices:

CREATE TABLE test (id integer);
INSERT INTO test SELECT i FROM generate_series(1,10000000) i;

Let’s retrieve next row for each row in that table:

SELECT a.id, b.id
FROM test a
LEFT JOIN test b ON (a.id + 1 = b.id); -- note the LEFT JOIN is needed to get the last row as well

Execution plan looks like this:

Hash Join  (cost=311087.17..953199.41 rows=10088363 width=8) (actual time=25440.770..79591.869 rows=10000000 loops=1)
   Hash Cond: ((a.id + 1) = b.id)
   ->  Seq Scan on test a  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.588..10801.584 rows=10000001 loops=1)
   ->  Hash  (cost=145574.63..145574.63 rows=10088363 width=4) (actual time=25415.282..25415.282 rows=10000001 loops=1)
         Buckets: 16384  Batches: 128  Memory Usage: 2778kB
         ->  Seq Scan on test b  (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.422..11356.108 rows=10000001 loops=1)
 Planning time: 0.155 ms
 Execution time: 90134.248 ms

If we add an index with CREATE INDEX ON test (id), the plan changes:

Merge Join  (cost=0.87..669369.85 rows=9999844 width=8) (actual time=0.035..56219.294 rows=10000001 loops=1)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using test_id_idx on test a  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.015..11101.937 rows=10000001 loops=1)
         Heap Fetches: 0
   ->  Index Only Scan using test_id_idx on test b  (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.012..11827.895 rows=10000001 loops=1)
         Heap Fetches: 0
 Planning time: 0.244 ms
 Execution time: 65973.421 ms

Not bad.

Using window function

Window functions are real fun. They’re great if you’re doing counts, sums or ranks by groups. And, to my surprise, they’re great at finding next rows as well.

With the same test table, we retrieve next row for each row with the following query:

SELECT id, lead(id) OVER (ORDER BY id)
FROM test;

How does that score without an index? Better than the JOIN clause.

WindowAgg  (cost=1581246.90..1756294.50 rows=10002720 width=4) (actual time=28785.388..63819.071 rows=10000001 loops=1)
   ->  Sort  (cost=1581246.90..1606253.70 rows=10002720 width=4) (actual time=28785.354..40117.899 rows=10000001 loops=1)
         Sort Key: id
         Sort Method: external merge  Disk: 136848kB
         ->  Seq Scan on test  (cost=0.00..144718.20 rows=10002720 width=4) (actual time=0.020..10797.961 rows=10000001 loops=1)
 Planning time: 0.242 ms
 Execution time: 73391.024 ms

And it works even better if indexed. It’s actually ~1.5× faster than the JOIN way.

WindowAgg  (cost=0.43..409770.03 rows=10002720 width=4) (actual time=0.087..35647.815 rows=10000001 loops=1)
   ->  Index Only Scan using test_id_idx on test  (cost=0.43..259729.23 rows=10002720 width=4) (actual time=0.059..11310.879 rows=10000001 loops=1)
         Heap Fetches: 0
 Planning time: 0.247 ms
 Execution time: 45388.202 ms

It reads well and the purpose of such a query is pretty obvious.
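
The same pattern works in the opposite direction too; a minimal sketch against the same test table:

-- lag() looks backwards the same way lead() looks forwards
SELECT id, lag(id) OVER (ORDER BY id)
FROM test;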

Liftago Open Dataset Infographics

Written on Dec 21, 2015 and marked as python, postgis, svg, visualization | data

Liftago (the Czech equivalent of Uber) has recently released a sample of its data covering four weeks of driver/passenger interactions.

Have a look at my infographics created with PostGIS, Inkscape, Python and pygal.