Michal ZimmermannPieces of knowledge from the world of GIS.

Articles in the development category


Until recently, SQL IN and EXISTS were almost exactly the same to me. There is a significant difference both in execution plans and time of execution though, as I found out after not being able to speed up my workmate’s query.

Assume two not-as-small-as-they-might-be tables:


SELECT * FROM generate_series(0, 500000) id;

SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);


To find out what rows from test.big is missing in test.small, you’ll use one of these queries:

FROM test.big
WHERE id NOT IN (SELECT id FROM test.small);

                            QUERY PLAN
Seq Scan on big  (cost=8463.01..42313.02 rows=1000000 width=4) (actual time=177.061..864.043 rows=1500894 loops=1)
    Filter: (NOT (hashed SubPlan 1))
    Rows Removed by Filter: 499107
    SubPlan 1
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.045..34.727 rows=500001 loops=1)
    Total runtime: 904.413 ms
(6 rows)

FROM test.big
    SELECT 1
    FROM test.small
    WHERE test.big.id = test.small.id
                            QUERY PLAN
Hash Anti Join  (cost=15417.02..82100.58 rows=955189 width=4) (actual time=100.257..1240.343 rows=1500894 loops=1)
    Hash Cond: (big.id = small.id)
    ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4) (actual time=0.016..125.024 rows=2000001 loops=1)
    ->  Hash  (cost=7213.01..7213.01 rows=500001 width=4) (actual time=100.068..100.068 rows=500001 loops=1)
        Buckets: 65536  Batches: 2  Memory Usage: 8800kB
        ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.011..35.543 rows=500001 loops=1)
Total runtime: 1280.609 ms

That’s not a significant difference in time execution, is it?

What if you want to find out what rows from test.small is missing in test.big?

FROM test.small
WHERE id NOT IN (SELECT id FROM test.big);

                                QUERY PLAN
Seq Scan on small  (cost=0.00..12915788669.52 rows=250000 width=4)
    Filter: (NOT (SubPlan 1))
    SubPlan 1
    ->  Materialize  (cost=0.00..46663.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)

FROM test.small
    SELECT 1
    FROM test.big
    WHERE test.big.id = test.small.id

                               QUERY PLAN
Hash Anti Join  (cost=61663.02..180597.23 rows=1 width=4)
    Hash Cond: (small.id = big.id)
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4)
    ->  Hash  (cost=28850.01..28850.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)

It took me ~750 ms to get the result with EXISTS expression. I kept IN running whole night with no result. I’m not really sure why IN is so much slower, it might be caused by checks for NULL values. The speed is also related to the size of the subquery, thus the difference when tables were switched.

LEFT JOIN can be used to achieve the same result, I find its syntax less obvious though.

No indexes were built this time, I know they don’t help the IN performance at all from my previous tests. Tested with PostgreSQL 9.3.9.

SSH GRASS Processing Status Check

I’ve been running some GRASS/PostGIS computations on a remote server that were taking hours to finish. Once in a while I checked for their state by issuing tail log_XX.log from my laptop to see if they were ready yet. It suddenly became pretty annoying to check five different logs every ten minutes.

Instead of waiting and checking the logs, I thought it would be great to automate this. And it would be awesome if checking was fun. So I wrote a simple routine that takes log number as an argument (every process logs to a separate logfile) and checks it every minute until it says done. Right after that notify-send gives me a neat popup and Queen starts playing their We are the champions thanks to mpg123.

#!/usr/bin/env bash

while true; do
    echo "############ ${item} ############"
    x=$(ssh [email protected] "tail -n 30 path/to/my/log_${item}.log")

    if [[ $x == *"done"* ]]
            notify-send -u critical "Finally ${item}"
            mpg123 -n 250 ~/Music/queen-we_are_the_champions.mp3
        else echo "Not yet"
    sleep 60

What seemed to be really frustrating makes me happy right now. Unless Freddie starts singing in the middle of the night.

Hosting Website On Openshift

I decided to migrate my web to OpenShift. It was a bit frustrating but I got it working eventually.

Things to know before taking the leap

Some domain providers don’t support CNAME changes for root domains (zimmi.cz in my case). This means you can’t simply tell your domain to serve content from OpenShift address. But what you can do is to tell your www subdomain to do so:

www.zimmi.cz CNAME hp-zimmi.rhcloud.com

Which is great until you realize you’ve just created two different websites. That’s where wwwizer lends you a hand and lets you redirect your naked domain to your www domain:

zimmi.cz A

Now everything works fine and you have your www.domain.tld up and running.

OpenShift subdomains

I wasn’t successful creating a subdomain on the same application where I run my domain. This can be easily solved by creating another application and pointing DNS to it:

posts.zimmi.cz A
www.posts.zimmi.cz CNAME posts-zimmi.rhcloud.com

Just don’t forget to handle both naked and www version. When Google reindexes new URLs (http://www.zimmi.cz/posts instead of http://posts.zimmi.cz) subdomain application might be deleted.

Leaflet With Custom CRS (EPSG:5514)

If you ever find yourself in need to use custom projection with Leaflet, feel free to start with this example of Czech national coordinate system. All you need is Leaflet, proj4.js and proj4 for Leaflet plugin. I’m still not sure how origin coordinates work though.

Bash: Prepend To Filename

for f in *; do mv "$f" "prepend_$f"; done

Whenever you need to prepend anything to your files.