Michal ZimmermannPieces of knowledge from the world of GIS.

Articles in the development category

PostgreSQL IN vs EXISTS

Until recently, SQL IN and EXISTS were almost exactly the same to me. There is a significant difference both in execution plans and time of execution though, as I found out after not being able to speed up my workmate’s query.

Assume two not-as-small-as-they-might-be tables:

BEGIN;

CREATE UNLOGGED TABLE test.small AS
SELECT * FROM generate_series(0, 500000) id;

CREATE UNLOGGED TABLE test.big AS
SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);

COMMIT;

To find out what rows from test.big is missing in test.small, you’ll use one of these queries:

SELECT id
FROM test.big
WHERE id NOT IN (SELECT id FROM test.small);

                            QUERY PLAN
-----------------------------------------------------------------------------------------
Seq Scan on big  (cost=8463.01..42313.02 rows=1000000 width=4) (actual time=177.061..864.043 rows=1500894 loops=1)
    Filter: (NOT (hashed SubPlan 1))
    Rows Removed by Filter: 499107
    SubPlan 1
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.045..34.727 rows=500001 loops=1)
    Total runtime: 904.413 ms
(6 rows)


SELECT id
FROM test.big
WHERE NOT EXISTS (
    SELECT 1
    FROM test.small
    WHERE test.big.id = test.small.id
);
                            QUERY PLAN
-----------------------------------------------------------------------------------------
Hash Anti Join  (cost=15417.02..82100.58 rows=955189 width=4) (actual time=100.257..1240.343 rows=1500894 loops=1)
    Hash Cond: (big.id = small.id)
    ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4) (actual time=0.016..125.024 rows=2000001 loops=1)
    ->  Hash  (cost=7213.01..7213.01 rows=500001 width=4) (actual time=100.068..100.068 rows=500001 loops=1)
        Buckets: 65536  Batches: 2  Memory Usage: 8800kB
        ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4) (actual time=0.011..35.543 rows=500001 loops=1)
Total runtime: 1280.609 ms

That’s not a significant difference in time execution, is it?

What if you want to find out what rows from test.small is missing in test.big?

SELECT id
FROM test.small
WHERE id NOT IN (SELECT id FROM test.big);

                                QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on small  (cost=0.00..12915788669.52 rows=250000 width=4)
    Filter: (NOT (SubPlan 1))
    SubPlan 1
    ->  Materialize  (cost=0.00..46663.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)


SELECT id
FROM test.small
WHERE NOT EXISTS (
    SELECT 1
    FROM test.big
    WHERE test.big.id = test.small.id
);

                               QUERY PLAN
-------------------------------------------------------------------------
Hash Anti Join  (cost=61663.02..180597.23 rows=1 width=4)
    Hash Cond: (small.id = big.id)
    ->  Seq Scan on small  (cost=0.00..7213.01 rows=500001 width=4)
    ->  Hash  (cost=28850.01..28850.01 rows=2000001 width=4)
        ->  Seq Scan on big  (cost=0.00..28850.01 rows=2000001 width=4)
(5 rows)

It took me ~750 ms to get the result with EXISTS expression. I kept IN running whole night with no result. I’m not really sure why IN is so much slower, it might be caused by checks for NULL values. The speed is also related to the size of the subquery, thus the difference when tables were switched.

LEFT JOIN can be used to achieve the same result, I find its syntax less obvious though.

No indexes were built this time, I know they don’t help the IN performance at all from my previous tests. Tested with PostgreSQL 9.3.9.

SSH GRASS Processing Status Check

I’ve been running some GRASS/PostGIS computations on a remote server that were taking hours to finish. Once in a while I checked for their state by issuing tail log_XX.log from my laptop to see if they were ready yet. It suddenly became pretty annoying to check five different logs every ten minutes.

Instead of waiting and checking the logs, I thought it would be great to automate this. And it would be awesome if checking was fun. So I wrote a simple routine that takes log number as an argument (every process logs to a separate logfile) and checks it every minute until it says done. Right after that notify-send gives me a neat popup and Queen starts playing their We are the champions thanks to mpg123.

#!/usr/bin/env bash
item=$1

while true; do
    echo "############ ${item} ############"
    x=$(ssh [email protected] "tail -n 30 path/to/my/log_${item}.log")

    if [[ $x == *"done"* ]]
        then
            notify-send -u critical "Finally ${item}"
            mpg123 -n 250 ~/Music/queen-we_are_the_champions.mp3
            exit
        else echo "Not yet"
    fi
    sleep 60
done

What seemed to be really frustrating makes me happy right now. Unless Freddie starts singing in the middle of the night.

Hosting Website On Openshift

I decided to migrate my web to OpenShift. It was a bit frustrating but I got it working eventually.

Things to know before taking the leap

Some domain providers don’t support CNAME changes for root domains (zimmi.cz in my case). This means you can’t simply tell your domain to serve content from OpenShift address. But what you can do is to tell your www subdomain to do so:

www.zimmi.cz CNAME hp-zimmi.rhcloud.com

Which is great until you realize you’ve just created two different websites. That’s where wwwizer lends you a hand and lets you redirect your naked domain to your www domain:

zimmi.cz A 174.129.25.170

Now everything works fine and you have your www.domain.tld up and running.

OpenShift subdomains

I wasn’t successful creating a subdomain on the same application where I run my domain. This can be easily solved by creating another application and pointing DNS to it:

posts.zimmi.cz A 174.179.25.170
www.posts.zimmi.cz CNAME posts-zimmi.rhcloud.com

Just don’t forget to handle both naked and www version. When Google reindexes new URLs (http://www.zimmi.cz/posts instead of http://posts.zimmi.cz) subdomain application might be deleted.

Leaflet With Custom CRS (EPSG:5514)

If you ever find yourself in need to use custom projection with Leaflet, feel free to start with this example of Czech national coordinate system. All you need is Leaflet, proj4.js and proj4 for Leaflet plugin. I’m still not sure how origin coordinates work though.

Bash: Prepend To Filename

for f in *; do mv "$f" "prepend_$f"; done

Whenever you need to prepend anything to your files.