Michal Zimmermann Pieces of knowledge from the world of GIS.

Articles tagged with postgresql tag

Upgrading PostgreSQL 9.5 to PostgreSQL 9.6 with PostGIS

Written on Mar 1, 2017 and marked as sql, postgresql | SQL

Thanks to pg_upgrade tool the PostgreSQL upgrade on Ubuntu is pretty straightforward. Different PostGIS versions might cause troubles though. This post covers PostgreSQL 9.5, PostGIS 2.2 to PostgreSQL 9.6, PostGIS 2.3 migration.

First of all, install the PostgreSQL 9.6 with PostGIS 2.3.

apt install postgresql-9.6 postgresql-9.6-postgis-2.3

Mind that newly installed database cluster runs on port 5433.

If you run pg_upgrade at this stage, it will fail with the following error.

could not load library "$libdir/postgis_topology-2.2":
ERROR:  could not access file "$libdir/postgis_topology-2.2": No such file or directory

pg_upgrade can’t run the upgrade because PostGIS versions don’t match. Install the PostGIS 2.3 for PostgreSQL 9.5 and update extensions in all your databases.

apt install postgresql-9.5-postgis-2.3

:::sql
ALTER EXTENSION postgis UPDATE;

With both clusters using the same PostGIS version, the upgrade can begin. First, stop them with

service postgresql stop

Then, run the actual pg_upgrade command as postgres user. Make sure the pg_hba.conf file is set to allow local connections.

/usr/lib/postgresql/9.6/bin/pg_upgrade \
-b /usr/lib/postgresql/9.5/bin/ \
-B /usr/lib/postgresql/9.6/bin/ \
-d /var/lib/postgresql/9.5/main \
-D /var/lib/postgresql/9.6/main \
-o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' \
-O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'

The following result means the upgrade was smooth.

Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for roles starting with 'pg_'                      ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster
                                                            ok
Copying user relation files
                                                            ok
Setting next OID for new cluster                            ok
Sync data directory to disk                                 ok
Creating script to analyze new cluster                      ok
Creating script to delete old cluster                       ok

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
    ./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh

The old cluster can be removed and the new one switched back to port 5432. Run /usr/lib/postgresql/9.6/bin/vacuumdb -p 5433 --all --analyze-in-stages to collect statistics.

Executing dynamic SQL query right away

Written on Feb 28, 2017 and marked as sql, postgresql | SQL

PostgreSQL 9.6 comes with a handy psql command called \gexec that sends the current query input buffer to the server and treats the result as a SQL statement to be executed (right, whatever). What that means is that instead of doing this

psql -c "SELECT 'DROP TABLE ' || tablename FROM information_schema.tables WHERE table_name LIKE '%to_be_dropped%" | psql

you’ll do that

SELECT 'DROP TABLE ' || tablename FROM information_schema.tables WHERE table_name LIKE '%to_be_dropped%'\gexec

Brilliant.

Subdivide and Conquer: Effective Spatial Indexes in PostGIS

Written on Jan 10, 2017 and marked as sql, postgresql, postgis | SQL

Spatial indexes are absolutely crucial part of any spatial database and - as I tend to say quite often - only a fool would try to query spatial data without building spatial indexes beforehand.

Spatial indexes are based on bounding box comparisons, which are generally very fast. Yet, there are situations when spatial indexes don’t help much (or they don’t help as much as they could, if you wish).

Bounding box comparisons are effective with lots of small bounding boxes rather then few large ones. Why? See the picture above. The curved line (imagine it’s a pipeline for example) clearly demonstrates when the spatial index/bounding box comparison might fall short of what you’d expect.

Once the bounding box gets really big, it intersects so many other geometries’ bounding boxes that the whole comparison starts to slow down.

Luckily, PostGIS 2.2 introduced a ST_Subdivide function that can lend a helping hand in here.

Until today, we delivered the parcel geometries into our real estate acquisition process system with the following query, that takes all the geometries from the req_geom table (pipelines, remember?) and intersects them with cadastral parcels. The second part of the query adds those parcels that haven’t been digitalized and were created manually by one of my workmates.

INSERT INTO requested_parcels (uid, par_id)
SELECT
    reqs.uid,
    b.id par_id
 FROM
    running_requests reqs
 JOIN
    req_geom a ON (reqs.uid = a.uid)
 JOIN
    pargeo b ON (ST_Intersects(a.geom, b.geom))
 UNION
 SELECT
    reqs.uid,
    a.idpar::numeric
 FROM
    running_requests reqs
 JOIN
     req_man a ON (reqs.uid = a.uid);

It’s a perfectly standard query that intersects several request geometries with ~20M parcels, nothing really fancy. Except that it takes 25 minutes to finish. Why? Pipelines, remember?

Yet, the query below takes only 30 seconds to finish (that’s a huge time saver considering that the whole process used to take ~40 minutes)! Why? Because the ST_Subdivide effectively shrinks the req_geom geometries until they have 50 vertices each at most. Such small geometries are perfect input for the bounding box comparison. Remember to call DISTINCT when using ST_Subdivide, you’d probably get duplicate parcel ids otherwise.

I also replaced the UNION with the WHERE NOT EXISTS expression, as it’s reasonable to assume that numeric ids comparison will be faster.

INSERT INTO requested_parcels (uid, par_id)
SELECT DISTINCT
    reqs.uid,
    b.id par_id
 FROM
    running_requests reqs
 JOIN
    (
        SELECT
            uid,
            ST_Subdivide(geom, 50) geom
        FROM
            req_geom
     ) a ON (reqs.uid = a.uid)
 JOIN
     pargeo b ON (ST_Intersects(a.geom, b.geom));

 INSERT INTO requested_parcels (uid, par_id)
 SELECT
     reqs.uid,
     a.idpar::numeric
 FROM
     running_requests reqs
 JOIN
     req_man a ON (reqs.uid = a.uid)
 WHERE NOT EXISTS (
     SELECT 1
     FROM pozadovane_parcely pp
     WHERE pp.par_id = a.idpar
  );

Counting substring occurrences in PostgreSQL

Written on Dec 19, 2016 and marked as sql, postgresql | SQL

I got to count occurrences of / character today and found out no built-in function exists in PostgreSQL, so here’s my shot at it. Pretty simple, yet useful.

CREATE OR REPLACE FUNCTION how_many(IN text, IN varchar, OUT integer)
RETURNS integer
AS
$how_many$
    SELECT length($1) - length(replace($1, $2, ''));
$how_many$
LANGUAGE SQL
SECURITY DEFINER;

-- SELECT how_many('test', 't'); -- returns number 2

Plotting the Czech Cadastre Land Use with d3: Data Load (part III)

Written on Nov 15, 2016 and marked as postgresql, d3, javascript, svg | data

This post is the third part of the series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
  2. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
  3. you’re reading it now

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

See Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I).

Transform

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II).

Load

Thanks to the way I transformed the data, the whole load is done with simple

#!/bin/bash

touch ./data/data.js
echo "let data =" > ./data/data.js

(
cat << EOF | psql -qAt --no-psqlrc
    SELECT
    array_to_json(array_agg(row_to_json(r)))
    FROM (
    SELECT *
    FROM data
    ) r
EOF
) >> ./data/data.js

That’s the whole ETL process! Next time, I’ll cover the d3.js viz.

Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)

Written on Nov 14, 2016 and marked as javascript, d3, postgresql, svg | data

This post is the second part of the series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
  2. you’re reading it now
  3. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III)

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

See Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I).

Transform

Last time, I extracted the data from multiple CSV files to separate PostgreSQL tables named by data_YYYYMMDD pattern. My current goal is to transform it into the one big data table, where each row represents one cadastral area. Here’s what I’m trying to achieve:

-[ RECORD 1 ]----------+----------------------------------
ku_kod                 | 600881
ku_nazev               | Bantice
celkova_vymera         | {3763255,3763255,3763256,3763256}
celkovy_pocet_parcel   | {670,668,664,667}
chmelnice_pp           | {0,0,0,0}
chmelnice_pp_r         | {0.00,0.00,0.00,0.00}
chmelnice_v            | {0,0,0,0}
chmelnice_v_avg        | {0,0,0,0}
chmelnice_v_r          | {0.00,0.00,0.00,0.00}
lesni_pozemek_pp       | {25,25,25,25}
lesni_pozemek_pp_r     | {3.73,3.74,3.77,3.75}
lesni_pozemek_v        | {83879,83879,83879,83879}
lesni_pozemek_v_avg    | {3355,3355,3355,3355}
lesni_pozemek_v_r      | {2.23,2.23,2.23,2.23}
orna_puda_pp           | {88,88,89,89}
orna_puda_pp_r         | {13.13,13.17,13.40,13.34}
orna_puda_v            | {3066230,3066230,3066230,3066230}
orna_puda_v_avg        | {34844,34844,34452,34452}
orna_puda_v_r          | {81.48,81.48,81.48,81.48}
ostatni_plocha_pp      | {201,199,199,201}
ostatni_plocha_pp_r    | {30.00,29.79,29.97,30.13}
ostatni_plocha_v       | {283468,283468,283468,284562}
ostatni_plocha_v_avg   | {1410,1424,1424,1416}
ostatni_plocha_v_r     | {7.53,7.53,7.53,7.56}
ovocny_sad_pp          | {0,0,0,0}
ovocny_sad_pp_r        | {0.00,0.00,0.00,0.00}
ovocny_sad_v           | {0,0,0,0}
ovocny_sad_v_avg       | {0,0,0,0}
ovocny_sad_v_r         | {0.00,0.00,0.00,0.00}
ttp_pp                 | {44,44,44,45}
ttp_pp_r               | {6.57,6.59,6.63,6.75}
ttp_v                  | {49002,49002,49002,47908}
ttp_v_avg              | {1114,1114,1114,1065}
ttp_v_r                | {1.30,1.30,1.30,1.27}
vinice_pp              | {1,1,1,1}
vinice_pp_r            | {0.15,0.15,0.15,0.15}
vinice_v               | {106178,106178,106178,106178}
vinice_v_avg           | {106178,106178,106178,106178}
vinice_v_r             | {2.82,2.82,2.82,2.82}
vodni_plocha_pp        | {23,23,23,23}
vodni_plocha_pp_r      | {3.43,3.44,3.46,3.45}
vodni_plocha_v         | {27877,27877,27877,27877}
vodni_plocha_v_avg     | {1212,1212,1212,1212}
vodni_plocha_v_r       | {0.74,0.74,0.74,0.74}
zahrada_pp             | {115,115,115,115}
zahrada_pp_r           | {17.16,17.22,17.32,17.24}
zahrada_v              | {77381,77381,77353,77353}
zahrada_v_avg          | {673,673,673,673}
zahrada_v_r            | {2.06,2.06,2.06,2.06}
zastavena_plocha_pp    | {173,173,168,168}
zastavena_plocha_pp_r  | {25.82,25.90,25.30,25.19}
zastavena_plocha_v     | {69240,69240,69269,69269}
zastavena_plocha_v_avg | {400,400,412,412}
zastavena_plocha_v_r   | {1.84,1.84,1.84,1.84}

Several stats were calculated for each land use category (vinice → vineyard, ovocny_sad → orchard, …):

  • v_r suffix stands for land use area ratio
  • pp_r suffix stands for land use parcel count ratio
  • v_avg stands for average parcel area

All statistical columns are kept as PostgreSQL ARRAYs, ordered by dates (very handy for the future d3.js viz by the way).

Note that since the FULL OUTER JOIN is needed in the next step, SQLite can’t be used. Pity though.

The whole transformation bash script is the plain:

#!/bin/bash

psql -qAt --no-psqlrc -f transform.sql | psql -qAt --no-psqlrc

The transform.sql file is used to build the dynamic SQL query, which - once built - is piped to another psql command. I admit, pipes are super awesome.

WITH tables AS (
-- FULL OUTER JOIN all the data_YYYYMMDD tables
SELECT
    table_name,
    table_schema,
    'd' || id tbl,
    CASE WHEN id = 1
        THEN table_schema || '.' || table_name || ' d' || id
        ELSE 'FULL OUTER JOIN ' || table_schema || '.' || table_name || ' d' || id || ' ON (d1.ku_kod = d' || id || '.ku_kod)'
    END tbl_join
FROM (
    SELECT
        table_name,
        table_schema,
        row_number() OVER (ORDER BY table_name) id
    FROM information_schema.tables
    WHERE table_name LIKE 'data_%'
        AND table_type = 'BASE TABLE'
        AND table_schema = 'public'
) a
)
-- create data table with the correct values order for each statistical column
-- note that the whole process would crash if d1.ku_kod would be NULL -> @todo fix me
SELECT 'DROP TABLE IF EXISTS data;
    CREATE TABLE data AS
    SELECT d1.ku_kod, d1.ku_nazev,'
UNION ALL
SELECT
    array_to_string(array_agg(r), ', ') r
FROM (
    SELECT
    'ARRAY[' || array_to_string(array_agg(tables.tbl || '.' || columns.column_name ORDER BY tables.table_name), ', ') || ']' || ' ' || columns.column_name r
    FROM tables
    JOIN (
    SELECT
        table_schema,
        table_name,
        column_name
    FROM information_schema.columns
    WHERE column_name NOT LIKE 'ku_%'
    ORDER BY ordinal_position
    ) columns
    ON (tables.table_name = columns.table_name AND columns.table_schema = tables.table_schema)
    GROUP BY columns.column_name
) a
UNION ALL
SELECT 'FROM'
UNION ALL
SELECT tbl_join FROM tables;

psql -qAt --no-psqlrc -f transform.sql builds the actual query from the query above, | psql -qAt --no-psqlrc sends it to the database again. This part was really fun to implement!

I’m still considering to store diff values instead of absolute values in those ARRAYs - that would save some serious bandwidth!

Load

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III).

Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)

Written on Nov 13, 2016 and marked as javascript, d3, postgresql, svg | data

This post is the first part of the upcoming series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. you’re reading it now
  2. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
  3. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III)

Czech Office for Surveying, Mapping and Cadastre has recently published lot of data via Atom feed. There’s pretty small and a bit boring dataset included, featuring quarterly updated landuse-related values for all 13,091 cadastral areas:

  • absolute number of land lots within given category (arable land, forests, etc.)
  • absolute area of land lots within given category

Data are published as CSV files linked from the Atom feed. Sadly, they come windows-1250 encoded, using Windows line endings, with trailing semicolons and header rows using diacritics.

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

#!/bin/bash
# extract.sh -f YYYYMMDD

while [[ $# -gt 1 ]]
do
key="$1"

case $key in
    -f|--file)
    FILE="$2"
    shift # past argument
    ;;
    *)
        # unknown option
    ;;
esac
shift # past argument or value
done

URL=http://services.cuzk.cz/sestavy/UHDP/UHDP-
CSVFILE=$FILE.csv
CSVUTF8FILE=${CSVFILE%.*}.utf.csv
URL+=$CSVFILE

echo "downloading $URL"
wget -q $URL -O $CSVFILE

if [[ $? != 0 ]]; then
    rm -f $CSVFILE
    echo "download failed"
    exit
fi

echo "converting to utf-8"
iconv -f WINDOWS-1250 -t UTF-8 $CSVFILE -o $CSVUTF8FILE && \
echo "modifying ${FILE}"
sed -i 's/^M$//' $CSVUTF8FILE && \
sed -i 's/\r$//' $CSVUTF8FILE && \
sed -i 's/;*$//g' $CSVUTF8FILE && \
sed -i '1d' $CSVUTF8FILE

echo "importing to database"
sed -e "s/\${DATE}/$FILE/g" extract.sql | psql -qAt --no-psqlrc

rm $CSVFILE $CSVUTF8FILE

This script downloads CSV file, deals with all the pitfalls mentioned above and, when done, copy command within extract.sql loads the data into a data_YYYYMMDD table. Putting all the files into the one table would have saved me a lot of transformation SQL, yet it didn’t feel quite right though.

Transform

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II).

Load

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III).

PostGIS Custom Function to Create Wind Rose

Written on Sep 1, 2016 and marked as postgis, postgresql, sql | SQL

I’ve come across the beautiful GIS StackExchange question recently, asking how to draw a wind rose within PostGIS.

It’s pretty easy to accomplish this with a custom PLPGSQL procedure below, that takes line geometry, number of sections and radius of the inner circle as parameters.

CREATE OR REPLACE FUNCTION ST_WindRose(
    line geometry,
    directions int,
    radius numeric
)
RETURNS TABLE (
    id integer,
    geom geometry(LINESTRING)
)
AS $ST_WindRose$
BEGIN
    IF directions % 2 <> 0 THEN
        RAISE EXCEPTION 'Odd number of directions found, please provide even number of directions instead.';
    END IF;

IF radius > ST_Length(line) THEN
    RAISE EXCEPTION 'Inner circle radius is bigger than the wind rose diameter, please make it smaller.';
END IF;

RETURN QUERY
WITH rose AS (
    SELECT
        ST_Rotate(_line, radians(360) / directions * dirs.id, ST_Centroid(_line)) _line
    FROM (
        SELECT line _line
    ) a
    CROSS JOIN (
        SELECT generate_series(1, directions / 2) id
    ) dirs
)
SELECT
    row_number() OVER ()::integer id,
    _line geom
FROM (
    SELECT _line FROM rose
    UNION ALL
    SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), radius, 30)) -- inner circle
    UNION ALL
    SELECT ST_ExteriorRing(ST_Buffer(ST_Centroid(line), ST_Length(line)/2, 30)) -- outer circle
) a;
END
$ST_WindRose$
LANGUAGE PLPGSQL;

Wind rose created with this function might look like the one below.

Run it as follows. The line parameter should be a simple straight line made of just two vertices.

SELECT * FROM ST_WindRose(ST_MakeLine(ST_MakePoint(0,0), ST_MakePoint(0,1)), 12, 0.01);

PostGIS Custom Function to Create Polygon from Centroid

Written on Aug 28, 2016 and marked as postgis, postgresql, sql | SQL

Needed to create a polygon from a point defining its size in both axes, here’s a little syntax sugar to make life easier.

CREATE OR REPLACE FUNCTION ST_PolygonFromCentroid(centroid geometry, xsize numeric, ysize numeric)
RETURNS geometry
AS $ST_PolygonFromCentroid$
SELECT ST_MakeEnvelope(
    ST_X(ST_Translate($1, -$2, -$3)),
    ST_Y(ST_Translate($1, -$2, -$3)),
    ST_X(ST_Translate($1, $2, $3)),
    ST_Y(ST_Translate($1, $2, $3))
);
$ST_PolygonFromCentroid$
LANGUAGE SQL;

Run it as:

SELECT ST_PolygonFromCentroid(ST_SetSRID(ST_MakePoint(13.912,50.633),4326), 1, 1);

Finding Polygons Lying across Other Polygons with PostGIS

Written on Aug 5, 2016 and marked as postgis, postgresql, sql | SQL

Doing overlays (ST_Intersection()) in PostGIS based on spatial relationships (ST_Intersects(), ST_Contains(), …) is so easy it is not something you get particularly excited about.

Today I faced a bit more interesting task: given two polygon layers, get me all the polygons from layer A such that they lie across the polygons from layer B and… a picture worth a thousand words, right?

I hope you got the idea, it is fairly simple:

  1. Intersect A (red, blue) with B (green)
  2. Subtract the result of previous from layer A
  3. Combine results from steps 1 and 2
  4. Keep polygon only if its id occurs more than twice (that means it went straight through the layer B)
  5. Profit!
WITH overlays AS (
/* nothing fancy here */
    SELECT
        A.ogc_fid a_id,
        B.ogc_fid b_id,
        ST_Intersection(A.geom, B.geom) geom,
        ST_Area(ST_Intersection(A.geom, B.geom) area_shared
    FROM A
    JOIN B ON (ST_Intersects(A.geom, B.geom)
),
diffs AS (
/* note this is a 1:1 relationship in ST_Difference */
/* a little hack is needed to prevent PostGIS from returning its usual difference mess */
    SELECT
        o.a_id,
        o.b_id,
        (ST_Dump(ST_Difference(ST_Buffer(A.geom, -0.0001), o.geom))).geom, -- ugly hack
        o.area_shared
    FROM overlays o
    JOIN A ON (o.a_id = A.id)
),

merged AS (
/* put those two result sets together */
    SELECT * FROM overlays
    UNION ALL
    SELECT * FROM diffs
),

merged_reduced AS (
/* get only those A polygons that consist of three parts at least for each intersection with B polygon */
  SELECT
    m.*
  FROM merged m
  JOIN (
    SELECT
      a_id,
      b_id
    FROM merged
    GROUP BY a_id, b_id
    HAVING COUNT(1) > 2
  ) a ON (a.a_id = m.a_id AND a.b_id = m.b_id)
)
/* do as you wish with the result */
SELECT *
FROM merged_reduced;

In my case, centerlines of layer B were also included and their length inside each intersection was used to divide the area of the smallest part with. It was fun, actually.