Michal Zimmermann: Pieces of knowledge from the world of GIS.

Articles tagged with the javascript tag

Plotting the Czech Cadastre Land Use with d3: Data Viz (part IV)

Written on Nov 20, 2016 and marked as d3, javascript, svg | data

This post is the fourth part of the series summarizing the process of visualizing land use data with bash, PostgreSQL and d3.js. Read other parts:

  1. Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
  2. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
  3. Plotting the Czech Cadastre Land Use with d3: Data Load (part III)

Data visualization

Those of you who’ve been following this series know all the data are set and ready to be used. The rest of you, shame on you by the way, can go through the above posts to catch up.

The result is available at https://www.zimmi.cz/kn-landuse-monitor and works like the gif below.

Features

  • land use data for 13,093 cadastral areas between 2015/01/01 and 2016/10/01
  • relative area and parcel count per land use type
  • similar cadastres based on land use relative area values
  • time series plots for various characteristics (including agricultural land area and parcel count)

Todos

  • time series chart titles onmouseover
  • barchart titles onmouseover
  • absolute values chart (?)
  • fetch API polyfill
  • Firefox seems to be broken

Technologies

I implemented the whole app in vanilla JavaScript at first. It resided in a single Monitor variable and consisted of several modules that communicated with each other via custom events.
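The custom events part worked roughly like the sketch below - the event and module names are made up for illustration, not taken from the real code.

var Monitor = Monitor || {};

Monitor.Search = {
    select: function(code) {
        // tell the rest of the app a cadastral area was picked
        document.dispatchEvent(new CustomEvent("cadastre-selected", {
            detail: {ku_kod: code}
        }));
    }
};

Monitor.Dashboard = {
    init: function() {
        document.addEventListener("cadastre-selected", function(e) {
            console.log("drawing charts for", e.detail.ku_kod);
        });
    }
};

Monitor.Dashboard.init();
Monitor.Search.select(600881);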

So far, so good. Once the app was production-ready, I stumbled upon vue.js, which is by miles the best JavaScript framework experience I’ve had so far. Reinventing the app once again was a matter of two days (thanks to this amazing setup - hot reload included).

Thus, the current version of the app is based on:

vue.js

Thanks to the easy-to-understand system of components, properties and methods, the learning curve is really gentle. The app is now divided into several components (Search, Dashboard with child components for charts and the similar cadastres list).
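A component is little more than a template with some props and methods attached; a simplified sketch (made-up names, not the actual app code) might look like this:

// simplified component sketch, not the actual app code
Vue.component("similar-cadastres", {
    props: ["cadastre"],
    template: "<ul><li v-for='item in cadastre.similar' @click='select(item)'>{{ item.ku_nazev }}</li></ul>",
    methods: {
        select: function(item) {
            // let the parent component know another cadastre was picked
            this.$emit("selected", item.ku_kod);
        }
    }
});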

vuex

Vuex, probably inspired by Flux and Redux, is the “state management pattern + library”, the single source of truth for your app. That’s pretty much it: there’s only one place in your app (called the store) where you go to put or get your data. Not necessarily every single piece of data, just those pieces used across several components. It plays really nicely with vue.js.
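A minimal store can be as small as the sketch below (made-up state and mutation names, assuming vuex 2):

// minimal store sketch: one piece of state, one mutation to change it
const store = new Vuex.Store({
    state: {
        cadastre: null // currently selected cadastral area
    },
    mutations: {
        setCadastre: function(state, cadastre) {
            state.cadastre = cadastre;
        }
    }
});

// any component can write to or read from the single store
store.commit("setCadastre", {ku_kod: 600881, ku_nazev: "Bantice"});
console.log(store.state.cadastre.ku_nazev); // Bantice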

D3.js

I had tried D3.js before and it was really hard to grasp. And it still is, I guess. At the same time, it’s damn good at plotting the data. Yet, being a bit less low-level would be great.

Dexie

I hate writing servers for my pet projects. A server means no GitHub Pages. Thus, I decided to load the whole dataset with the fetch API from an external JSON file. Loading 13K objects × 30 properties × an array of 8 items each didn’t seem like the best idea ever, so… Here comes Dexie, an IndexedDB API wrapper that makes it easy on you (unlike the IndexedDB API itself, which doesn’t even let you find out whether the database you’re creating already exists. Seriously?).

Dexie loads the initial dataset into the IndexedDB storage and reads it from there every time the user comes back, without loading the JSON file again. On data change, the fresh file is loaded, the database flushed and the new data written. Behold; I hate the way it’s written.
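The whole flow is roughly the sketch below - the table name and fields are made up, and the localStorage flag is the trick mentioned at the bottom of this post that tells me whether the database has been filled already.

// rough sketch of the flow described above; table and field names are made up
const db = new Dexie("monitor");
db.version(1).stores({
    cadastres: "ku_kod" // primary key
});

function loadData() {
    if (localStorage.getItem("monitor-loaded")) {
        return db.cadastres.toArray(); // read straight from IndexedDB
    }

    return fetch("data/data.json")
        .then(function(response) { return response.json(); })
        .then(function(rows) {
            return db.cadastres.clear()
                .then(function() { return db.cadastres.bulkPut(rows); })
                .then(function() {
                    localStorage.setItem("monitor-loaded", "true");
                    return rows;
                });
        });
}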

Flex

I used flex for the first time, and I’m still not sure I understand how it actually works. CSS feels more complicated every time I need it.

Bottom line: I use localStorage to keep track of whether the IndexedDB database already exists.

Summary

Two pet projects completed in one month definitely means the winter is here! Looking forward to using more vue.js.

Plotting the Czech Cadastre Land Use with d3: Data Load (part III)

Written on Nov 15, 2016 and marked as postgresql, d3, javascript, svg | data

This post is the third part of the series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
  2. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
  3. you’re reading it now

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

See Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I).

Transform

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II).

Load

Thanks to the way I transformed the data, the whole load is done with a simple script:

#!/bin/bash

touch ./data/data.js
echo "let data =" > ./data/data.js

(
cat << EOF | psql -qAt --no-psqlrc
    SELECT
    array_to_json(array_agg(row_to_json(r)))
    FROM (
    SELECT *
    FROM data
    ) r
EOF
) >> ./data/data.js
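Since the generated file starts with let data =, the viz doesn’t even have to fetch anything - including it with a plain script tag should be enough (a rough sketch):

<script src="./data/data.js"></script>
<script>
    // data is now a global array with one object per cadastral area
    console.log(data.length, data[0].ku_nazev);
</script>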

That’s the whole ETL process! Next time, I’ll cover the d3.js viz.

Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)

Written on Nov 14, 2016 and marked as javascript, d3, postgresql, svg | data

This post is the second part of the series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
  2. you’re reading it now
  3. Plotting the Czech Cadastre Land Use with d3: Data Load (part III)

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

See Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I).

Transform

Last time, I extracted the data from multiple CSV files into separate PostgreSQL tables following the data_YYYYMMDD naming pattern. My current goal is to transform them into one big data table, where each row represents one cadastral area. Here’s what I’m trying to achieve:

-[ RECORD 1 ]----------+----------------------------------
ku_kod                 | 600881
ku_nazev               | Bantice
celkova_vymera         | {3763255,3763255,3763256,3763256}
celkovy_pocet_parcel   | {670,668,664,667}
chmelnice_pp           | {0,0,0,0}
chmelnice_pp_r         | {0.00,0.00,0.00,0.00}
chmelnice_v            | {0,0,0,0}
chmelnice_v_avg        | {0,0,0,0}
chmelnice_v_r          | {0.00,0.00,0.00,0.00}
lesni_pozemek_pp       | {25,25,25,25}
lesni_pozemek_pp_r     | {3.73,3.74,3.77,3.75}
lesni_pozemek_v        | {83879,83879,83879,83879}
lesni_pozemek_v_avg    | {3355,3355,3355,3355}
lesni_pozemek_v_r      | {2.23,2.23,2.23,2.23}
orna_puda_pp           | {88,88,89,89}
orna_puda_pp_r         | {13.13,13.17,13.40,13.34}
orna_puda_v            | {3066230,3066230,3066230,3066230}
orna_puda_v_avg        | {34844,34844,34452,34452}
orna_puda_v_r          | {81.48,81.48,81.48,81.48}
ostatni_plocha_pp      | {201,199,199,201}
ostatni_plocha_pp_r    | {30.00,29.79,29.97,30.13}
ostatni_plocha_v       | {283468,283468,283468,284562}
ostatni_plocha_v_avg   | {1410,1424,1424,1416}
ostatni_plocha_v_r     | {7.53,7.53,7.53,7.56}
ovocny_sad_pp          | {0,0,0,0}
ovocny_sad_pp_r        | {0.00,0.00,0.00,0.00}
ovocny_sad_v           | {0,0,0,0}
ovocny_sad_v_avg       | {0,0,0,0}
ovocny_sad_v_r         | {0.00,0.00,0.00,0.00}
ttp_pp                 | {44,44,44,45}
ttp_pp_r               | {6.57,6.59,6.63,6.75}
ttp_v                  | {49002,49002,49002,47908}
ttp_v_avg              | {1114,1114,1114,1065}
ttp_v_r                | {1.30,1.30,1.30,1.27}
vinice_pp              | {1,1,1,1}
vinice_pp_r            | {0.15,0.15,0.15,0.15}
vinice_v               | {106178,106178,106178,106178}
vinice_v_avg           | {106178,106178,106178,106178}
vinice_v_r             | {2.82,2.82,2.82,2.82}
vodni_plocha_pp        | {23,23,23,23}
vodni_plocha_pp_r      | {3.43,3.44,3.46,3.45}
vodni_plocha_v         | {27877,27877,27877,27877}
vodni_plocha_v_avg     | {1212,1212,1212,1212}
vodni_plocha_v_r       | {0.74,0.74,0.74,0.74}
zahrada_pp             | {115,115,115,115}
zahrada_pp_r           | {17.16,17.22,17.32,17.24}
zahrada_v              | {77381,77381,77353,77353}
zahrada_v_avg          | {673,673,673,673}
zahrada_v_r            | {2.06,2.06,2.06,2.06}
zastavena_plocha_pp    | {173,173,168,168}
zastavena_plocha_pp_r  | {25.82,25.90,25.30,25.19}
zastavena_plocha_v     | {69240,69240,69269,69269}
zastavena_plocha_v_avg | {400,400,412,412}
zastavena_plocha_v_r   | {1.84,1.84,1.84,1.84}

Several stats were calculated for each land use category (vinice → vineyard, ovocny_sad → orchard, …):

  • v_r suffix stands for land use area ratio
  • pp_r suffix stands for land use parcel count ratio
  • v_avg stands for average parcel area

All statistical columns are kept as PostgreSQL ARRAYs, ordered by dates (very handy for the future d3.js viz by the way).
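On the client side, the values only need to be zipped with the list of dates before they go into a line generator. A sketch of what I have in mind (assuming d3 v4; the dates and values below are just examples):

// pair the ordered array values with the known list of dates
const dates = ["2015-01-01", "2015-04-01", "2015-07-01", "2015-10-01"];
const values = [81.48, 81.48, 81.48, 81.48]; // e.g. orna_puda_v_r

const series = values.map(function(value, i) {
    return {date: new Date(dates[i]), value: value};
});

const x = d3.scaleTime()
    .domain(d3.extent(series, function(d) { return d.date; }))
    .range([0, 500]);
const y = d3.scaleLinear().domain([0, 100]).range([300, 0]);

const line = d3.line()
    .x(function(d) { return x(d.date); })
    .y(function(d) { return y(d.value); });

// line(series) returns the SVG path string for the time series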

Note that since a FULL OUTER JOIN is needed in the next step, SQLite can’t be used (it doesn’t support it). Pity though.

The whole transformation bash script is as plain as:

#!/bin/bash

psql -qAt --no-psqlrc -f transform.sql | psql -qAt --no-psqlrc

The transform.sql file is used to build the dynamic SQL query, which - once built - is piped to another psql command. I admit, pipes are super awesome.

WITH tables AS (
-- FULL OUTER JOIN all the data_YYYYMMDD tables
SELECT
    table_name,
    table_schema,
    'd' || id tbl,
    CASE WHEN id = 1
        THEN table_schema || '.' || table_name || ' d' || id
        ELSE 'FULL OUTER JOIN ' || table_schema || '.' || table_name || ' d' || id || ' ON (d1.ku_kod = d' || id || '.ku_kod)'
    END tbl_join
FROM (
    SELECT
        table_name,
        table_schema,
        row_number() OVER (ORDER BY table_name) id
    FROM information_schema.tables
    WHERE table_name LIKE 'data_%'
        AND table_type = 'BASE TABLE'
        AND table_schema = 'public'
) a
)
-- create data table with the correct values order for each statistical column
-- note that the whole process would crash if d1.ku_kod would be NULL -> @todo fix me
SELECT 'DROP TABLE IF EXISTS data;
    CREATE TABLE data AS
    SELECT d1.ku_kod, d1.ku_nazev,'
UNION ALL
SELECT
    array_to_string(array_agg(r), ', ') r
FROM (
    SELECT
    'ARRAY[' || array_to_string(array_agg(tables.tbl || '.' || columns.column_name ORDER BY tables.table_name), ', ') || ']' || ' ' || columns.column_name r
    FROM tables
    JOIN (
    SELECT
        table_schema,
        table_name,
        column_name
    FROM information_schema.columns
    WHERE column_name NOT LIKE 'ku_%'
    ORDER BY ordinal_position
    ) columns
    ON (tables.table_name = columns.table_name AND columns.table_schema = tables.table_schema)
    GROUP BY columns.column_name
) a
UNION ALL
SELECT 'FROM'
UNION ALL
SELECT tbl_join FROM tables;

psql -qAt --no-psqlrc -f transform.sql builds the actual query from the query above, and piping it to another psql -qAt --no-psqlrc sends it to the database. This part was really fun to implement!

I’m still considering storing diff values instead of absolute values in those ARRAYs - that would save some serious bandwidth!
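Rebuilding the absolute values from diffs on the client would be trivial anyway - roughly:

// the first item stays absolute, the rest are increments
const diffs = [3763255, 0, 1, 0]; // e.g. celkova_vymera stored as diffs
let total = 0;
const absolute = diffs.map(function(d) {
    total += d;
    return total;
});
// absolute -> [3763255, 3763255, 3763256, 3763256]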

Load

See Plotting the Czech Cadastre Land Use with d3: Data Load (part III).

Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)

Written on Nov 13, 2016 and marked as javascript, d3, postgresql, svg | data

This post is the first part of the upcoming series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:

  1. you’re reading it now
  2. Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
  3. Plotting the Czech Cadastre Land Use with d3: Data Load (part III)

The Czech Office for Surveying, Mapping and Cadastre has recently published a lot of data via an Atom feed. There’s a pretty small and a bit boring dataset included, featuring quarterly updated landuse-related values for all 13,091 cadastral areas:

  • absolute number of land lots within given category (arable land, forests, etc.)
  • absolute area of land lots within given category

Data are published as CSV files linked from the Atom feed. Sadly, they come windows-1250 encoded, using Windows line endings, with trailing semicolons and header rows using diacritics.

ETL process

Before the d3 viz can be crafted, it’s necessary to:

  1. extract CSV data from the URLs provided via the Atom feed
  2. transform those data into a relational database, do some math
  3. load data into a d3.js viz
  4. profit (as usual)

Extract

#!/bin/bash
# extract.sh -f YYYYMMDD

while [[ $# -gt 1 ]]
do
key="$1"

case $key in
    -f|--file)
    FILE="$2"
    shift # past argument
    ;;
    *)
        # unknown option
    ;;
esac
shift # past argument or value
done

URL=http://services.cuzk.cz/sestavy/UHDP/UHDP-
CSVFILE=$FILE.csv
CSVUTF8FILE=${CSVFILE%.*}.utf.csv
URL+=$CSVFILE

echo "downloading $URL"
wget -q $URL -O $CSVFILE

if [[ $? != 0 ]]; then
    rm -f $CSVFILE
    echo "download failed"
    exit 1
fi

echo "converting to utf-8"
iconv -f WINDOWS-1250 -t UTF-8 $CSVFILE -o $CSVUTF8FILE && \
echo "modifying ${FILE}"
sed -i 's/^M$//' $CSVUTF8FILE && \
sed -i 's/\r$//' $CSVUTF8FILE && \
sed -i 's/;*$//g' $CSVUTF8FILE && \
sed -i '1d' $CSVUTF8FILE

echo "importing to database"
sed -e "s/\${DATE}/$FILE/g" extract.sql | psql -qAt --no-psqlrc

rm $CSVFILE $CSVUTF8FILE

This script downloads the CSV file, deals with all the pitfalls mentioned above and, when done, the copy command within extract.sql loads the data into a data_YYYYMMDD table. Putting all the files into one table would have saved me a lot of transformation SQL, yet it didn’t feel quite right.

Transform

See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II).

Load

See Plotting the Czech Cadastre Land Use with d3: Data Load (part III).

Introducing Blind Maps Project

Written on Nov 2, 2016 and marked as javascript, leaflet | web maps

I’d like to introduce you to my little pet project, which might just as well be awarded the first pet project I’ve ever completed, called Blind maps.

It’s a very simple, yet useful web application built on top of the great Leaflet library, meant to help you get to know our world a bit better. As the name suggests, the app shows you, well… a blind map, and you try to fill in as many features as you can.

The app is ready and can be used:

  • online at Blind maps with the map of your choice (if available)
  • offline, downloaded to your computer and filled with whatever data you want

What I find great about this project is the ease of adding new datasets. For starters, I filled it with data coming from Natural Earth:

  • CONUS states
  • European states
  • World capitals

If you wish, feel free to send me a pull request with GeoJSON data, I’ll be happy to have more datasets available! The process is described at the project homepage.

As you’ll notice on the project homepage, there are two versions of the game available:

  • one lets you find map features by their names
  • the other one lets you type the name of the highlighted feature (much tougher)

Have fun!

Degrees To Decimal With Javascript Reworked

Written on Oct 28, 2016 and marked as javascript | development

Two years ago I was pretty happy with this little piece of code to transform degrees to decimal values. Yesterday, I found a neater way to do the same:

let deg = [50, 30, 0];

function degToDec(prev, cur, curIndex) {
    return prev + cur / Math.pow(60, curIndex);
}

deg.reduce(degToDec); // 50.5

Once you have an input array, that’s pretty much it. Love JavaScript.

Do You Really Need Gulp? Or Grunt? Or Bower? Or What?

Written on Mar 20, 2016 and marked as javascript | development

Disclaimer: I’m an enthusiastic developer, but I do not code for a living. I’m just an ordinary guy who keeps editing the wrong file, wondering why the heck the changes are not being applied.

TL;DR: I do think npm might be the answer.

Wonderful world of JavaScript DevOps

When I first started using JavaScript on the server side with node.js, I felt overwhelmed by the numerous options for automating tasks. There was npm taking care of backend dependencies. Then I would build a frontend and find out about bower for handling frontend dependencies. Then it would be great to have some kind of minification/obfuscation/uglification/you-name-it task. And the build task. And the build:prod task. And how about an eslint task? And then I would end up spending hours doing nothing, just reading blogs about the tools being used by others who do code for a living.

Intermezzo: I think my coding is slow. Definitely slower than yours. I’m getting better though.

Using the force

Looking back I find it a bit stressful - how the heck do I choose the right tools? Where’s Yoda to help me out? Anyway, the next to adopt after npm was bower. And I liked it, even though some packages were missing - but who cares as long as there is no better way, right? Except there is… I guess.

Automation was next in line to tackle. So I chose gulp without a bit of hesitation. It was all the hype back then, bigger than grunt. I had even heard of yeoman, but to this day I still don’t know what it actually does. And I’m happy with that.

A short summary so far:

  • npm for backend dependencies
  • bower for frontend dependencies
  • gulp for running tasks

So far, so good.

Is Bower going to die?

Then I stumbled upon this tweet and started panicking. Or rather started to feel cheated. It took me time to set all this up and now it’s useless? Or what?

Seeing it now, I’m glad I read this. And I really don’t know what happened to Bower, if anything at all.

Keeping it simple

So Bower’s dying, what are you going to do about that? You’ll use npm instead! And you’ll have a single source of truth called package.json. You’ll resolve all the dependencies with a single npm install command and feel like a king. We’re down to two now - npm and gulp.

Gulp, Gulp everywhere!

When you get rid of Bower, the next thing you notice is that your gulpfile.js just got off the leash. It grew to ~160 lines of code and became a nightmare to manage.

So you split it into task files and a config file. What a relief. But you still realize that half of your package.json dependencies start with gulp-. And you hate it.

Webpack for the win

For me, a non-developer, setting up webpack wasn’t easy. I didn’t find the docs very helpful either. Reading the website for the first time, I didn’t even understand what it should be used for. I got it working eventually. And I got rid of gulp, gulp-connect, gulp-less, gulp-nodemon, gulp-rename, gulp-replace, gulp-task-listing and gutil. And the whole gulpfile.js. That was a big win for me.

But how do you run tasks?

Well…

npm run start-dev # which in turn calls the code below
npm run start-webpack & NODE_ENV=development nodemon server.js # where start-webpack does the following
node_modules/webpack-dev-server/bin/webpack-dev-server.js --quiet --inline --hot --watch

That’s it. If I need to build the code, I run npm run build, which calls some other tasks from the scripts section of package.json.
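The relevant part of my package.json looks roughly like this (simplified; the build command here is just an example):

{
  "scripts": {
    "start-webpack": "node_modules/webpack-dev-server/bin/webpack-dev-server.js --quiet --inline --hot --watch",
    "start-dev": "npm run start-webpack & NODE_ENV=development nodemon server.js",
    "build": "NODE_ENV=production webpack -p"
  }
}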

That’s pretty much it. I don’t think it’s a silver bullet, but I feel like I finally found peace of mind for my future JavaScript development. At least for a month or so before some other guy comes to town.

Twitter REST API Data Mining on OpenShift (Part II)

Written on Dec 6, 2015 and marked as javascript, openshift, twitter | development

Last time I described the setup of my OpenShift Twitter crawler and left it running to download data. It’s been more than two months since I started and I’ve got an interesting amount of data. I also made a simple ETL process to load it into my local PostGIS database, which I’d like to cover in this post.

Extract data

Each day is written to a separate SQLite file with a name like tw_day_D_M_YYYY. Bash is used to gzip all the files before downloading them from OpenShift.

#!/bin/bash

ssh openshift << EOF
    cd app-root/data
    tar czf twitter.tar.gz *.db
EOF

scp openshift:/var/lib/openshift/55e487587628e1280b0000a9/app-root/data/twitter.tar.gz ./data
cd data &&
tar -xzf twitter.tar.gz &&
cd -

echo "Extract done"

Transform data

The transformation part operates on the downloaded files and merges them into one big CSV file. That’s pretty straightforward with a few sqlite3 flags and some sed and tail commands.

#!/bin/bash

rm -rf ./data/csv
mkdir ./data/csv

for db in ./data/*.db; do
    FILENAME=$(basename $db)
    DBNAME=${FILENAME%%.db}
    CSVNAME=$DBNAME.csv
    echo "$DBNAME to csv..."
    sqlite3 -header -csv $db "select * from $DBNAME;" > ./data/csv/$CSVNAME
done

cd ./data/csv
touch tweets.csv
echo $(sed -n 1p $(ls -d -1 *.csv | head -n 1)) > tweets.csv # get column names

for csv in tw_*.csv; do
    echo $csv
    tail -n +2 $csv >> tweets.csv # get all lines without the first one
done

Load data

In the last step, the data is loaded with the psql \copy command.

#!/bin/bash

export PG_USE_COPY=YES

DATABASE=mzi_dizertace
SCHEMA=dizertace
TABLE=tweets

psql $DATABASE << EOF
    DROP TABLE IF EXISTS $SCHEMA.$TABLE;
    CREATE UNLOGGED TABLE $SCHEMA.$TABLE (id text, author text, author_id text, tweet text, created_at text, lon float, lat float, lang text);
    \copy $SCHEMA.$TABLE FROM 'data/csv/tweets.csv' CSV HEADER DELIMITER ','
    ALTER TABLE $SCHEMA.$TABLE ADD COLUMN wkb_geometry geometry(POINT, 4326);
    UPDATE $SCHEMA.$TABLE SET wkb_geometry = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
    CREATE INDEX ${TABLE}_geom_idx ON $SCHEMA.$TABLE USING gist(wkb_geometry);
    COMMIT;
EOF

First statistics

Some interesting charts and numbers follow.

PostGIS Case Study: Vozejkmap Open Data (Part III)

Written on Nov 14, 2015 and marked as postgresql, postgis, leaflet, javascript | web maps

After a while I got back to my PostGIS open data case study. Last time I left it with clustering implemented, looking forward to incorporating Turf.js in the future. And the future is now. The code is still available on GitHub.

Subgroup clustering

Vozejkmap data is categorized by place type (banks, parking lots, pubs, …). One of the core features of a map showing such data should be an easy way to turn these categories on and off.

As far as I know, it’s not trivial to do this with the standard Leaflet library. Extending L.control.layers and implementing its addOverlay and removeOverlay methods on your own might be a way to add the needed behavior. Fortunately, there’s an easier option thanks to Leaflet.FeatureGroup.SubGroup, which handles exactly this use case and is really straightforward. See the code below.

cluster = L.markerClusterGroup({
    chunkedLoading: true,
    chunkInterval: 500
});

cluster.addTo(map);

...

for (var category in categories) {
    // just use L.featureGroup.subGroup instead of L.layerGroup or L.featureGroup
    overlays[my.Style.set(category).type] = L.featureGroup.subGroup(cluster, categories[category]);
}

mapkey = L.control.layers(null, overlays).addTo(map);

With this piece of code you get a map key with checkboxes for all the categories, yet they’re still kept in the single cluster on the map. Brilliant!

Using Turf.js for analysis

Turf is one of those libraries I get amazed with easily: I spend a week trying to find a use case, then finally put it aside with an “I’ll get back to it later”. I usually don’t. This time it’s different.

I use Turf to get the nearest neighbor for any marker on click. My first try ended up returning the clicked marker itself, as it was a member of the feature collection passed to the turf.nearest() method. After snooping around the docs I found the turf.remove() method, which can filter a GeoJSON feature collection based on a key-value pair.

Another handy function is turf.distance(), which gives you the distance between two points. The code below adds information about the nearest point and its distance to the popup.

// data is a geojson feature collection
json = L.geoJson(data, {
    onEachFeature: function(feature, layer) {
        layer.on("click", function(e) {
            var nearest = turf.nearest(layer.toGeoJSON(), turf.remove(data, "title", feature.properties.title)),
                distance = turf.distance(layer.toGeoJSON(), nearest, "kilometers").toPrecision(2),
                popup = L.popup({offset: [0, -35]}).setLatLng(e.latlng),
                content = L.Util.template(
                    "<h1>{title}</h1><p>{description}</p> \
                    <p>Nejbližší bod: {nearest} je {distance} km daleko.</p>", {
                    title: feature.properties.title,
                    description: feature.properties.description,
                    nearest: nearest.properties.title,
                    distance: distance
                });

            popup.setContent(content);
            popup.openOn(map);

            ...

From what I’ve tried so far, Turf seems to be incredibly fast and easy to use. I’ll try to find the nearest point for each of the categories; that could take Turf some time.

Update

Turf is blazing fast! I’ve implemented the nearest point for each of the categories and it gets done in the blink of an eye. Some screenshots below. Geolocation is implemented as well.

You can locate the point easily.

You can hide the infobox.

You can jump to any of the nearest places.

Twitter REST API Data Mining on OpenShift (Part I)

Written on Nov 6, 2015 and marked as javascript, openshift, twitter | development

More than a year ago I wrote about analyzing Twitter languages with the Streaming API. Back then I kept my laptop running for a week to download the data. Not a comfortable way to do it, especially if you decide to get more data. A one-year uptime doesn’t sound like anything you want to be part of. OpenShift by Red Hat seems to be an almost perfect replacement. Almost.

OpenShift setup

I started with a Node.js application running on one small gear. Once it’s running, you can easily git push the code to your OpenShift repo and log in via SSH. I quickly found out that simply copy-pasting my local solution wasn’t going to work, so I fixed it with some minor tweaks. That’s where the fun begins…

I based the downloader on Node.js a year ago, and I still don’t get how that piece of software works. Frankly, I don’t really care as long as it works.

Pitfalls

If your application doesn’t generate any traffic, OpenShift turns it off. It wakes up once someone visits it again. I had no idea about that and spent some time trying to prevent that behavior. Obviously, I could have scheduled a cron job on my laptop to ping it every now and then. Luckily, OpenShift can run cron jobs itself. All you need is to embed a cron cartridge into the running application (and install a bunch of ruby dependencies beforehand).

rhc cartridge add cron-1.4 -a app-name

Then create the .openshift/cron/{hourly,daily,weekly,monthly} folders in the git repository and put a script running a simple curl command into one of them.

curl http://social-zimmi.rhcloud.com > /dev/null

Another problem was just around the corner. Once in a while, the app stopped writing data to the database without saying a word. What helped was restarting it - the only automatic way to do so being a git push command. Sadly, I haven’t found a way to restart the app from within itself; it probably can’t be done.

When you git push, the gear stops, builds, deploys and restarts the app. By using hot deployment you can minimize the downtime. Just put a hot_deploy file into the .openshift/markers folder.

git commit --allow-empty -m "Restart gear" && git push

This solved the problem until I realized that every restart deleted all the data collected so far. If your data are to stay safe and sound, save them in process.env.OPENSHIFT_DATA_DIR (which is app-root/data).
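In the Node.js downloader that boils down to building the database path from the environment variable, roughly like this (a sketch, one SQLite file per day):

// keep the daily sqlite files in the persistent data directory
// so they survive gear restarts (sketch only)
var path = require('path');

var now = new Date();
var dataDir = process.env.OPENSHIFT_DATA_DIR || __dirname; // local fallback
var dbName = 'tw_day_' + now.getDate() + '_' + (now.getMonth() + 1) + '_' + now.getFullYear() + '.db';
var dbPath = path.join(dataDir, dbName);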

Anacron to the rescue

How do you push an empty commit once a day? With cron of course. Even better, anacron.

mkdir ~/.anacron
cd ~/.anacron
mkdir cron.daily cron.weekly cron.monthly spool etc

cat <<EOT > ~/.anacron/etc/anacrontab

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/$HOME/bin
HOME=$HOME
LOGNAME=$USER

1 5  daily-cron nice run-parts --report $HOME/.anacron/cron.daily
7 10 weekly-cron nice run-parts --report $HOME/.anacron/cron.weekly
@monthly 15 monthly-cron nice run-parts --report $HOME/.anacron/cron.monthly

EOT

cat <<EOT >> ~/.zprofile # I use zsh shell
rm -f $HOME/.anacron/anacron.log
/usr/sbin/anacron -t /home/zimmi/.anacron/etc/anacrontab -S /home/zimmi/.anacron/spool &> /home/zimmi/.anacron/anacron.log

EOT

Anacron is to a laptop what cron is to a 24/7 running server. It just runs automated jobs while the laptop is running. If it’s not running and a job should have been run, anacron runs it once the OS boots. Brilliant idea.

It runs the following code for me to keep the app writing data to the database.

#!/bin/bash

workdir='/home/zimmi/documents/zimmi/dizertace/social'
logfile=$workdir/restart-gear.log
date > $logfile

{ 
HOME=/home/zimmi
cd $workdir && \
git merge origin/master && \
git commit --allow-empty -m "Restart gear" && \
git push && \
echo "Success" ; 
} >> $logfile 2>&1

UPDATE: Spent a long time debugging the “Permission denied (publickey).”-like errors. What seems to help is:

  1. Use id_rsa instead of any other SSH key
  2. Put a new entry into the ~/.ssh/config file

I don’t know which one did the magic though.

I’ve been harvesting Twitter for a month with about 10-15K tweets a day (only interested in the Czech Republic). Roughly 1/6 to 1/5 of them come with latitude and longitude. More on this next time.