Michal Zimmermann
Pieces of knowledge from the world of GIS.

Articles tagged with the twitter tag

Twitter REST API Data Mining on OpenShift (Part II)

Last time I described the setup of my OpenShift Twitter crawler and let it run and download data. It’s been more than two months since I started and I have collected an interesting amount of data. I also made a simple ETL process to load it into my local PostGIS database, which I’d like to cover in this post.

Extract data

Each day is written to a separate SQLite file with a name like tw_day_D_M_YYYY. Bash is used to pack all the files into a gzipped tarball before downloading them from OpenShift.

#!/bin/bash

ssh openshift << EOF
    cd app-root/data
    tar czf twitter.tar.gz *.db
EOF

scp openshift:/var/lib/openshift/55e487587628e1280b0000a9/app-root/data/twitter.tar.gz ./data
cd data &&
tar -xzf twitter.tar.gz &&
cd -

echo "Extract done"

Transform data

The transformation part operates on the downloaded files and merges them into one big CSV file. That’s pretty straightforward with a few sqlite3 flags and some sed and tail commands.

#!/bin/bash

rm -rf ./data/csv
mkdir ./data/csv

for db in ./data/*.db; do
    FILENAME=$(basename $db)
    DBNAME=${FILENAME%%.db}
    CSVNAME=$DBNAME.csv
    echo "$DBNAME to csv..."
    sqlite3 -header -csv $db "select * from $DBNAME;" > ./data/csv/$CSVNAME
done

cd ./data/csv
touch tweets.csv
echo $(sed -n 1p $(ls -d -1 *.csv | head -n 1)) > tweets.csv # get column names

for csv in tw_*.csv; do
    echo $csv
    tail -n +2 $csv >> tweets.csv # get all lines without the first one
done

Load data

In the last step, the data is loaded with the psql \copy command.

#!/bin/bash

export PG_USE_COPY=YES

DATABASE=mzi_dizertace
SCHEMA=dizertace
TABLE=tweets

psql $DATABASE << EOF
    DROP TABLE IF EXISTS $SCHEMA.$TABLE;
    CREATE UNLOGGED TABLE $SCHEMA.$TABLE (id text, author text, author_id text, tweet text, created_at text, lon float, lat float, lang text);
    \copy $SCHEMA.$TABLE FROM 'data/csv/tweets.csv' CSV HEADER DELIMITER ','
    ALTER TABLE $SCHEMA.$TABLE ADD COLUMN wkb_geometry geometry(POINT, 4326);
    UPDATE $SCHEMA.$TABLE SET wkb_geometry = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
    CREATE INDEX ${TABLE}_geom_idx ON $SCHEMA.$TABLE USING gist(wkb_geometry);
    COMMIT;
EOF

First statistics

Some interesting charts and numbers follow.

Twitter REST API Data Mining on OpenShift (Part I)

More than a year ago I wrote about analyzing Twitter languages with the Streaming API. Back then I kept my laptop running for a week to download the data. Not a comfortable way to do it, especially if you decide to get more data. One year of uptime doesn’t sound like anything you want to be part of. OpenShift by Red Hat seems to be an almost perfect replacement. Almost.

OpenShift setup

I started with a Node.js application running on one small gear. Once running, you can easily git push the code to your OpenShift repo and log in via SSH. I quickly found that simply copy-pasting my local solution wasn’t going to work, so I fixed it with some minor tweaks. That’s where the fun begins…

I based the downloader on Node.js a year ago and I still don’t quite get how that piece of software works. Frankly, I don’t really care as long as it works.

Pitfalls

If your application doesn’t generate any traffic, OpenShift turns it off and wakes it up only once someone visits it again. I had no idea about that and spent some time trying to prevent that behavior. Obviously, I could have scheduled a cron job on my laptop pinging the app every now and then. Luckily, OpenShift can run cron jobs itself. All you need to do is embed a cron cartridge into the running application (and install a bunch of Ruby dependencies beforehand).

rhc cartridge add cron-1.4 -a app-name

Then create the .openshift/cron/{hourly,daily,weekly,monthly} folders in the git repository and put a script running a simple curl command into one of them.

curl http://social-zimmi.rhcloud.com > /dev/null

Another problem was just around the corner. Once in a while, the app stopped writing data to the database without saying a word. What helped was restarting it - the only automatic way to do so being a git push command. Sadly, I haven’t found a way to restart the app from within itself; it probably can’t be done.

When you git push, the gear stops, builds, deploys and restarts the app. By using hot deployment you can minimize the downtime: just put a hot_deploy file into the .openshift/markers folder.

git commit --allow-empty -m "Restart gear" && git push

This solved the problem until I realized that every restart deleted all the data collected so far. If your data are to stay safe and sound, save them to the directory referenced by process.env.OPENSHIFT_DATA_DIR (which is app-root/data).

Anacron to the rescue

How do you push an empty commit once a day? With cron of course. Even better, anacron.

mkdir ~/.anacron
cd ~/.anacron
mkdir cron.daily cron.weekly cron.monthly spool etc

cat <<EOT > ~/.anacron/etc/anacrontab

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/$HOME/bin
HOME=$HOME
LOGNAME=$USER

# period(days) delay(minutes) job-identifier command
1 5  daily-cron nice run-parts --report $HOME/.anacron/cron.daily
7 10 weekly-cron nice run-parts --report $HOME/.anacron/cron.weekly
@monthly 15 monthly-cron nice run-parts --report $HOME/.anacron/cron.monthly

EOT

cat <<EOT >> ~/.zprofile # I use zsh shell
rm -f $HOME/.anacron/anacron.log
/usr/sbin/anacron -t /home/zimmi/.anacron/etc/anacrontab -S /home/zimmi/.anacron/spool &> /home/zimmi/.anacron/anacron.log

EOT

Anacron is to a laptop what cron is to a 24/7 running server: it runs scheduled jobs while the machine is on, and if the machine is off when a job is due, it runs the job once the OS boots again. Brilliant idea.

It runs the following code for me to keep the app writing data to the database.

#!/bin/bash

workdir='/home/zimmi/documents/zimmi/dizertace/social'
logfile=$workdir/restart-gear.log
date > $logfile

{
HOME=/home/zimmi
cd $workdir && \
git merge origin/master && \
git commit --allow-empty -m "Restart gear" && \
git push && \
echo "Success" ;
} >> $logfile 2>&1

UPDATE: I spent a long time debugging “Permission denied (publickey)”-like errors. What seems to help is:

  1. Use id_rsa instead of any other SSH key
  2. Put a new entry into the ~/.ssh/config file (see the example below)

I don’t know which one did the magic though.
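
For the record, the new ~/.ssh/config entry can look something like this (the user is the gear ID and the host name is the app’s domain; adjust both to your own app):

Host openshift
    HostName social-zimmi.rhcloud.com
    User 55e487587628e1280b0000a9
    IdentityFile ~/.ssh/id_rsa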

I’ve been harvesting Twitter for a month now, getting about 10-15K tweets a day (I am only interested in the Czech Republic). Roughly one sixth to one fifth of them are located with latitude and longitude. More on this next time.

Analyzing Twitter Languages With Streaming API

I am writing a diploma thesis focused on extracting spatial data from social networks. I have been working mainly with the Twitter API and the results I have got so far look really promising. This post was written as a reaction to many retweets I got when I shared one of my visualizations. It aims to make it clear how to connect to the Twitter Streaming API using node.js, store tweets in SQLite and retrieve them later for analysis and display with Leaflet.

If you have any further questions after reading this post, feel free to contact me via Twitter or e-mail. I must say right here that the code will be shared as well as the map, but there are still some bugs/features I would like to remove/add.

On a side note: I have been studying cartography and GIS for the last five years at Masaryk University in Brno, Czech Republic. I am mostly interested in ways computers can make data handling easier. I like to code in Python.

Using Twitter Streaming API

As you probably know, Twitter offers three different APIs:

  1. the Search API for querying recent tweets,
  2. the REST API for reading and writing Twitter data on demand,
  3. the Streaming API, which pushes tweets to you in real time over a persistent connection.

To sum it up: with the Streaming API you get a small sample of tweets in real time for as long as the connection to the server stays open.

What you need

To use any of the Twitter APIs, you need to authenticate yourself (or your app) against Twitter via the OAuth protocol. To be able to do so, you need a Twitter account, because only then can you create apps, obtain access tokens and get authenticated for API use.

And then, obviously, you need something to connect to the server with. I chose node.js because it seemed like a good tool for keeping a connection alive. I had also been interested in this technology for a couple of months but never really had a task to use it for.

The good thing about node.js is that it comes with lots of handy libraries. You get socket.io for streaming, ntwitter for using the Twitter API and sqlite3 for working with SQLite databases.

You also need something to store the data in. As mentioned, I picked SQLite for this task: it is lightweight and needs neither a server nor configuration to run, which is just what I was looking for. Seems we are set to go, right?

Filtering the data

I guess none of you is interested in obtaining random tweets from around the world, and neither was I. I live in the Czech Republic and that is the area I want to get tweets from. How?

It is fairly simple: you tell Twitter with the locations parameter of the statuses/filter resource, which specifies a set of bounding boxes to track.

To sum it up: you connect to the server and tell it you just want to get tweets from the area you specified with the locations parameter. The server understands and keeps you posted.
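
My downloader is written in node.js with ntwitter, but the idea is easy to sketch in a few lines of Python using tweepy (3.x) instead; the credentials are placeholders and the bounding box is only a rough approximation of the Czech Republic:

import tweepy

# Placeholder credentials of a Twitter app.
CONSUMER_KEY, CONSUMER_SECRET = 'xxx', 'xxx'
ACCESS_TOKEN, ACCESS_SECRET = 'xxx', 'xxx'

class BoundingBoxListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets with precise point coordinates (more on that below).
        if status.coordinates:
            lon, lat = status.coordinates['coordinates']
            print(lon, lat, status.lang)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# locations = SW lon, SW lat, NE lon, NE lat - roughly the Czech Republic.
stream = tweepy.Stream(auth, BoundingBoxListener())
stream.filter(locations=[12.0, 48.5, 18.9, 51.1])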

Is it that simple?

No. Twitter decides whether to send you the tweet or not according to the value of its coordinates field. It goes like this:

  1. If the coordinates field is not empty, it gets tested against the bounding box. If it matches, it is sent to the stream.
  2. If the coordinates field is empty, but the place field is not, it is the place field that gets checked. If it intersects the bounding box to any extent, it is sent to the stream.
  3. If both of the fields are empty, nothing is sent.

I decided to throw away the tweets with an empty coordinates field, because the accuracy of the value specified in the place field can generally be considered very low and insufficient for my purposes. You still need to account for the position inaccuracy of users’ devices, but that is not something we can deal with. Let us just assume that geotagged tweets are accurate.
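
In code, that decision boils down to checking the coordinates field of every incoming tweet before it is written to the database. A minimal sketch in Python (my real downloader does this in node.js; the column layout below is only an illustration):

def tweet_to_row(tweet):
    """Return a database row for a precisely geotagged tweet,
    or None when the tweet only matched via its place field."""
    if tweet.get('coordinates') is None:
        return None
    lon, lat = tweet['coordinates']['coordinates']
    return (tweet['id_str'], tweet['user']['screen_name'], tweet['user']['id_str'],
            tweet['text'], tweet['created_at'], lon, lat, tweet.get('lang'))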

Figure: Twitter seems not to be very accurate when matching tweets against bounding box.

Although, as you can see in the picture, they are not. Or they are, but Twitter is not good at telling so. Besides that, none of the countries in the world is shaped like a rectangle and we would need to clip the data anyway. That is where SQLite comes in, because I have been saving incoming tweets right into the database.

If you use any GUI manager (sqlitebrowser for Linux is just fine), you can easily export your data to a CSV file, load it into QGIS, clip it with the Natural Earth countries shapefile and save it as GeoJSON. It is then just a matter of a few lines of JavaScript to put the GeoJSON on a Leaflet map.
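
If you prefer to script that export, a rough Python sketch dumping the geotagged rows straight to GeoJSON could look like the following (the table and column names are just assumptions, and the clipping in QGIS still applies):

import json
import sqlite3

def sqlite_to_geojson(db_path, table, out_path):
    """Dump tweets with lon/lat columns into a GeoJSON FeatureCollection."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute('SELECT id, lang, lon, lat FROM {}'.format(table))
    features = [{
        'type': 'Feature',
        'geometry': {'type': 'Point', 'coordinates': [lon, lat]},
        'properties': {'id': id_, 'lang': lang}
    } for id_, lang, lon, lat in rows]
    with open(out_path, 'w') as out:
        json.dump({'type': 'FeatureCollection', 'features': features}, out)

sqlite_to_geojson('tweets.db', 'tweets', 'tweets.geojson')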

Displaying the data

Once a GeoJSON file is ready, it can be used for making an appealing viz to get a sense of what may be called “nationalities spatial patterns”. The lang field (stored in the database, remember?) of every tweet is used to colour the marker accordingly. Its value is a two-letter language code as specified by the ISO 639-1 standard.

However, as those codes are guessed by Twitter’s language algorithms, they are prone to error. There are actually three scenarios we might be facing:

  1. User tweets in the same language as used in the Twitter account.
  2. User tweets in his/her mother language, but has set different Twitter account language.
  3. User does not tweet in his/her mother language, but has it set as a Twitter account language.

We basically have to deal with 2) and 3), because 1) means we can be pretty sure what nationality the user is. Sadly though, I have not found an easy way to tell which of these two cases we came across, and thus which language setting should be prioritized. I made an arbitrary decision to prioritize the language the tweet was written in, based on the assumption that most users tweet in their mother language. No matter what you do, the data will still be biased by automatically generated tweets, especially the ones sent by Foursquare saying “I’m at @WhateverBarItIs (http://someurl.co)”. It works fine for distinctive languages like Russian and Arabic though.
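
Expressed in code, that prioritization is nothing more than a fallback (a tiny sketch assuming raw tweet dictionaries with the standard lang and user.lang fields):

def nationality_language(tweet):
    """Prefer the language the tweet was written in,
    fall back to the language set on the account."""
    tweet_lang = tweet.get('lang')
    if tweet_lang and tweet_lang != 'und':  # 'und' means Twitter could not detect it
        return tweet_lang
    return tweet.get('user', {}).get('lang')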

From Jan 2 to Jan 4 this year, 5,090 tweets were collected. Leaflet becomes a little sluggish displaying all of them without clustering turned on. The plan is to let the collection run until Jan 7 and then put all the tweets on the map. I guess that might be around 10,000 geotagged tweets by that time.

I am definitely willing to share the code and the final viz. Meanwhile, you can have a look at the screenshot below. I have already implemented a nationality switch (legend items are clickable) and I would like to add a day/night switch to see whether there are any differences in people’s behaviour.

Figure: Final map screenshot. A legend is used to turn nationalities on and off. You are looking at Prague by the way.

Obviously, most tweets were sent from the most populated places, e.g. Prague, Brno and Ostrava.

Going 3D With Space Time Cube

Seeing Anita’s space-time cube back in 2013 was a moment of woooow for me. I’ve been interested in unusual ways of displaying data ever since I started studying GIS, and this one was just great. How the hell did she make it?! I thought back then.

And I asked her; we had a little e-mail conversation and that was it. I got busy and had to postpone my attempts to create that viz until I dove into my diploma thesis. So…here you go.

Recipe

What you need is:

  1. Processing with the Python mode and the PeasyCam library,
  2. tweets stored as a JSON array, each with a timestamp property (see below),
  3. a basemap image of the area of interest,
  4. optionally avconv to turn captured frames into a video.

How to make it delicious

First things first, you need to add a timestamp property to the tweets you want to show, which is done with the following Python code. The created_at parameter is the datetime string of each tweet processed in a loop, e.g. Sat Jun 22 21:30:42 +0000 2013. As a result you get the number of seconds since 1 January 1970.

import locale
from time import mktime, strptime

def string_to_timestamp(created_at):
    """Return the UNIX timestamp parsed from the created_at string."""
    locale.setlocale(locale.LC_TIME, 'en_US.utf8')
    created_at = created_at.split(' ')
    created_at[1] = str(strptime(created_at[1], '%b').tm_mon)  # month name to its number
    timestamp = strptime(' '.join(created_at[i] for i in [1,2,3,5]), '%m %d %H:%M:%S %Y') # Month Day Time Year
    return mktime(timestamp)

As you probably guess, the timestamp property is the one we’re gonna display on the vertical axis. You definitely want the tweets to be sorted chronologically in your JSON file!
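
For example, a tiny helper (names are illustrative) that sorts the tweets by the computed timestamp and dumps them to the file loaded by the sketch:

import json

def save_sorted(tweets, path='tweets.json'):
    """Sort tweets chronologically and write them to the JSON file."""
    tweets.sort(key=lambda tweet: tweet['timestamp'])
    with open(path, 'w') as out:
        json.dump(tweets, out)

The Processing sketch itself (written in Python mode) follows: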

#!/usr/bin/python
# -*- coding: utf-8 -*-
#avconv -i frame-%04d.png -r 25 -b 65536k  video.mp4

from peasy import PeasyCam
import json

basemap = None
tweets = []
angle = 0

def setup():
    global basemap
    global tweets

    size(1010, 605, P3D)

    data = loadJSONArray('./tweets.json')
    count = data.size()

    last = data.getJSONObject(data.size()-1).getFloat('timestamp')
    first = data.getJSONObject(0).getFloat('timestamp')

    for i in range(0, count):
        lon = data.getJSONObject(i).getJSONObject('coordinates').getJSONArray('coordinates').getFloat(0)
        lat = data.getJSONObject(i).getJSONObject('coordinates').getJSONArray('coordinates').getFloat(1)
        time = data.getJSONObject(i).getFloat('timestamp')

        x = map(lon, -19.68624620368202116, 58.92453879754536672, 0, width)
        y = map(time, first, last, 0, 500)
        z = map(lat, 16.59971950210866964, 63.68835804244784526, 0, height)

        tweets.append({'x': x, 'y': y, 'z': z})

    basemap = loadImage('basemap.png')

    cam = PeasyCam(this,53,100,-25,700)
    cam.setMinimumDistance(1)
    cam.setMaximumDistance(1500)

def draw():
    global basemap
    global tweets
    global angle

    background(0)

    # Uncomment to rotate the cube
    """if angle < 360:
        rotateY(radians(angle))
        angle += 1
    else:
        angle = 360 - angle"""

    # box definition
    stroke(150,150,150)
    strokeWeight(.5)
    noFill()
    box(1010,500,605)


    # basemap definition
    translate(-505,250,-302.5)
    rotateX(HALF_PI)
    image(basemap,0,0)

    for i in range(0, len(tweets)):
        # thin white line along the time axis from the tweet down to the basemap
        strokeWeight(.5)
        stroke(255,255,255)
        line(tweets[i].get('x'), height-tweets[i].get('z'), tweets[i].get('y'), tweets[i].get('x'), height-tweets[i].get('z'), 0)

        # the tweet itself as a red point at its position in space and time
        strokeWeight(5)
        stroke(255,0,0)
        point(tweets[i].get('x'), height-tweets[i].get('z'), tweets[i].get('y'))

        # its white footprint on the basemap plane (time = 0)
        strokeWeight(2)
        stroke(255,255,255)
        point(tweets[i].get('x'), height-tweets[i].get('z'), 0)

        # connect consecutive tweets with a line whose colour is
        # interpolated from red to blue as time passes
        lrp = map(i, 0, len(tweets), 0, 1)
        frm = color(255,0,0)
        to = color(0,0,255)
        if i < len(tweets)-1:
            strokeWeight(1)
            stroke(lerpColor(frm,to,lrp))
            line(tweets[i].get('x'), height-tweets[i].get('z'), tweets[i].get('y'), tweets[i+1].get('x'), height-tweets[i+1].get('z'), tweets[i+1].get('y'))

    # Uncomment to capture the screens
    """if frameCount > 360:
        noLoop()
    else:
        saveFrame('screens/frame-####.png')"""

You should be most interested in these lines:

x = map(lon, -19.68624620368202116, 58.92453879754536672, 0, width)
y = map(time, first, last, 0, 500)
z = map(lat, 16.59971950210866964, 63.68835804244784526, 0, height)

They define how the coordinates inside the cube are computed. As you can see, x is the result of mapping the longitudinal extent of our area to the width of the cube; the same happens to z and latitude, and to y, except that here we map time, not coordinates.

The bounding box used in those computations is the bounding box of the basemap. An interesting thing about Processing and its 3D environment is how it defines the origin of the coordinate system: it sits in the upper left corner of the display window, with the y axis pointing downwards, which might be slightly different from what you would expect. That’s what you need to be careful about.

How does it look