Mobile phone data analysis

@ NetMob 2015

Luc Rocher / rocher.lc / lrocher@mit.edu

Our schedule!

  • Navigate on the virtual machine
  • Load the datasets with Python and Bandicoot
  • Perform outstanding analyses

ACT I — Unix

Goal: connect to the NetMob virtual machines

PuTTY main screen

PuTTY on Windows

SSH Syntax

ssh {account}@{server}

Example

ssh datathon2@195.25.101.106
datathon2@vc1-vm1:~$

What now? The datasets are in /data.

  • How to navigate on the server?
  • How to look at files, move, or copy them?
  • Best practices to work remotely

How to navigate on the server?

$ cd /data

$ ls
ContextData  SET1  SET2  SET3

$ ls /data/ContextData
bandicoot_v01.pdf  D4D_Senegal.pdf  senegal_arr_centroids.csv
SENEGAL_ARR_V2.csv  Shapefile_Senegal_V2.zip  SITE_ARR_LONLAT.CSV

I'm lost!

$ pwd
/data
$ ls -lh /data/ContextData
total 1.4M
-rw-r--r-- 1 root root 150K Mar 23 10:55 bandicoot_v01.pdf
-rw-r--r-- 1 root root 200K Mar 23 10:55 D4D_Senegal.pdf
-rw-r--r-- 1 root root 9.9K Mar 23 10:55 senegal_arr_centroids.csv
-rw-r--r-- 1 root root 3.6K Mar 23 10:55 SENEGAL_ARR_V2.csv
-rw-r--r-- 1 root root 994K Mar 23 10:55 Shapefile_Senegal_V2.zip
-rw-r--r-- 1 root root  46K Mar 23 10:55 SITE_ARR_LONLAT.CSV

Move

mv {old file} {new file}

Copy

cp {old file} {new file}

/!\ Delete

rm {file}
rm -r {directory}
rmdir {empty directory}
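
The same operations can also be scripted from Python, which helps once an analysis is automated. A minimal sketch using the standard shutil and pathlib modules; it works in a throwaway temporary directory, and the file names are made up for illustration, so nothing under /data is touched.

```python
import shutil
import tempfile
from pathlib import Path

# Work in a throwaway directory so that no real file is at risk.
workdir = Path(tempfile.mkdtemp())

original = workdir / "notes.txt"  # hypothetical file name
original.write_text("site_id,arr_id\n")

# cp {old file} {new file}
backup = workdir / "notes_backup.txt"
shutil.copy(str(original), str(backup))

# mv {old file} {new file}
renamed = workdir / "notes_v2.txt"
shutil.move(str(original), str(renamed))

# rm {file}
backup.unlink()

remaining = sorted(p.name for p in workdir.iterdir())
print(remaining)  # ['notes_v2.txt']

# rm -r {directory}: removes the directory and everything inside it.
shutil.rmtree(str(workdir))
```

As with rm -r, shutil.rmtree does not ask for confirmation.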

Quick look at SITE_ARR_LONLAT.CSV


$ head SITE_ARR_LONLAT.CSV
site_id,arr_id,lon,lat
1,2,-17.525142,14.746832
2,2,-17.524360,14.747434
3,2,-17.522576,14.745198
4,2,-17.516398,14.746730
5,2,-17.512870,14.740658
6,2,-17.512103,14.748411
7,2,-17.510958,14.737403
8,2,-17.508395,14.730968
9,2,-17.507036,14.740671
                    

Use tail for the end of the file
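
Once Python is available (see ACT II), the same file can be read with the standard csv module. A minimal sketch, assuming the column names shown in the header line above; the first rows are pasted inline from the head output so that the example runs without access to /data.

```python
import csv
import io

# First rows of SITE_ARR_LONLAT.CSV, copied from the `head` output.
sample = """site_id,arr_id,lon,lat
1,2,-17.525142,14.746832
2,2,-17.524360,14.747434
3,2,-17.522576,14.745198
"""

# On the server, io.StringIO(sample) would be replaced by
# open("/data/ContextData/SITE_ARR_LONLAT.CSV").
sites = {}
for row in csv.DictReader(io.StringIO(sample)):
    # Map each site id to its (longitude, latitude) pair.
    sites[int(row["site_id"])] = (float(row["lon"]), float(row["lat"]))

print(len(sites))  # 3
print(sites[1])    # (-17.525142, 14.746832)
```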

Less is more


less SITE_ARR_LONLAT.CSV
                    

less is a terminal pager: it lets you move forward and backward through (large) files.

Use the arrow keys to go up and down; [PageUp], [PageDown], or [space] to go faster; q to quit.

SET1, SET2, and SET3 files

Large compressed files!


gunzip file.gz
                    

Warning: large files. :) It's better to keep the files compressed and load them in memory.
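
Python can read a .gz file directly and decompress it on the fly, so the uncompressed copy never has to exist on disk. A minimal sketch with the standard gzip module; it first writes a tiny compressed file (a made-up name in a temporary directory) so that it runs anywhere, then reads back only the first lines, much like head.

```python
import gzip
import itertools
import os
import tempfile

# Create a small compressed file to stand in for a real dataset.
path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")  # hypothetical file
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write("2013-01-01 00,1,%d,1\n" % i)

# Read it back line by line; "rt" means text mode.
# The file is decompressed as it is read, and we stop after three lines.
with gzip.open(path, "rt") as f:
    first_lines = [line.rstrip("\n") for line in itertools.islice(f, 3)]

print(first_lines)
```

Looping over the file object instead of calling read() keeps only one line at a time in memory, which matters for the larger SET files.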

Want to know more? There is a great tutorial at:

ee.surrey.ac.uk/Teaching/Unix/

Best practices

  • Monitor resources (disk, memory, CPU)
  • Save the current session

Monitor resources

Run htop to see an interactive overview of the resources in use

htop
“htop is an interactive text-mode process viewer for Linux. It aims to be a better 'top'.”
tmux

tmux: terminal multiplexer (sessions keep running on the server after you disconnect)

Launch a session

tmux
to start a new session

tmux ls
to list existing sessions

tmux a
to attach to an existing session

From the session

Ctrl + b, then c
to create a new window (a ‘tab’)

Ctrl + b, then 5
to go to window 5 (etc.)

Ctrl + b, then d
to detach (the session keeps running)

byobu

byobu: a friendlier front end to terminal multiplexers such as tmux

Questions for this first part?

ACT II — Python

  • How to Python!
  • Load the datasets and visualize them (using pandas and matplotlib)

Great reference at diveintopython3.net

Python interpreter


$ python
Python 2.7.9 (default, Jan  7 2015, 11:49:12)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
                    

IPython interpreter


$ ipython
Python 3.4.2 (default, Jan  7 2015, 11:54:58)
Type "copyright", "credits" or "license" for more information.

IPython 3.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:
                    

Why is IPython better?

  • Autocompletion with Tab: commands, files…
  • Help for unknown commands:

In [1]: len?
Type:        builtin_function_or_method
String form: <built-in function len>
Namespace:   Python builtin
Docstring:
len(object)

Return the number of items of a sequence or collection.
                    

Python is easy!


x = "Ha-ha!"  # a variable can hold a string...
x = 3         # ...and later a number: no type declarations
y = x + 5
print(y)  # Prints out 8
                    

Indentation with four spaces


if x < 5 or (x > 10 and x < 20):
    print("The value is OK.")

for i in [1, 2, 3, 4, 5]:
    print("This is iteration number", i)

# Print out the values from 0 to 99 inclusive.
for value in range(100):
    print(value)
                    

Conditional statements


if x < 5:
    print("x is very small")
elif x == 6:
    print("Yes, x is 6!")
else:
    print("x is too large, I can't handle it")
                        

Working with lists


>>> kitchen = ["spam", "spam", "spam", "eggs", "tomato", "spam"]
>>> print(len(kitchen))
6
>>> print(kitchen[0], kitchen[-1])
spam spam
>>> kitchen[1:3]
['spam', 'spam']
                    

For loops


for item in kitchen:
    if item != "spam":
        print(item)
# Prints out:
# eggs
# tomato
                    

for x in range(5):
    print(x*x)
# Prints out: 0, 1, 4, 9, 16
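
The two loops above can also be written as list comprehensions, which build a new list in a single line. A small sketch on the same kitchen list, written with the print() function of Python 3; comprehensions and enumerate() are additions that the slides themselves do not cover.

```python
kitchen = ["spam", "spam", "spam", "eggs", "tomato", "spam"]

# Same filter as the loop above, written as a list comprehension.
not_spam = [item for item in kitchen if item != "spam"]
print(not_spam)  # ['eggs', 'tomato']

# Same squares as the range(5) loop, collected in a list.
squares = [x * x for x in range(5)]
print(squares)  # [0, 1, 4, 9, 16]

# enumerate() gives the position together with the item.
for position, item in enumerate(kitchen):
    if item != "spam":
        print(position, item)
# Prints out:
# 3 eggs
# 4 tomato
```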
                    

Working with dictionaries


>>> call_duration = {"Alice": 45, "Boris": 100,
                     "Clarice": 10, "Doris": 20}

>>> call_duration["Alice"]
45

>>> call_duration["Robin"] = 5
>>> call_duration["Robin"]
5

>>> sorted(call_duration.keys())
['Alice', 'Boris', 'Clarice', 'Doris', 'Robin']

>>> max(call_duration.values())
100
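
A few more dictionary idioms that are useful when summarizing call data. A minimal sketch, starting again from the original four entries (without Robin) and written with the print() function of Python 3; the threshold of 30 is arbitrary.

```python
call_duration = {"Alice": 45, "Boris": 100, "Clarice": 10, "Doris": 20}

# Loop over (key, value) pairs.
for name, duration in call_duration.items():
    if duration > 30:
        print(name, duration)
# Prints out:
# Alice 45
# Boris 100

# Who has the longest call? max() compares the keys by their value.
longest = max(call_duration, key=call_duration.get)
print(longest)  # Boris

# .get() returns a default instead of failing on a missing key.
print(call_duration.get("Robin", 0))  # 0

# Total and average duration.
total = sum(call_duration.values())
print(total, total / len(call_duration))  # 175 43.75
```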
                    

Functions


def square(x):
    return x*x

print(square(6))  # Prints out 36
                    

Going deeper!

  1. NumPy: fast computation on arrays and matrices
  2. SciPy: scientific computation (optimization, linear algebra, integration, signal processing…)
  3. pandas: data manipulation and analysis
  4. networkx: manipulation and study of graphs
  5. matplotlib: 2D and 3D plotting

Second example: SET1S_01.CSV.gz


# timestamp, outgoing id, incoming id, number of texts
2013-01-01 00,1,61,1
2013-01-01 00,1,340,1
2013-01-01 00,1,419,1
2013-01-01 00,1,420,1
2013-01-01 00,1,447,2
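
Each line can be turned into typed Python values before any analysis. A minimal sketch with the standard csv and datetime modules, assuming the four columns named in the comment above and the "YYYY-MM-DD HH" timestamp format seen in the sample; the rows are pasted inline so that the example runs without the dataset.

```python
import csv
import io
from datetime import datetime

# The five sample rows shown above.
sample = """2013-01-01 00,1,61,1
2013-01-01 00,1,340,1
2013-01-01 00,1,419,1
2013-01-01 00,1,420,1
2013-01-01 00,1,447,2
"""

records = []
for timestamp, outgoing, incoming, texts in csv.reader(io.StringIO(sample)):
    records.append((
        datetime.strptime(timestamp, "%Y-%m-%d %H"),  # date and hour
        int(outgoing),
        int(incoming),
        int(texts),
    ))

print(len(records))                # 5
print(records[-1][0].hour)         # 0
print(sum(r[3] for r in records))  # 6
```

For the real file, io.StringIO(sample) would be replaced by gzip.open(path, "rt"), where path points to SET1S_01.CSV.gz.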
                    

What we would like to do:

  • Distributions of number_of_texts per hour
  • Network on a specific date
  • Most central cell tower
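
The first and third items can be sketched with the standard library alone. This assumes the ids identify cell towers and that the records are already parsed into (hour, outgoing id, incoming id, number of texts) tuples. The first five rows come from the sample above; the last three are made up so that more than one hour and one tower appear. "Most central" is taken here to mean the tower linked to the most distinct other towers (degree centrality), which is only one possible choice; networkx provides other centrality measures.

```python
from collections import Counter, defaultdict

records = [
    # (hour of day, outgoing id, incoming id, number of texts)
    (0, 1, 61, 1),
    (0, 1, 340, 1),
    (0, 1, 419, 1),
    (0, 1, 420, 1),
    (0, 1, 447, 2),
    # Hypothetical rows, for illustration only:
    (1, 61, 1, 3),
    (1, 61, 340, 1),
    (2, 340, 419, 4),
]

# Total number of texts per hour.
texts_per_hour = Counter()
for hour, outgoing, incoming, texts in records:
    texts_per_hour[hour] += texts
print(sorted(texts_per_hour.items()))  # [(0, 6), (1, 4), (2, 4)]

# Undirected network: which towers exchanged at least one text?
neighbors = defaultdict(set)
for hour, outgoing, incoming, texts in records:
    neighbors[outgoing].add(incoming)
    neighbors[incoming].add(outgoing)

# Degree centrality: the tower with the most distinct neighbors.
most_central = max(neighbors, key=lambda tower: len(neighbors[tower]))
print(most_central, len(neighbors[most_central]))  # 1 5
```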

ACT III — Behavioral indicators

See more at bandicoot.mit.edu

{notebook for bandicoot}