This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

DrSkippy/Gnip-Python-Historical-Utilities

                                 Python Library
                                      and
                             Command Line Utilities
                       for Gnip Historical PowerTrack API


The process for launching and retrieving data for a historical job
requires only a few steps:
  1) create job
  2) retrieve and review job quote
  3) accept or reject job
  4) download data files list
  5) download data

Utilities are included to assist with each step.

SETUP UTILITY
=============
First, set up your Gnip credentials.  There is a simple utility to create the local credential 
file named ".gnip".

$ ./setup_gnip_creds.py 
Username: [email protected]    
Password: 
Password again: 
Endpoint URL. Enter your Account Name (eg https://historical.gnip.com:443/accounts/<account name>/): shendrickson
Done creating file ./.gnip
Be sure to run:
chmod og-w .gnip
    
$ chmod og-w .gnip
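
To confirm that the chmod took effect, the file mode can be inspected from Python.
This is a minimal sketch and not part of these utilities; the function name
is_write_protected is hypothetical.

```python
import os
import stat


def is_write_protected(path):
    """Return True if neither group nor others may write to the file."""
    mode = os.stat(path).st_mode
    # chmod og-w clears exactly these two permission bits.
    return (mode & (stat.S_IWGRP | stat.S_IWOTH)) == 0
```

Calling is_write_protected(".gnip") after the chmod above should return True on a
POSIX file system.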

You will likely wish to run these utilities from other locations.  Be sure to export an
updated PYTHONPATH:

$ export PYTHONPATH=${PYTHONPATH}:path-to-gnip-python-historical-utilities

CREATE JOB
==========
Create a job description by editing the example JSON file provided ("bieber_job1.json").

You will end up with a single JSON record like this (see the Gnip documentation for option
details). The fromDate and toDate are in the format YYYYmmddHHMM:

{
    "dataFormat" : "activity-streams",
    "fromDate" : "201201010000",
    "publisher" : "twitter",
    "rules" : 
    [
        {
            "tag" : "bestRuleEver",
            "value" : "bieber"
        }
    ],
    "serviceUsername" : "DrSkippy27",
    "streamType" : "track",
    "title" : "BieberJob1",
    "toDate" : "201201010001"
}
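
If you prefer to generate the job description rather than edit it by hand, the
record above can be built in Python. This is a sketch only: the helper make_job and
its parameters are hypothetical, and the field names are copied from the example
record. The date format string "%Y%m%d%H%M" corresponds to YYYYmmddHHMM.

```python
import json
from datetime import datetime

GNIP_DATE_FORMAT = "%Y%m%d%H%M"  # YYYYmmddHHMM, as used by fromDate and toDate


def make_job(title, rule_tag, rule_value, start, end, service_username):
    """Build a job description dict shaped like the example record."""
    if end <= start:
        raise ValueError("toDate must be later than fromDate")
    return {
        "dataFormat": "activity-streams",
        "fromDate": start.strftime(GNIP_DATE_FORMAT),
        "publisher": "twitter",
        "rules": [{"tag": rule_tag, "value": rule_value}],
        "serviceUsername": service_username,
        "streamType": "track",
        "title": title,
        "toDate": end.strftime(GNIP_DATE_FORMAT),
    }


job = make_job(
    "BieberJob1",
    "bestRuleEver",
    "bieber",
    datetime(2012, 1, 1, 0, 0),
    datetime(2012, 1, 1, 0, 1),
    "DrSkippy27",
)
job_json = json.dumps(job, indent=4, sort_keys=True)
```

Writing job_json to a file gives something that can be passed to create_job.py with
the -f option.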

To create the job,

$ ./create_job.py -f./bieber_job1.json -t "Social Data Phenoms - Bieber"

The response is the JSON record returned by the server. It will describe the job (including
the JobID and the JobURL) or report any error messages.

To get help,

$ ./create_job.py -h
Usage: create_job.py [options]

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     Job url.
  -l, --prev-url        Use previous Job URL (only from this configuration
                        file.).
  -v, --verbose         Detailed output.
  -f FILENAME, --filename=FILENAME
                        File defining job (JSON)
  -t TITLE, --title=TITLE
                        Title of project, this title supercedes title in file.


LIST JOBS, get JOB QUOTES and get JOB STATUS:
=============================================
$ ./list_jobs.py -h
Usage: list_jobs.py [options]

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     Job url.
  -l, --prev-url        Use previous Job URL (only from this configuration
                        file.).
  -v, --verbose         Detailed output.
  -d SINCEDATESTRING, --since-date=SINCEDATESTRING
                        Only list jobs after date, (default
                        2012-01-01T00:00:00)

For example, I have three completed jobs, a Gnip job, a Bieber job and a SXSW
job, for which data is available.

$  ./list_jobs.py 
#########################
TITLE:     GNIP2012
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/eeh2vte64.json
#########################
TITLE:     Justin Bieber 2009
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/j5epx4e5c3.json
#########################
TITLE:     SXSW2010-2012
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json


To see detailed information or download data filelist, 
specify URL with -u or add -v flag (data_files.txt contains 
only URLs from last job in list)
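
If you want to post-process this listing in a script, the text output can be split
into one record per job. The sketch below assumes the output layout is exactly as
shown above (a line of # characters before each job, then "KEY: value" lines);
parse_job_list is a hypothetical helper, not part of these utilities.

```python
FIELDS = ("TITLE", "STATUS", "PROGRESS", "JOB URL")

SAMPLE_OUTPUT = """\
#########################
TITLE:     GNIP2012
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/eeh2vte64.json
#########################
TITLE:     Justin Bieber 2009
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/j5epx4e5c3.json
"""


def parse_job_list(text):
    """Return a list of dicts, one per job block in list_jobs.py output."""
    jobs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("####"):
            # A separator line starts a new job record.
            current = {}
            jobs.append(current)
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        if key in FIELDS:
            current[key] = value.strip()
    return jobs


jobs = parse_job_list(SAMPLE_OUTPUT)
finished_urls = [j["JOB URL"] for j in jobs if j.get("STATUS") == "finished"]
```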

DOWNLOAD URLS OF FILES CONTAINING DATA
======================================
To retrieve the locations of the data files this job created on S3, pass
the job URL with the -u flag (or, if you used -u for this job previously, just use -l; see help):

$  ./list_jobs.py -u  https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json
#########################
TITLE:     SXSW2010-2012
STATUS:    finished
PROGRESS:  100.0 %
JOB URL:   https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json

RESULT:
 Job completed at ........ 2012-09-01 04:35:23
 No. of Activities ....... -1
 No. of Files ............ -1
 Files size (MB) ......... -1
 Data URL ................ https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d/results.json
DATA SET:
 No. of URLs ............. 131,211
 File size (bytes)........ 2,151,308,466
 Files (URLs) ............ https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/00_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=hDSc0a%2BRQeG%2BknaSAWpzSUoM1F0%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/10_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=DOZlXKuMByv5uKgmw4QrCOpmEVw%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/20_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=X4SFTxwM2X9Y7qwgKCwG6fH8h7w%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/30_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=WVubKurX%2BAzYeZLX9UnBamSCrHg%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/40_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=OG9ygKlXNxFvJLlAEWi3hes5yyw%3D
...

Writing files to data_files.txt...

The URLs of the 131K files created on S3 by the job have been written to a file in
the local directory, ./data_files.txt.
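
The file URLs shown above are signed S3 URLs, and their Expires query parameter is a
Unix timestamp after which the link stops working. A small sketch for checking this
before starting a long download follows. It assumes data_files.txt holds one URL per
line; the helper names are hypothetical.

```python
import time
from urllib.parse import parse_qs, urlsplit


def expiry_time(url):
    """Return the Expires query parameter as an int timestamp, or None."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    return int(values[0]) if values else None


def is_expired(url, now=None):
    """True if the URL carries an Expires timestamp that has passed."""
    expires = expiry_time(url)
    if expires is None:
        return False  # no expiry information, so assume still usable
    if now is None:
        now = time.time()
    return now >= expires


def live_urls(path="data_files.txt", now=None):
    """Read URLs (assumed one per line) and keep those not yet expired."""
    with open(path) as handle:
        urls = [line.strip() for line in handle if line.strip()]
    return [u for u in urls if not is_expired(u, now)]
```

If live_urls() returns fewer URLs than the file contains, the file list is stale and
probably needs to be fetched again with list_jobs.py.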

DOWNLOAD DATA
=============

To retrieve this data use the utility,

$ ./get_data_files.bash
...

This will launch up to 8 simultaneous cURL connections to S3 to download the files
into a local ./data/year/month/day/hour... directory tree (see name_mangle.py for details).
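
For environments without bash or cURL, the same idea can be sketched in Python with
a thread pool capped at 8 workers. This is not the logic of get_data_files.bash or
name_mangle.py. The path mapping below is an assumption based on the URL layout
shown earlier (.../year/month/day/hour/NN_activities.json.gz), and all function
names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit
from urllib.request import urlretrieve

MAX_WORKERS = 8  # mirrors the limit of 8 simultaneous connections


def local_path(url, root="data"):
    """Map .../YYYY/MM/DD/HH/<file> in the URL path to root/YYYY/MM/DD/HH/<file>."""
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) < 5:
        raise ValueError("unexpected URL layout: %s" % url)
    year, month, day, hour, filename = parts[-5:]
    return os.path.join(root, year, month, day, hour, filename)


def download_one(url, root="data", fetch=urlretrieve):
    """Download one URL into the directory tree and return the local path."""
    target = local_path(url, root)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    fetch(url, target)  # fetch(url, filename); replaceable for testing
    return target


def download_all(urls, root="data", fetch=urlretrieve):
    """Download all URLs using at most MAX_WORKERS concurrent threads."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda u: download_one(u, root, fetch), urls))
```

A typical call would read the URLs from data_files.txt into a list and pass them to
download_all. The fetch parameter exists so the download step can be swapped out, for
example to add retries.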

ACCEPT/REJECT JOB
=================
After a job is quoted, you can accept or reject the job.  The job will not start until it is accepted.

$ ./accept_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json

or 

$ ./reject_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json
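
For reference, the request behind these two commands can be sketched as follows.
This rests on an assumption about the Gnip Historical PowerTrack API that is not
stated in this README: that a job is accepted or rejected by sending an HTTP PUT to
the job URL with a JSON body of the form {"status": "accept"} or
{"status": "reject"}. The function name is hypothetical, the request is only built
and not sent, and authentication is left out.

```python
import json
from urllib.request import Request


def build_status_request(job_url, status):
    """Build (but do not send) a PUT request that sets the job status.

    Assumption: the API expects {"status": "accept"} or {"status": "reject"}.
    """
    if status not in ("accept", "reject"):
        raise ValueError("status must be 'accept' or 'reject'")
    body = json.dumps({"status": status}).encode("utf-8")
    return Request(
        job_url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Check the Gnip documentation, or the source of the accept and reject utilities,
before relying on this request shape.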

The module gnip_historical.py provides additional functionality you can access programmatically.

==
Gnip-Python-Historical-Utilities by Scott Hendrickson is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.