Python Library and Command Line Utilities for Gnip Historical PowerTrack API
=============================================================================

The process for launching and retrieving data for a historical job requires only a few steps:

1) create job
2) retrieve and review job quote
3) accept or reject job
4) download data files list
5) download data

Utilities are included to assist with each step.

SETUP UTILITY
=============

First, set up your Gnip credentials. There is a simple utility to create the local credential file named ".gnip".

    $ ./setup_gnip_creds.py
    Username: [email protected]
    Password:
    Password again:
    Endpoint URL. Enter your Account Name (eg https://historical.gnip.com:443/accounts/<account name>/): shendrickson
    Done creating file ./.gnip
    Be sure to run: chmod og-w .gnip

    $ chmod og-w .gnip

You will likely wish to run these utilities from other locations. Be sure to export an updated PYTHONPATH:

    $ export PYTHONPATH=${PYTHONPATH}:path-to-gnip-python-historical-utilities

CREATE JOB
==========

Create a job description by editing the example JSON file provided ("bieber_job1.json"). You will end up with a single JSON record like this (see the Gnip documentation for option details). The fromDate and toDate are in the format YYYYmmddHHMM:

    {
        "dataFormat" : "activity-streams",
        "fromDate" : "201201010000",
        "publisher" : "twitter",
        "rules" : [
            {
                "tag" : "bestRuleEver",
                "value" : "bieber"
            }
        ],
        "serviceUsername" : "DrSkippy27",
        "streamType" : "track",
        "title" : "BieberJob1",
        "toDate" : "201201010001"
    }

To create the job,

    $ ./create_job.py -f ./bieber_job1.json -t "Social Data Phenoms - Bieber"

The response is the JSON record returned by the server. It describes the job (including the JobID and the Job URL) or reports any error messages.

To get help,

    $ ./create_job.py -h
    Usage: create_job.py [options]

    Options:
      -h, --help            show this help message and exit
      -u URL, --url=URL     Job url.
      -l, --prev-url        Use previous Job URL (only from this configuration file.).
      -v, --verbose         Detailed output.
      -f FILENAME, --filename=FILENAME
                            File defining job (JSON)
      -t TITLE, --title=TITLE
                            Title of project, this title supercedes title in file.
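If you are generating many job descriptions (for example, sweeping over date ranges or rule sets), the same record can be written from a short script instead of editing the JSON by hand. The sketch below is illustrative only and is not part of the repository; it uses the standard json module (Python 3) to write a file equivalent to bieber_job1.json:

    # make_job_description.py -- illustrative sketch, not part of the repository.
    # Writes a job-description file like bieber_job1.json for use with create_job.py.
    # Field names follow the example above; see the Gnip documentation for all options.
    import json

    job = {
        "publisher": "twitter",
        "streamType": "track",
        "dataFormat": "activity-streams",
        "serviceUsername": "DrSkippy27",
        "title": "BieberJob1",          # can be overridden with create_job.py -t
        "fromDate": "201201010000",     # YYYYmmddHHMM
        "toDate": "201201010001",       # YYYYmmddHHMM
        "rules": [
            {"tag": "bestRuleEver", "value": "bieber"},
        ],
    }

    with open("bieber_job1.json", "w") as f:
        json.dump(job, f, indent=4, sort_keys=True)

The resulting file is then passed to create_job.py with -f exactly as shown above.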
LIST JOBS, GET JOB QUOTES AND GET JOB STATUS
============================================

    $ ./list_jobs.py -h
    Usage: list_jobs.py [options]

    Options:
      -h, --help            show this help message and exit
      -u URL, --url=URL     Job url.
      -l, --prev-url        Use previous Job URL (only from this configuration file.).
      -v, --verbose         Detailed output.
      -d SINCEDATESTRING, --since-date=SINCEDATESTRING
                            Only list jobs after date, (default 2012-01-01T00:00:00)

For example, I have three completed jobs: a Gnip job, a Bieber job and a SXSW job for which data is available.

    $ ./list_jobs.py
    #########################
    TITLE: GNIP2012
    STATUS: finished
    PROGRESS: 100.0 %
    JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/eeh2vte64.json
    #########################
    TITLE: Justin Bieber 2009
    STATUS: finished
    PROGRESS: 100.0 %
    JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/j5epx4e5c3.json
    #########################
    TITLE: SXSW2010-2012
    STATUS: finished
    PROGRESS: 100.0 %
    JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json

To see detailed information or to download the data file list, specify the job URL with -u or add the -v flag (data_files.txt contains only URLs from the last job in the list).

DOWNLOAD URLS OF FILES CONTAINING DATA
======================================

To retrieve the file locations for the data files this job created on S3, pass the job URL with the -u flag (or, if you used -u for this job previously, just use -l; see help),

    $ ./list_jobs.py -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json
    #########################
    TITLE: SXSW2010-2012
    STATUS: finished
    PROGRESS: 100.0 %
    JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json
    RESULT:
    Job completed at ........ 2012-09-01 04:35:23
    No. of Activities ....... -1
    No. of Files ............ -1
    Files size (MB) ......... -1
    Data URL ................ https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d/results.json
    DATA SET:
    No. of URLs ............. 131,211
    File size (bytes)........ 2,151,308,466
    Files (URLs) ............
    https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/00_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=hDSc0a%2BRQeG%2BknaSAWpzSUoM1F0%3D
    https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/10_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=DOZlXKuMByv5uKgmw4QrCOpmEVw%3D
    https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/20_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=X4SFTxwM2X9Y7qwgKCwG6fH8h7w%3D
    https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/30_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=WVubKurX%2BAzYeZLX9UnBamSCrHg%3D
    https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/40_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=OG9ygKlXNxFvJLlAEWi3hes5yyw%3D
    ...
    Writing files to data_files.txt...

Filenames for the 131K files created on S3 by the job have been downloaded to a file in the local directory, ./data_files.txt.

DOWNLOAD DATA
=============

To retrieve this data, use the utility

    $ ./get_data_files.bash
    ...

This will launch up to 8 simultaneous cURL connections to S3 to download the files into a local ./data/year/month/day/hour... directory tree (see name_mangle.py for details).
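get_data_files.bash and name_mangle.py are the supported download path. If you would rather drive the transfer from Python, the rough sketch below (illustrative only, Python 3 standard library) reads data_files.txt and fetches each URL with up to 8 concurrent connections, mirroring the bash utility's limit. It assumes the year/month/day/hour components sit immediately before the *_activities.json.gz filename in the URL path, which matches the example URLs above; name_mangle.py remains the authoritative mapping.

    # download_data_files.py -- illustrative sketch, not a replacement for
    # get_data_files.bash / name_mangle.py.
    # Reads data_files.txt and downloads each S3 URL into ./data/year/month/day/hour/.
    import os
    import urllib.parse
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def local_path(url):
        # Assumes the URL path ends .../YYYY/MM/DD/HH/MM_activities.json.gz,
        # as in the example URLs above.
        parts = urllib.parse.urlparse(url).path.split("/")
        year, month, day, hour, fname = parts[-5:]
        return os.path.join("data", year, month, day, hour, fname)

    def fetch(url):
        dest = local_path(url)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        urllib.request.urlretrieve(url, dest)   # query string carries the S3 signature
        return dest

    if __name__ == "__main__":
        with open("data_files.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        # Mirror the bash utility's limit of 8 simultaneous connections.
        with ThreadPoolExecutor(max_workers=8) as pool:
            for path in pool.map(fetch, urls):
                print(path)

Note that the S3 URLs are pre-signed (see the Expires parameter), so they should be used before they expire; regenerate data_files.txt with list_jobs.py -u if needed.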
ACCEPT/REJECT JOB
=================

After a job is quoted, you can accept or reject the job. The job will not start until it is accepted.

    $ ./accept_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json

or

    $ ./reject_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json

The module gnip_historical.py provides additional functionality that you can access programmatically.
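For reference when working programmatically, the sketch below is a guess at the HTTP request accept_job issues under the hood: a PUT of {"status": "accept"} (or "reject") to the job URL with HTTP Basic authentication. This follows Gnip's Historical PowerTrack conventions but is an assumption; confirm against the Gnip documentation and the accept_job source. The credentials shown are placeholders, and the bundled utilities read theirs from ./.gnip instead.

    # accept_job_sketch.py -- illustrative only; not part of the repository.
    # The verb (PUT) and payload ({"status": "accept"}) are assumptions; verify
    # against the Gnip Historical PowerTrack documentation before relying on this.
    import base64
    import json
    import urllib.request

    JOB_URL = ("https://historical.gnip.com:443/accounts/shendrickson/publishers/"
               "twitter/historicals/track/jobs/c9pe0day6h.json")
    USERNAME = "you@example.com"   # placeholder credentials
    PASSWORD = "secret"

    body = json.dumps({"status": "accept"}).encode("utf-8")   # use "reject" to decline
    auth = base64.b64encode("{}:{}".format(USERNAME, PASSWORD).encode()).decode()

    req = urllib.request.Request(JOB_URL, data=body, method="PUT")
    req.add_header("Authorization", "Basic " + auth)
    req.add_header("Content-Type", "application/json")

    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))   # server returns the updated job record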
LICENSE
=======

Gnip-Python-Historical-Utilities by Scott Hendrickson is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.