Finding great stories in Internet Data
Mining Social Media will show you the kind of data that can be mined on the social web, the insights that can be gained from it, and the limitations of its scope. You’ll learn how to find out what kind of data is available on popular social media juggernauts like Facebook and Twitter and how to recognize the value of what is measured.
Practical exercises interweave with conceptual lessons that cover ways to use Python to extract data from social media sources, analyze it, and make sense of it visually. You’ll learn how to write a script that taps into an API, how to scrape data from websites, and even how to analyze data from an automated Twitter bot.
This repository holds code and data related to the exercises detailed in the book. The book is set to be published at the end of 2019.
1. Make sure you have the OS-level dependencies installed
- Python 3
- more to come
2. Clone this repo
git clone https://github.com/lamthuyvo/social-media-data-book.git
cd social-media-data-book
3. Install the required Python libraries
Optional but recommended: make a virtual environment using venv.
[more details about the computer setup to come]
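The setup steps above can also be checked from Python itself. The sketch below is only an illustration, not code from the book: the helper names `is_python3` and `make_virtual_env` are hypothetical, and the environment folder name is whatever you pass in. It tests whether the interpreter is Python 3 and creates a virtual environment with the standard-library `venv` module.

```python
import sys
import venv
from pathlib import Path


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3 or newer."""
    return version_info[0] >= 3


def make_virtual_env(target_dir, with_pip=True):
    """Create a virtual environment in target_dir (hypothetical helper).

    Uses the standard-library venv module; with_pip=True also installs
    pip into the new environment so libraries can be added afterwards.
    """
    target = Path(target_dir)
    venv.create(target, with_pip=with_pip)
    return target
```

Calling `make_virtual_env("env")` from the repository root would create an `env` folder there (the name `env` is an assumption; any name works). The more common route is to run `python3 -m venv env` in a terminal and then activate the environment before installing libraries.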
While most code files are hosted in this repository, some data files were too large to be included here. Below are instructions on how to access them:
- askscience_submissions.csv: This file is required for the data exercises in chapters 8 and 9. If you're working with a downloaded version of this repository, you will need to first create a data folder inside the chapter08_09 folder, then download the data file askscience_submissions.csv and, lastly, place the data file inside the data folder. You can download the file here. The data was provided by data archivist Jason Baumgartner and represents a small sliver of the data he makes available to academics and researchers at Pushshift.io.
- iranian_tweets_csv_hashed.csv: This file is required for the data exercises in chapter 10. If you're working with a downloaded version of this repository, you will need to first create a data folder inside the chapter10 folder, then download the data file iranian_tweets_csv_hashed.csv and, lastly, place it inside the data folder. You can download the file here or directly from Twitter. You can find more information about this data on Twitter's elections integrity page.
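The manual steps above (create a data folder inside the chapter folder, then place the downloaded file in it) can be sketched in Python. This is a minimal illustration, not code from the book: the helper names `ensure_data_dir` and `missing_data_files` and the `EXPECTED_FILES` mapping are hypothetical. The script does not download anything, so the files still have to be fetched from the sources described above.

```python
from pathlib import Path

# Chapter folders and the data file each one expects, as listed above.
EXPECTED_FILES = {
    "chapter08_09": "askscience_submissions.csv",
    "chapter10": "iranian_tweets_csv_hashed.csv",
}


def ensure_data_dir(repo_root, chapter_folder):
    """Create <repo_root>/<chapter_folder>/data if needed and return its path."""
    data_dir = Path(repo_root) / chapter_folder / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


def missing_data_files(repo_root, expected=EXPECTED_FILES):
    """Return the paths of expected data files that are not in place yet."""
    missing = []
    for chapter_folder, filename in expected.items():
        data_file = ensure_data_dir(repo_root, chapter_folder) / filename
        if not data_file.is_file():
            missing.append(data_file)
    return missing
```

Run from the repository root, `missing_data_files(".")` creates both data folders and returns the files that still need to be downloaded and moved into them.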
[More to come]
Please feel free to contact me on GitHub or via [email protected]