This repo contains code used to obtain data for our paper.
The core program is the iterative scraper, a Selenium-based tool for efficient text data collection from the web. It supports filtered discovery of connected webpages in the web graph and is resilient to dynamic page content and some anti-scraping methods. User-provided seed links serve as access points to the web, and user-provided search terms keep irrelevant data out of the dataset. Users can configure scraper variables to adjust scrape quality, speed, and so on.
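At a high level, filtered discovery works along the lines of the sketch below. This is a minimal illustration only, not the actual implementation: the function and variable names are ours, and the real scraper adds waits, retries, and its anti-scraping countermeasures.

# Minimal sketch of filtered link discovery (illustrative only; the real
# scraper in main_iterative.py is more configurable and robust).
from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By

def discover(seed_url, search_terms, max_pages=50):
    driver = webdriver.Chrome()
    queue, seen, collected = deque([seed_url]), {seed_url}, {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        driver.get(url)
        body_text = driver.find_element(By.TAG_NAME, "body").text.lower()
        # Keep the page only if it mentions at least one search term.
        if any(term.lower() in body_text for term in search_terms):
            collected[url] = driver.page_source
        # Enqueue same-site links for further discovery.
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if href and href.startswith(seed_url) and href not in seen:
                seen.add(href)
                queue.append(href)
    driver.quit()
    return collected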
We have also written extractor and refiner scripts for extracting text from raw HTMLs and refining that text for coding and downstream analysis. These scripts are also customizable and can be run in sequence with the scraper using the provided bash script.
Also included are supporting scripts for filling in missing data points, both manually and automatically. This is typically needed when pages are missed due to rate limiting and other anti-scraping measures.
The scraper requires as input:
- A structured xlsx file of seed links to scrape, with one sheet per search area and one site per row (see the loading sketch after this list).
- A csv file of search terms, used to determine whether discovered pages are relevant and should be included in the dataset.
- User-specified options
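For reference, inputs laid out as described above can be loaded roughly as follows. This is a minimal sketch: the file names and the "site"/"url" column names are assumptions, not requirements of main_iterative.py.

# Sketch of loading the inputs (file and column names are illustrative).
import pandas as pd

seed_sheets = pd.read_excel("links.xlsx", sheet_name=None)  # dict: search area -> DataFrame
for area, sites in seed_sheets.items():
    for _, row in sites.iterrows():
        print(area, row["site"], row["url"])

search_terms = pd.read_csv("search_terms.csv", header=None)[0].tolist()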
To run the main iterative script, run:
$ python main_iterative.py [links] [search_terms] [OPTIONS]
For more details about main_iterative.py positional arguments and options, run:
$ python main_iterative.py --help
The main_iterative.py script outputs structured json files containing raw HTML data from scraped pages. In the specified data directory, the script creates an all_htmls folder.
The json files are named by search area and site as follows:
AREA_SITE_DATETIME.json
For example:
copyright_ebay_06_15_23_14_56.json
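The DATETIME portion appears to follow an MM_DD_YY_HH_MM pattern; assuming that format, a filename like the one above could be reproduced with:

from datetime import datetime

# Assumed MM_DD_YY_HH_MM timestamp, inferred from the example filename.
area, site = "copyright", "ebay"
filename = f"{area}_{site}_{datetime.now().strftime('%m_%d_%y_%H_%M')}.json"
# e.g. copyright_ebay_06_15_23_14_56.json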
Each json file is structured by unique page as follows:
{
    site-id: id of the site (int),
    site-name: name of the site (str),
    site-url: url of the site (str),
    pages: {
        0: {
            url: url of the page/article (str),
            html: raw html from the url (str)
        },
        1: ...,
        ...
    }
}
For example:
{
    site-id: 4,
    site-name: "twitter",
    site-url: "https://www.twitter.com",
    pages: {
        0: {
            url: "https://help.twitter.com/en/rules-and-policies/crisis-misinformation",
            html: "<html lang=\"en\" dir=\"...\"> ... </script></body></html>"
        },
        1: {
            url: "https://help.twitter.com/en/rules-and-policies/medical-misinformation-policy",
            html: "<html lang=\"en\" dir=\"...\"> ... </script></body></html>"
        },
        2: {
            url: "https://help.twitter.com/en/rules-and-policies/france-false-information",
            html: "<html lang=\"en\" dir=\"...\"> ... </script></body></html>"
        }
    }
}
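A scraped file can be loaded and iterated over as in the sketch below (note that once serialized to json, the integer page keys come back as strings):

import json

# Sketch: read one scraper output file and walk its pages.
with open("all_htmls/copyright_ebay_06_15_23_14_56.json") as f:
    data = json.load(f)

print(data["site-name"], data["site-url"])
for page_id, page in data["pages"].items():  # keys are strings after the json round trip
    print(page_id, page["url"], len(page["html"]))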
The extractor script pulls page text out of raw HTMLs into structured json files.
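Conceptually, the extraction step resembles the sketch below. It assumes an HTML parser such as BeautifulSoup purely for illustration; extractor.py may use different tooling and additional filtering.

from bs4 import BeautifulSoup

def extract_fragments(raw_html):
    # Illustrative only: drop script/style content, then collect the
    # remaining visible text fragments from the parsed document.
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return list(soup.stripped_strings)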
The extractor requires as input:
- A path to the data directory. It should contain an all_htmls folder with the scraper output.
- User-specified options
To run the extractor script, run:
$ python extractor.py [datadir] [OPTIONS]
For more details about extractor.py positional arguments and options, run:
$ python extractor.py --help
The extractor.py script outputs folders of structured json files containing text extracted from raw scraped HTMLs.
In the specified data directory, the script creates an all_text folder containing a folder for each search area. Each area folder contains a json file for each site.
For example:
copyright/ebay.json
Each json file is structured by unique page as follows:
{
    platform: platform,
    area: search term area,
    pages: [
        {
            page_id: page id,
            source: link from which page was scraped,
            text: list of text fragments extracted from page
        },
        ...
    ]
}
For example:
{
    platform: 'instagram',
    area: 'copyright',
    pages: [
        {
            page_id: 10,
            source: 'www.instagram.com/copyright',
            text: [
                'Instagram Copyright',
                'Our policy\n',
                ...
            ]
        },
        ...
    ]
}
The refiner reconstructs sentences from fragmented strings extracted from raw HTML.
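In rough terms, refinement joins fragments back together, splits the result into sentences, and records which search terms each passage contains. The sketch below is illustrative only; refiner.py reconstructs sentences more carefully and exposes options for doing so.

import re

def refine(fragments, search_terms):
    # Sketch: glue fragments together, split into sentences, and keep
    # only sentences that mention at least one search term.
    joined = " ".join(f.strip() for f in fragments if f.strip())
    sentences = re.split(r"(?<=[.!?])\s+", joined)
    passages = []
    for sentence in sentences:
        found = [t for t in search_terms if t.lower() in sentence.lower()]
        if found:
            passages.append({"terms": found, "text": [sentence]})
    return passages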
The refiner requires as input:
- A path to the data directory. It should contain an all_text folder with the extractor output.
- A csv file of search terms.
- User-specified options
To run the refiner script, run:
$ python refiner.py [datadir] [search_terms] [OPTIONS]
For more details about refiner.py positional arguments and options, run:
$ python refiner.py --help
The refiner.py script outputs folders of structured json files containing refined sentences.
In the specified data directory, the script creates a passages folder containing a folder for each search area. Each area folder contains a json file for each site.
For example:
copyright/ebay.json
Each json file is structured by unique page as follows:
{
    platform: platform,
    area: search term area,
    pages: [
        {
            page_id: page id,
            source: link from which page was scraped,
            passages: [
                {
                    terms: list of found search terms,
                    text: list of refined sentences
                },
                ...
            ]
        },
        ...
    ]
}
For example:
{
    platform: 'instagram',
    area: 'copyright',
    pages: [
        {
            page_id: 10,
            source: 'www.instagram.com/copyright',
            passages: [
                {
                    terms: ['copyright'],
                    text: [
                        'This is our copyright policy for the instagram app.\n',
                        ...
                    ]
                },
                ...
            ]
        },
        ...
    ]
}
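The refined passages are the final dataset. Assuming the data directory is named data/, they can be consumed with a sketch like this:

import json
from pathlib import Path

# Sketch: iterate over every refined passage under data/passages/AREA/SITE.json.
for path in Path("data/passages").glob("*/*.json"):
    doc = json.loads(path.read_text())
    for page in doc["pages"]:
        for passage in page["passages"]:
            print(doc["platform"], doc["area"], passage["terms"], " ".join(passage["text"])[:60])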
We provide a bash script to run the scraper, extractor, and refiner in sequence.
The pipeline script requires as input:
- Path to the xlsx file containing seed links
- Path to the csv file containing search terms
- User-specified options
To run the pipeline script, run:
$ ./pipeline.sh -l [links] -t [search terms] [OPTIONS]
For more details about pipeline.sh positional arguments and options, run:
$ ./pipeline.sh -h
The pipeline script runs each of the three scripts in sequence, so the output is exactly the same as if they were run manually (see above).
The fill script can be used to fill in missing data and runs on pipeline output. It searches for individual missing pages (see find_empties) and then re-scrapes them in random order, which is designed to work around possible request throttling and similar issues.
The fill script requires as input:
- path to directory containing all pipeline output
To run the fill script, run:
$ python util/fill.py [datadir]
For details about fill.py positional arguments and options, run:
$ python util/fill.py -h
There is no explicit output; the script inserts data directly into the json files in the data directory. You can verify that the data was filled using the find_empties script in the util folder.
The find_empties script outputs, as json, the list of empty data points used by the fill script. It searches for bad data in the refiner output and compiles a list; running the script directly writes this list to a json file. The criteria for deciding whether a data point is bad can be changed via the success() function.
The find_empties script requires as input:
- path to directory containing all pipeline output
- filename to output json to
To run the find_empties script, run:
$ python util/find_empties.py [datadir] [outfile]
For details about find_empties.py positional arguments and options, run:
$ python util/find_empties.py -h
This script writes a json file to the filename specified by the outfile argument. The json file may be loaded back into a list and is structured as follows:
[
(
site,
area,
page_id,
source
),
...
]
For example:
[
(
'instagram',
'copyright',
'10',
'www.instagram.com/copyright'
),
...
]
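Note that json has no tuple type, so if the list is written with Python's standard json module, each entry comes back as a list when loaded. A minimal sketch of reading it back (the filename empties.json is just an example):

import json

# Sketch: load the find_empties output back into a list of tuples.
with open("empties.json") as f:
    empties = [tuple(entry) for entry in json.load(f)]

for site, area, page_id, source in empties:
    print(site, area, page_id, source)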