This is a public code and data release for the research paper "I never signed up for this! Privacy implications of email tracking.", which will appear at PETS 2018. Portions of the code for this project borrow heavily from Jeffrey's undergraduate senior thesis, available here.
Authors: Steven Englehardt (@englehardt), Jeffrey Han (@itdelatrisu), and Arvind Narayanan (@randomwalker)
Paper: available here.
Core components:
crawler_emails/
- A web crawler, built on OpenWPM, to simulate email views and link clicks.
crawler_mailinglists/
- A web crawler, built on OpenWPM, to find and submit mailing list sign-ups.
email-tracking-tester/
- A tool to test the privacy properties of a mail client.
mailserver/
- The mail server used to collect our corpus of emails.
analysis/
- Coming soon
Additional documentation is available in the README of each component subdirectory.
- The framework is fully tested only on Ubuntu 16.04, and requires Java and Python runtime environments.
- The processes (described below) can be run on separate machines. The mail server is OS-independent, but the web crawlers only run on Linux.
- Depending on the number of registered sites, the mail server might store anywhere from a few hundred megabytes to tens of gigabytes of data on disk per month.
The system consists of three long-running processes:
- The mail server, which receives, stores, and analyzes incoming mail.
$ cd mailserver
$ mvn clean package
$ java -jar target/mailserver.jar
- The mailing list crawler, which crawls a list of input sites and searches for
mailing lists.
$ cd crawler_mailinglists
$ python crawl_mailinglist_signup.py
- The email crawler, which renders emails in a simulated webmail environment
and visits links from those emails.
$ cd crawler_emails
$ python crawl_*.py
Running the mail server requires a domain name with MX records pointing to the server. Additionally, if the mailing list crawler runs on machines other than the mail server's machine, host records (A, CNAME) must also be set.
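As an illustration, the required DNS entries for a hypothetical domain might look like the zone fragment below. The domain example.com, the host name mail.example.com, and the address 203.0.113.10 are placeholders, not values from this project.

```
; Hypothetical zone entries for a mail-server deployment (example.com).
example.com.        IN  MX  10  mail.example.com.   ; route incoming mail to the server
mail.example.com.   IN  A   203.0.113.10            ; host record, needed by crawlers on other machines
```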
The following data used in the analysis is available for download:
Includes email metadata (subject, sender, etc.) and email body content.
Download link: mailbox.tar.bz2
Contents:
email_inbox.sqlite
users table
- Email address registration records. Maps each email address to its registration site and time.
inbox table
- Subject, sender, delivery time, and other metadata for each email.
mail/
- Directory of raw .eml files saved by the mail server. Use the inbox table of the email_inbox.sqlite database to navigate.
html/
- HTML bodies parsed from the corresponding raw email bodies. These are the HTML emails loaded by the crawlers.
html_after_filtering/
- HTML bodies after filtering tracking tags using EasyList and EasyPrivacy. See Section 7 of the paper.
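A minimal sketch of navigating email_inbox.sqlite with Python's standard sqlite3 module. Since the exact column layouts of the users and inbox tables are best discovered from the database itself, the helpers below only enumerate tables and columns rather than assuming a schema:

```python
import sqlite3

def list_tables(db_path):
    # Enumerate the tables in a SQLite database (e.g. email_inbox.sqlite).
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        return [r[0] for r in rows]
    finally:
        con.close()

def table_columns(db_path, table):
    # Discover a table's column names via PRAGMA table_info,
    # rather than hard-coding a schema.
    con = sqlite3.connect(db_path)
    try:
        return [r[1] for r in con.execute("PRAGMA table_info(%s)" % table)]
    finally:
        con.close()
```

From there, rows in the inbox table can be used to locate the matching raw .eml files in mail/.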
Crawl data generated by opening the HTML email bodies given in the html/
directory of the mailbox using a simulated webmail client. This is the primary
dataset used for the results in Section 4.
Download link: 2017-05-17_email_tracking_view_crawl.sqlite.bz2
Crawl data generated by opening the HTML email bodies given in the
html_after_filtering/
directory of the mailbox using a simulated webmail client.
This is the primary dataset used for the results in the "Server-side email
content filtering" subsection of Section 7.
Download link: 2017-05-28_email_tracking_filtered_view_crawl.sqlite.bz2
Crawl data generated by visiting a sample of links extracted from the HTML
email bodies of each email in the html/
directory of the mailbox. This is the
primary dataset used for the results in Section 5.
Download link: 2017-05-17_email_tracking_click_crawl.sqlite.bz2
Crawl data generated by running our mailing list sign-up procedure on the top sites, instrumenting the resulting pages to compute the overall level of successful sign-ups. This is the primary dataset used for the results in the "Form submission measurement" subsection of Section 3.
Download link: 2017-08-13_signup_success_measurement.sqlite.bz2
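Each download is a bzip2-compressed SQLite database. A small helper for decompressing a dump before opening it (the function name is ours; any bunzip2-style tool works equally well):

```python
import bz2
import shutil

def decompress_bz2(src, dst):
    # Stream-decompress a .bz2 download (e.g. a *.sqlite.bz2 dump) to dst
    # without loading the whole file into memory.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

The resulting .sqlite file can then be opened with the sqlite3 command-line shell or any SQLite client.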
This project was funded by NSF Grant CNS 1526353, a research grant from Mozilla, and Amazon AWS Cloud Credits for Research.