This repository contains materials to reproduce the findings featured in our story, "Google Ad Portal Equated 'Black Girls' With Porn" from our series, Google the Giant.
Screenshots and figures from our story can be found in the data folder.
Jupyter notebooks used for data preprocessing and analysis are available in the notebooks folder.
💡 Disclaimer: This repository contains code and data with explicit and graphically sexual language.
pip install -r requirements.txt
Where the raw inputs and intermediaries are stored.
data/
├── input
│ ├── browser
│ ├── raw-exports
│ └── screenshots
├── intermediary
│ ├── all-keywords.csv
│ ├── keywords-labelled-as-adult.json
│ ├── preprocessed
│ ├── websites-from-search.csv
│ └── websites-we-found-to-be-pornographic.csv
└── output
  ├── volume-of-adult-rec-keywords.csv
  └── volume-of-adult-rec-keywords.png
We have raw exports from Google Keyword Planner in data/input/raw-exports.
The same input is exported with and without the "exclude adult ideas" filter.
The only column we use is the recommended Keywords column.
Collected July 8-12, 2020.
You can view screenshots from Keyword Planner in data/input/screenshots.
We have two screenshots of a search for "Black girls": one with the adult filter and one without.
We preprocess and merge these files in data/intermediary/preprocessed.
Here we add three boolean columns:
Google_Adult - True if Google filtered out the keyword when you "exclude adult ideas".
SERP_Adult - True if the majority of sites in the recommended keyword's corresponding Google search results are self-described pornographic sites.
All_Adult - True if either of the two previously mentioned columns is True.
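The three columns can be sketched in plain Python as follows. This is a minimal illustration, not the code from our notebooks: the function name `label_keywords`, its arguments, and the placeholder keywords and domains are all hypothetical. It assumes that a keyword counts as filtered by Google when it appears in the unfiltered export but not in the filtered one, and that "majority" means strictly more than half of the domains in a keyword's search results.

```python
def label_keywords(unfiltered, filtered, serp_domains, porn_domains):
    """Return one row per keyword with the three boolean columns.

    unfiltered / filtered: keywords exported without / with the adult filter.
    serp_domains: keyword -> list of domains in its search results (hypothetical shape).
    porn_domains: set of domains labelled as self-described pornographic.
    """
    filtered_set = set(filtered)
    rows = []
    for keyword in unfiltered:
        # Assumption: a keyword missing from the filtered export was filtered out.
        google_adult = keyword not in filtered_set
        domains = serp_domains.get(keyword, [])
        n_porn = sum(1 for d in domains if d in porn_domains)
        # Assumption: "majority" means strictly more than half of the results.
        serp_adult = len(domains) > 0 and n_porn > len(domains) / 2
        rows.append({
            "Keyword": keyword,
            "Google_Adult": google_adult,
            "SERP_Adult": serp_adult,
            "All_Adult": google_adult or serp_adult,
        })
    return rows


# Placeholder data, not taken from the real exports.
rows = label_keywords(
    unfiltered=["kw1", "kw2", "kw3"],
    filtered=["kw2", "kw3"],
    serp_domains={
        "kw1": ["a.example", "b.example"],
        "kw2": ["x.example", "y.example", "a.example"],
        "kw3": ["a.example", "b.example", "x.example"],
    },
    porn_domains={"x.example", "y.example"},
)
for row in rows:
    print(row)
```

Here `kw1` is flagged only by Google_Adult, `kw2` only by SERP_Adult, and `kw3` by neither, so All_Adult is True for the first two rows.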
We have the source code (HTML) of the Google search results pages (SERPs) for all 1.9K recommended keywords in data/input/browser.
We have the 200 most-shared web domains (from the SERPs above) in data/intermediary/websites-labelled-as-pornographic.csv.
We determine which of these sites self-identify as pornographic by looking for "porn" in the search listings for each website. We found 132 of these websites to be pornographic.
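The labelling rule described above might look like the sketch below. The function names and the example listings are hypothetical, and two details are assumptions rather than facts about our notebooks: the match is case-insensitive, and a single listing containing "porn" is enough to label a site.

```python
def self_identifies_as_pornographic(listings):
    """True if any search listing for a site contains the substring 'porn'.

    Assumption: case-insensitive match; one matching listing is enough.
    """
    return any("porn" in listing.lower() for listing in listings)


def label_domains(listings_by_domain):
    """Map each domain to its self-identification label."""
    return {
        domain: self_identifies_as_pornographic(listings)
        for domain, listings in listings_by_domain.items()
    }


# Made-up listings for two placeholder domains.
example = {
    "site-a.example": ["Site A: free porn videos", "Site A | Home"],
    "site-b.example": ["Site B: local news and weather"],
}
labels = label_domains(example)
print(labels)
```

A substring test like this only captures how a site describes itself in its listings; it says nothing about sites that never use the word.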
We have aggregated tables and figures featured in our story in data/output. The table volume-of-adult-rec-keywords.csv contains counts and percentages of recommended keywords that Google identifies as "adult", that have majority self-described pornographic sites in their search results, and that are neither adult nor pornographic.
To reproduce our results, run the notebooks sequentially.
Gets the top-shared domains from the 1.9K keywords recommended by Keyword Planner. Determines how many recommended keywords' search results contain links to self-identified pornographic sites.
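One way to find the top-shared domains is sketched below using only the standard library. The function name `top_shared_domains` and the input shape (keyword mapped to a list of result URLs) are hypothetical, and counting each domain at most once per keyword is an assumption; the notebook may count differently.

```python
from collections import Counter
from urllib.parse import urlparse


def top_shared_domains(serp_links, n=200):
    """Return the n domains that appear in the most keywords' search results.

    serp_links: keyword -> list of result URLs (hypothetical shape).
    """
    counts = Counter()
    for links in serp_links.values():
        # Assumption: a domain counts once per keyword, however many links it has.
        domains = {urlparse(url).netloc.lower() for url in links}
        domains.discard("")  # skip URLs that have no network location
        counts.update(domains)
    return counts.most_common(n)


# Placeholder URLs, not taken from the real search results.
serp_links = {
    "kw1": ["https://a.example/1", "https://a.example/2", "https://b.example/"],
    "kw2": ["https://a.example/3"],
}
top = top_shared_domains(serp_links, n=1)
print(top)
```

Note that `netloc` keeps prefixes such as `www.`, so a real pipeline would probably need to normalize hostnames before counting.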
For each of our eight inputs, we get the count and percentage of recommended keywords which Google claims are "Adult", and which keywords we found to be pornographic. This is also where the figure featured in our story is produced.
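The counts and percentages could be computed along these lines. This is a sketch under assumptions: the function name `summarize`, the "Neither" label, and the rounding to one decimal place are hypothetical, and the rows reuse the Google_Adult and SERP_Adult column names from the preprocessed files. The categories may overlap, since a keyword can be flagged by both columns.

```python
def summarize(rows):
    """Count and percentage of keywords per category for one input."""
    total = len(rows)
    counts = {
        "Google_Adult": sum(r["Google_Adult"] for r in rows),
        "SERP_Adult": sum(r["SERP_Adult"] for r in rows),
        "Neither": sum(not (r["Google_Adult"] or r["SERP_Adult"]) for r in rows),
    }
    return {
        name: {
            "count": count,
            # Guard against an empty input to avoid dividing by zero.
            "percent": round(100 * count / total, 1) if total else 0.0,
        }
        for name, count in counts.items()
    }


# Placeholder rows, not taken from the real data.
rows = [
    {"Google_Adult": True, "SERP_Adult": False},
    {"Google_Adult": True, "SERP_Adult": True},
    {"Google_Adult": False, "SERP_Adult": False},
    {"Google_Adult": False, "SERP_Adult": False},
]
summary = summarize(rows)
print(summary)
```

Running this once per input and stacking the results would give a table with one group of counts and percentages for each of the eight inputs.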
Copyright 2020, The Markup News Inc.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.