
ethics/ transparency audit #3

Open
jwzimmer-zz opened this issue Jan 7, 2021 · 4 comments
jwzimmer-zz commented Jan 7, 2021

Alpha version of checklist at: https://www.overleaf.com/read/vrqgnmmysrbc

@jwzimmer-zz (Owner, Author):

Rough list of potential items for checklist

  1. Ethics as part of a project's preliminary Needs Assessment
  • Identify tools needed - any issues with procuring tools, or resources or knowledge required for tools?
  • Identify data needed - any issues with procuring data, or resources or knowledge related to data?
  • Any issue with publishing data? Should it be anonymized? How easy would it be to de-anonymize?
  2. Since this project passively analyzes data that already exists (no gathering new data or participants), what is the analogue of informed consent for the people who created the data?
  3. How is the outcome of this project going to be disseminated? Who will see it? Who can see it? Who should see it?
  4. What is the potential impact of this project?
  5. What are the potential harms of this project?
  6. What will happen to the data and other work-product generated during this project, once the project is over?
  7. For transparency and usability: create a "birth certificate" summary of the project
  8. Recurring audit for transparency, ethics, etc.
  9. Make changes in response to the audit
  10. Donate money to (1) a social justice cause and (2) an environmental cause

@jwzimmer-zz (Owner, Author):

First pass

  1. Ethics as part of a project's preliminary Needs Assessment
  • Identify tools needed - any issues with procuring tools, or resources or knowledge required for tools?
  • Identify data needed - any issues with procuring data, or resources or knowledge related to data?
  • Any issue with publishing data? Should it be anonymized? How easy would it be to de-anonymize?

I think since all of the tvtropes site content I have access to is already public, and there is no plan to focus on individual users in any way, there aren't any new risks to those users. I think the tools needed will be mostly open-source, semi-open-source, or provided by UVM.

  2. Since this project passively analyzes data that already exists (no gathering new data or participants), what is the analogue of informed consent for the people who created the data?

I think their participation in the site is sufficient consent.

  3. How is the outcome of this project going to be disseminated? Who will see it? Who can see it? Who should see it?

I don't know.

  4. What is the potential impact of this project?

I don't know.

  5. What are the potential harms of this project?

Scraping could burden the tvtropes site; hopefully that is completely mitigated by the rate limit on the wget process. We could reinforce narratives and tropes by discussing them. We could paint the participants of the site reductively.

  6. What will happen to the data and other work-product generated during this project, once the project is over?

I think it will stay indefinitely on github. I don't think that introduces any additional risk.

@jwzimmer-zz changed the title from "make an ethics/ transparency audit checklist" to "ethics/ transparency audit" on Jan 10, 2021

jwzimmer-zz commented Jan 22, 2021

Going over the questions in the Datasheets for Datasets paper and answering some of them in the interest of transparency... these aren't the most thorough and careful answers ever, but I'd rather have something than nothing as far as describing the repo in an organized fashion. I think reading over this at least gives you an idea of what this is.

Questions below are from https://arxiv.org/abs/1803.09010:

Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:1803.09010 [cs.DB] (or arXiv:1803.09010v7 [cs.DB] for this version)

Re https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes

Section 3.1 Motivations

  • For what purpose was the dataset created? Was there a specific task
    in mind? Was there a specific gap that needed to be filled? Please provide
    a description.
    We gathered this data - I think it may be too disorganized and chaotic to properly comprise a dataset - because we wanted to try to gain insight into how stories work in our culture via the information on the TV Tropes website. Proximally, we were working on a project for our Data Science class and following the advice of our advisors/mentors.
  • Who created the dataset (e.g., which team, research group) and on
    behalf of which entity (e.g., company, institution, organization)?
    Phil Nguyen & Julia Zimmerman on behalf of UVM, Nick Cheney's Stat 287 class and Danforth & Dodd's Computational Story Lab.
  • Who funded the creation of the dataset? If there is an associated
    grant, please provide the name of the grantor and the grant name and
    number.
    UVM, indirectly, since they provide the relevant classes and support the students and professors mentioned.

Section 3.2 Composition

  • What do the instances that comprise the dataset represent (e.g.,
    documents, photos, people, countries)? Are there multiple types of
    instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
    Most of the pages we downloaded from the tvtropes site are "trope" pages, describing a trope that appears in media (a movie, a tv show, a comic book, etc.), listing works it has appeared in, and linking to related tropes. We constructed objects that have a network structure, from dictionaries that map a trope page to the tropes that page links to, to gml files and gephi files that model parts of the site as a network whose nodes are tropes and whose edges indicate links within the body of that trope. There are other assorted objects, scripts, and content as well.
  • How many instances are there in total (of each type, if appropriate)?
    I don't know. There are around 27000 trope pages.
  • Does the dataset contain all possible instances or is it a sample
    (not necessarily random) of instances from a larger set? If the
    dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how
    this representativeness was validated/verified. If it is not representative
    of the larger set, please describe why not (e.g., to cover a more diverse
    range of instances, because instances were withheld or unavailable).
    Not all. We defined "trope" based on Codebook/ what are all the things tv-tropes#13 (comment) and intended to capture all of those.
  • What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
    There are objects created via python, mostly json and gml files, and charts. The raw data is mostly .html files we downloaded from the tv tropes site (text only).
  • Are there recommended data splits (e.g., training, development/validation,
    testing)? If so, please provide a description of these splits, explaining
    the rationale behind them.
    No. Not intended for machine learning.
  • Are there any errors, sources of noise, or redundancies in the
    dataset? If so, please provide a description.
    Probably! So caveat emptor please.
  • Is the dataset self-contained, or does it link to or otherwise rely on
    external resources (e.g., websites, tweets, other datasets)?
    I think "self-contained", although there are almost certainly links in the html pages that link to pages we did not download from tv tropes. So maybe not self contained?
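The dict-to-GML conversion described above isn't shown in this thread; as an illustration only (the repo may have used a graph library such as networkx instead, and `dict_to_gml` is a hypothetical name), a `{trope: [linked tropes]}` dictionary can be serialized to a minimal GML file like this:

```python
def dict_to_gml(links):
    """Serialize a {page: [linked pages]} dict as minimal directed GML text."""
    # Collect every page that appears as a source or a target.
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines += ["  node [", f"    id {i}", f'    label "{name}"', "  ]"]
    for src, targets in links.items():
        for dst in targets:
            lines += ["  edge [",
                      f"    source {ids[src]}",
                      f"    target {ids[dst]}",
                      "  ]"]
    lines.append("]")
    return "\n".join(lines)
```

The output can be opened directly in Gephi, which reads plain GML.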

Section 3.3 Collection process

  • How was the data associated with each instance acquired? Was
    the data directly observable (e.g., raw text, movie ratings), reported by
    subjects (e.g., survey responses), or indirectly inferred/derived from other
    data (e.g., part-of-speech tags, model-based guesses for age or language)?
    If data was reported by subjects or indirectly inferred/derived from other
    data, was the data validated/verified? If so, please describe how.
    All the data was scraped from the TV Tropes website (https://tvtropes.org/). We used beautiful soup to look through the html for the parts of each page we thought were relevant. We sort-of documented what different things in the repo are here: Codebook/ what are all the things tv-tropes#13
  • What mechanisms or procedures were used to collect the data
    (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
    Wget, beautiful soup, we also downloaded some pages manually. Our validation processes were mostly programmatic, looping over things and making sure that output matched what was expected, or manual, comparing our results to what we could see on the site and making sure they looked reasonably similar (use index dicts to make trope list to compare to list from Main directory tv-tropes#9, check random subset of dicts to verify they match each other and the website tv-tropes#12, Index page dicts - links to masterlist tropes only tv-tropes#18).
  • If the dataset is a sample from a larger set, what was the sampling
    strategy (e.g., deterministic, probabilistic with specific sampling
    probabilities)?
    We intended to get every trope html page on the site that was tagged as page type "trope", but I don't have proof we succeeded.
  • Who was involved in the data collection process (e.g., students,
    crowdworkers, contractors) and how were they compensated (e.g.,
    how much were crowdworkers paid)?
    Students (Phil and Julia), with some advice from faculty and staff at UVM. Julia was compensated indirectly as a full-time PhD student (thank you, UVM!).
  • Over what timeframe was the data collected? Does this timeframe
    match the creation timeframe of the data associated with the instances
    (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
    Dates should actually be reliable! We pushed and pulled from the repo as we worked, so timestamps should be pretty accurate!
  • Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review
    processes, including the outcomes, as well as a link or other access point
    to any supporting documentation.
    We discussed the ethics ourselves and included that in our final write-up. This was not a formal process. These answers are also part of an ethical review, of a sort, since I (Julia) am going over these questions in an attempt to provide some transparency as an ethical requirement of this project. Not a high bar, but basically I don't want to be a jerk.
  • Does the dataset relate to people? If not, you may skip the remainder
    of the questions in this section.
    Yes
  • Did you collect the data from the individuals in question directly,
    or obtain it via third parties or other sources (e.g., websites)?
    From the tvtropes website.
  • Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point
    to, or otherwise reproduce, the exact language of the notification itself.
    No. The website is public; I don't think any user of the site could use it without understanding that, so we didn't think we needed to notify anyone.
  • Did the individuals in question consent to the collection and use
    of their data? If so, please describe (or show with screenshots or other
    information) how consent was requested and provided, and provide a
    link or other access point to, or otherwise reproduce, the exact language
    to which the individuals consented.
    No.
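The repo's actual Beautiful Soup parsing code isn't reproduced in this thread; a minimal sketch of the step described above (assuming the standard `/pmwiki/pmwiki.php/Main/` wiki-link pattern on tvtropes pages; `trope_links` is a hypothetical name) might look like:

```python
from bs4 import BeautifulSoup

# tvtropes wiki pages share this path prefix; assumed here as the link filter.
TROPE_PATH = "/pmwiki/pmwiki.php/Main/"

def trope_links(html):
    """Return the names of trope pages linked from a downloaded page."""
    soup = BeautifulSoup(html, "html.parser")
    seen, names = set(), []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if TROPE_PATH in href:
            name = href.rstrip("/").split("/")[-1]
            if name not in seen:  # keep first occurrence, drop duplicates
                seen.add(name)
                names.append(name)
    return names
```

Running this over each saved .html file yields the `{trope: [linked tropes]}` dictionaries the network objects were built from.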

Section 3.4 Preprocessing/cleaning/labeling

  • Was any preprocessing/cleaning/labeling of the data done (e.g.,
    discretization or bucketing, tokenization, part-of-speech tagging,
    SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the
    remainder of the questions in this section.
    Sort of, in that we were only looking at certain pages and certain attributes within those pages.
  • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled
    data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
    Yes, everything we scraped directly from the site is saved in these repos as well.
  • Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.
    All (or maybe "most", if we left something out unintentionally) the python scripts and notebooks we used are in these repos as well. We mostly used python, wget, and beautiful soup.

Section 3.5 Uses

  • Has the dataset been used for any tasks already? If so, please provide
    a description.
    Yes, it's been used for a Data Science 1 project, and may be used for continuation of related projects.
  • Is there a repository that links to any or all papers or systems that
    use the dataset? If so, please provide a link or other access point.
    Yes: https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes
  • What (other) tasks could the dataset be used for?
    I'm not sure.
  • Is there anything about the composition of the dataset or the way
    it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user
    might need to know to avoid uses that could result in unfair treatment
    of individuals or groups (e.g., stereotyping, quality of service issues) or
    other undesirable harms (e.g., financial harms, legal risks) If so, please
    provide a description. Is there anything a future user could do to mitigate
    these undesirable harms?
    Yes, the nature of the data is that it involves a lot of stereotypes! It's not exactly clear to me how to say when you are definitely referencing or using a trope that has stereotypes in it vs. when, by doing so, you're endorsing it. I don't want to reinforce stereotypes, but I do want to study stories. I hope any future users will feel similarly.
  • Are there tasks for which the dataset should not be used? If so,
    please provide a description.
    Probably lots of tasks for which this data isn't suited. It isn't meant to be used as a basis for machine learning or solving any particular problem.

Section 3.6 Distribution

  • Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which
    the dataset was created? If so, please provide a description.
    No. It's an open repo, so it's public, but I have no plan to intentionally disseminate it or get other people to use it.
  • How will the dataset be distributed (e.g., tarball on website,
    API, GitHub)? Does the dataset have a digital object identifier (DOI)?
    It lives in two repos on GitHub indefinitely: https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes
  • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use
    (ToU)? If so, please describe this license and/or ToU, and provide a link
    or other access point to, or otherwise reproduce, any relevant licensing
    terms or ToU, as well as any fees associated with these restrictions.
    Not as far as I'm concerned. I don't know about the relevant legal things. I wanted to look at a static version of the TV Tropes site and be able to mess with it without being onerous to their infrastructure; that's the reason for making our "dataset".

Section 3.7 Maintenance

  • Who is supporting/hosting/maintaining the dataset?
    As long as we're working on the project, Phil Nguyen and Julia Zimmerman. As long as GitHub lets me keep my repos here simply and for free, also GitHub, indirectly.
  • How can the owner/curator/manager of the dataset be contacted
    (e.g., email address)?
    [email protected]
  • Is there an erratum? If so, please provide a link or other access point.
    No...
  • Will the dataset be updated (e.g., to correct labeling errors, add
    new instances, delete instances)? If so, please describe how often, by
    whom, and how updates will be communicated to users (e.g., mailing list,
    GitHub)?
    Yes, it will be updated as Phil and I work on the project. There are no other users so updates won't be communicated, although they'll be documented on GitHub as the project has been up to this point.
  • If the dataset relates to people, are there applicable limits on the
    retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a
    fixed period of time and then deleted)? If so, please describe these
    limits and explain how they will be enforced.
    No.
  • If others want to extend/augment/build on/contribute to the
    dataset, is there a mechanism for them to do so? If so, please
    provide a description. Will these contributions be validated/verified?
    If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please
    provide a description.
    Well, there is no process for this. If you contact me I will be happy to check out your proposed contributions.


jwzimmer-zz commented Jun 28, 2021

This is still relevant and should be re-visited. What we wrote above was mainly thinking of the tv tropes data, not the character space data, so we should re-visit this issue in that context.
