
ethics/ transparency audit #3

Open
jwzimmer-zz opened this issue Jan 7, 2021 · 4 comments
jwzimmer-zz commented Jan 7, 2021

Alpha version of checklist at: https://www.overleaf.com/read/vrqgnmmysrbc

@jwzimmer-zz (Owner, Author):

Rough list of potential items for checklist

  1. Ethics as part of a project's preliminary Needs Assessment
  • Identify tools needed - any issues with procuring tools, or resources or knowledge required for tools?
  • Identify data needed - any issues with procuring data, or resources or knowledge related to data?
  • Any issue with publishing data? Should it be anonymized? How easy would it be to de-anonymize?
  2. Since this project passively analyzes data that already exists (no gathering new data or participants), what is the analogue of informed consent for the people who created the data?
  3. How is the outcome of this project going to be disseminated? Who will see it? Who can see it? Who should see it?
  4. What is the potential impact of this project?
  5. What are the potential harms of this project?
  6. What will happen to the data and other work-product generated during this project, once the project is over?
  7. For transparency and usability: create a "birth certificate" summary of the project
  8. Recurring audit for transparency, ethics, etc.
  9. Make changes in response to the audit
  10. Donate money to (1) a social justice cause and (2) an environmental cause

@jwzimmer-zz (Owner, Author):

First pass

  1. Ethics as part of a project's preliminary Needs Assessment
  • Identify tools needed - any issues with procuring tools, or resources or knowledge required for tools?
  • Identify data needed - any issues with procuring data, or resources or knowledge related to data?
  • Any issue with publishing data? Should it be anonymized? How easy would it be to de-anonymize?

I think since all of the tvtropes site content I have access to is already public, and there is no plan to focus on individual users in any way, there aren't any new risks to those users. I think the tools needed will be mostly open-source, semi-open-source, or provided by UVM.

  2. Since this project passively analyzes data that already exists (no gathering new data or participants), what is the analogue of informed consent for the people who created the data?

I think their participation in the site is sufficient consent.

  3. How is the outcome of this project going to be disseminated? Who will see it? Who can see it? Who should see it?

I don't know.

  4. What is the potential impact of this project?

I don't know.

  5. What are the potential harms of this project?

Scraping could burden the tvtropes site; hopefully that is completely mitigated by the rate limit on the wget process. We could reinforce narratives and tropes by discussing them. We could paint the participants of the site reductively.

  6. What will happen to the data and other work-product generated during this project, once the project is over?

I think it will stay indefinitely on github. I don't think that introduces any additional risk.

@jwzimmer-zz changed the title from "make an ethics/ transparency audit checklist" to "ethics/ transparency audit" on Jan 10, 2021

jwzimmer-zz commented Jan 22, 2021

Going over the questions in the Datasheets for Datasets paper and answering some of them in the interest of transparency... these aren't the most thorough and careful answers ever, but I'd rather have something than nothing as far as describing the repo in an organized fashion. I think reading over this at least gives you an idea of what this is.

Questions below are from https://arxiv.org/abs/1803.09010:

Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:1803.09010 [cs.DB] (or arXiv:1803.09010v7 [cs.DB] for this version)

Re https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes

Section 3.1 Motivations

  • For what purpose was the dataset created? Was there a specific task
    in mind? Was there a specific gap that needed to be filled? Please provide
    a description.
    We gathered this data - I think it may be too disorganized and chaotic to properly comprise a dataset - because we wanted to try to gain insight into how stories work in our culture via the information on the TV Tropes website. Proximally, we were working on a project for our Data Science class and following the advice of our advisors/mentors.
  • Who created the dataset (e.g., which team, research group) and on
    behalf of which entity (e.g., company, institution, organization)?
    Phil Nguyen & Julia Zimmerman on behalf of UVM, Nick Cheney's Stat 287 class and Danforth & Dodd's Computational Story Lab.
  • Who funded the creation of the dataset? If there is an associated
    grant, please provide the name of the grantor and the grant name and
    number.
    UVM, indirectly, since they provide the relevant classes and support the students and professors mentioned.

Section 3.2 Composition

  • What do the instances that comprise the dataset represent (e.g.,
    documents, photos, people, countries)? Are there multiple types of
    instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
    Most of the pages we downloaded from the tvtropes site are "trope" pages, describing a trope that appears in media (a movie, a tv show, a comic book, etc.), listing works it has appeared in, and linking to related tropes. We constructed objects that have a network structure, from dictionaries that map a trope page to the tropes that page links to, to gml files and gephi files that model parts of the site as a network whose nodes are tropes and whose edges indicate links within the body of that trope. There are other assorted objects, scripts, and content as well.
  • How many instances are there in total (of each type, if appropriate)?
    I don't know. There are around 27000 trope pages.
  • Does the dataset contain all possible instances or is it a sample
    (not necessarily random) of instances from a larger set? If the
    dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how
    this representativeness was validated/verified. If it is not representative
    of the larger set, please describe why not (e.g., to cover a more diverse
    range of instances, because instances were withheld or unavailable).
    Not all. We defined "trope" based on Codebook/ what are all the things tv-tropes#13 (comment) and intended to capture all of those.
  • What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
    There are objects created via python, mostly json and gml files, and charts. The raw data is mostly .html files we downloaded from the tv tropes site (text only).
  • Are there recommended data splits (e.g., training, development/validation,
    testing)? If so, please provide a description of these splits, explaining
    the rationale behind them.
    No. Not intended for machine learning.
  • Are there any errors, sources of noise, or redundancies in the
    dataset? If so, please provide a description.
    Probably! So caveat emptor please.
  • Is the dataset self-contained, or does it link to or otherwise rely on
    external resources (e.g., websites, tweets, other datasets)?
    I think "self-contained", although there are almost certainly links in the html pages that link to pages we did not download from tv tropes. So maybe not self contained?
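The dict-to-GML conversion described above isn't shown in this thread; as an illustration only (the repo may have used a graph library such as networkx instead, and `dict_to_gml` is a hypothetical name), a `{trope: [linked tropes]}` dictionary can be serialized to a minimal GML file like this:

```python
def dict_to_gml(links):
    """Serialize a {page: [linked pages]} dict as minimal directed GML text."""
    # Collect every page that appears as a source or a target.
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines += ["  node [", f"    id {i}", f'    label "{name}"', "  ]"]
    for src, targets in links.items():
        for dst in targets:
            lines += ["  edge [",
                      f"    source {ids[src]}",
                      f"    target {ids[dst]}",
                      "  ]"]
    lines.append("]")
    return "\n".join(lines)
```

The output can be opened directly in Gephi, which reads plain GML.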

Section 3.3 Collection process

  • How was the data associated with each instance acquired? Was
    the data directly observable (e.g., raw text, movie ratings), reported by
    subjects (e.g., survey responses), or indirectly inferred/derived from other
    data (e.g., part-of-speech tags, model-based guesses for age or language)?
    If data was reported by subjects or indirectly inferred/derived from other
    data, was the data validated/verified? If so, please describe how.
    All the data was scraped from the TV Tropes website (https://tvtropes.org/). We used beautiful soup to look through the html for the parts of each page we thought were relevant. We sort-of documented what different things in the repo are here: Codebook/ what are all the things tv-tropes#13
  • What mechanisms or procedures were used to collect the data
    (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
    Wget, beautiful soup, we also downloaded some pages manually. Our validation processes were mostly programmatic, looping over things and making sure that output matched what was expected, or manual, comparing our results to what we could see on the site and making sure they looked reasonably similar (use index dicts to make trope list to compare to list from Main directory tv-tropes#9, check random subset of dicts to verify they match each other and the website tv-tropes#12, Index page dicts - links to masterlist tropes only tv-tropes#18).
  • If the dataset is a sample from a larger set, what was the sampling
    strategy (e.g., deterministic, probabilistic with specific sampling
    probabilities)?
    We intended to get every trope html page on the site that was tagged as page type "trope", but I don't have proof we succeeded.
  • Who was involved in the data collection process (e.g., students,
    crowdworkers, contractors) and how were they compensated (e.g.,
    how much were crowdworkers paid)?
    Students (Phil and Julia), with some advice from faculty and staff at UVM. Julia was compensated indirectly as a full-time PhD student (thank you, UVM!).
  • Over what timeframe was the data collected? Does this timeframe
    match the creation timeframe of the data associated with the instances
    (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
    Dates should actually be reliable! We pushed and pulled from the repo as we worked, so timestamps should be pretty accurate!
  • Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review
    processes, including the outcomes, as well as a link or other access point
    to any supporting documentation.
    We discussed the ethics ourselves and included that in our final write-up. This was not a formal process. These answers are also part of an ethical review, of a sort, since I (Julia) am going over these questions in an attempt to provide some transparency as an ethical requirement of this project. Not a high bar, but basically I don't want to be a jerk.
  • Does the dataset relate to people? If not, you may skip the remainder
    of the questions in this section.
    Yes
  • Did you collect the data from the individuals in question directly,
    or obtain it via third parties or other sources (e.g., websites)?
    From the tvtropes website.
  • Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point
    to, or otherwise reproduce, the exact language of the notification itself.
    No. The website is public; I don't think any user of the site could use it without understanding that, so we didn't think we needed to notify anyone.
  • Did the individuals in question consent to the collection and use
    of their data? If so, please describe (or show with screenshots or other
    information) how consent was requested and provided, and provide a
    link or other access point to, or otherwise reproduce, the exact language
    to which the individuals consented.
    No.
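The repo's actual Beautiful Soup parsing code isn't reproduced in this thread; a minimal sketch of the step described above (assuming the standard `/pmwiki/pmwiki.php/Main/` wiki-link pattern on tvtropes pages; `trope_links` is a hypothetical name) might look like:

```python
from bs4 import BeautifulSoup

# tvtropes wiki pages share this path prefix; assumed here as the link filter.
TROPE_PATH = "/pmwiki/pmwiki.php/Main/"

def trope_links(html):
    """Return the names of trope pages linked from a downloaded page."""
    soup = BeautifulSoup(html, "html.parser")
    seen, names = set(), []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if TROPE_PATH in href:
            name = href.rstrip("/").split("/")[-1]
            if name not in seen:  # keep first occurrence, drop duplicates
                seen.add(name)
                names.append(name)
    return names
```

Running this over each saved .html file yields the `{trope: [linked tropes]}` dictionaries the network objects were built from.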

Section 3.4 Preprocessing/cleaning/labeling

  • Was any preprocessing/cleaning/labeling of the data done (e.g.,
    discretization or bucketing, tokenization, part-of-speech tagging,
    SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the
    remainder of the questions in this section.
    Sort of, in that we were only looking at certain pages and certain attributes within those pages.
  • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled
    data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
    Yes, everything we scraped directly from the site is saved in these repos as well.
  • Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.
    All (or maybe "most", if we left something out unintentionally) the python scripts and notebooks we used are in these repos as well. We mostly used python, wget, and beautiful soup.

Section 3.5 Uses

  • Has the dataset been used for any tasks already? If so, please provide
    a description.
    Yes, it's been used for a Data Science 1 project, and may be used for continuation of related projects.
  • Is there a repository that links to any or all papers or systems that
    use the dataset? If so, please provide a link or other access point.
    Yes: https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes
  • What (other) tasks could the dataset be used for?
    I'm not sure.
  • Is there anything about the composition of the dataset or the way
    it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user
    might need to know to avoid uses that could result in unfair treatment
    of individuals or groups (e.g., stereotyping, quality of service issues) or
    other undesirable harms (e.g., financial harms, legal risks) If so, please
    provide a description. Is there anything a future user could do to mitigate
    these undesirable harms?
    Yes, the nature of the data is that it involves a lot of stereotypes! It's not exactly clear to me how to say when you are definitely referencing or using a trope that has stereotypes in it vs. when, by doing so, you're endorsing it. I don't want to reinforce stereotypes, but I do want to study stories. I hope any future users will feel similarly.
  • Are there tasks for which the dataset should not be used? If so,
    please provide a description.
    Probably lots of tasks for which this data isn't suited. It isn't meant to be used as a basis for machine learning or solving any particular problem.

Section 3.6 Distribution

  • Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which
    the dataset was created? If so, please provide a description.
    No. It's an open repo, so it's public, but I have no plan to intentionally disseminate it or get other people to use it.
  • How will the dataset be distributed (e.g., tarball on website,
    API, GitHub)? Does the dataset have a digital object identifier (DOI)?
    It lives in two repos on GitHub indefinitely: https://github.com/jwzimmer/tv-tropening & https://github.com/jwzimmer/tv-tropes
  • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use
    (ToU)? If so, please describe this license and/or ToU, and provide a link
    or other access point to, or otherwise reproduce, any relevant licensing
    terms or ToU, as well as any fees associated with these restrictions.
    Not as far as I'm concerned. I don't know about the relevant legal things. I wanted to look at a static version of the TV Tropes site and be able to mess with it without being onerous to their infrastructure; that's the reason for making our "dataset".

Section 3.7 Maintenance

  • Who is supporting/hosting/maintaining the dataset?
    As long as we're working on the project, Phil Nguyen and Julia Zimmerman. As long as GitHub lets me keep my repos here simply and for free, also GitHub, indirectly.
  • How can the owner/curator/manager of the dataset be contacted
    (e.g., email address)?
    [email protected]
  • Is there an erratum? If so, please provide a link or other access point.
    No...
  • Will the dataset be updated (e.g., to correct labeling errors, add
    new instances, delete instances)? If so, please describe how often, by
    whom, and how updates will be communicated to users (e.g., mailing list,
    GitHub)?
    Yes, it will be updated as Phil and I work on the project. There are no other users so updates won't be communicated, although they'll be documented on GitHub as the project has been up to this point.
  • If the dataset relates to people, are there applicable limits on the
    retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a
    fixed period of time and then deleted)? If so, please describe these
    limits and explain how they will be enforced.
    No.
  • If others want to extend/augment/build on/contribute to the
    dataset, is there a mechanism for them to do so? If so, please
    provide a description. Will these contributions be validated/verified?
    If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please
    provide a description.
    Well, there is no process for this. If you contact me I will be happy to check out your proposed contributions.


jwzimmer-zz commented Jun 28, 2021

This is still relevant and should be re-visited. What we wrote above was mainly thinking of the tv tropes data, not the character space data, so we should re-visit this issue in that context.
