Creating final de-identified datasets #11
One question: Is it important that the de-identified string be identical for identical raw string values? For example, if "cats" appears 12 times, should all 12 strings post-anonymization be identical, e.g., "sdlfijosd98fs"? If so, we should consider getting hash values, maybe via the digest package.
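For what it's worth, a minimal sketch of deterministic hashing with digest (the raw values below are just toy examples):

```r
library(digest)

# Identical raw strings always map to the identical hash, so every
# instance of "cats" gets the same post-anonymization value.
raw <- c("cats", "dogs", "cats", "cats")
hashed <- vapply(raw, digest, character(1), algo = "sha256")

stopifnot(hashed[1] == hashed[3], hashed[1] == hashed[4])
```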
Thanks, @ChrisMuir! I was thinking about this particular point and hashing. Yes, we do want to map each specific string to a particular value. Basically, it allows us to achieve our purpose (not making it too easy for people to look up specific people) without losing much info. On the point about losing info: names are pretty useful for imputing gender and ethnicity, and we probably want to enrich the data a bit (impute race and gender using Lincoln Mullen's gender package + my ethnicolr package) to make up for losing this info. Do you think that's a reasonable way to go? I worry just a bit about having names of people, but perhaps we should just go with it. What are your thoughts?
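To illustrate the enrichment idea, imputing gender from first names with the gender package might look like the sketch below (the first names are toy inputs, and method = "ssa" needs the genderdata package installed; ethnicolr is a Python package, so the race/ethnicity step would run outside R):

```r
library(gender)

# Toy first names; in practice these would be parsed from the raw
# name field before it is hashed away.
first_names <- c("Christopher", "Lincoln", "Gaurav")
imputed <- gender(first_names, years = c(1950, 2000), method = "ssa")
imputed[, c("name", "gender", "proportion_male")]
```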
I feel like removing proper names is a good idea. Even though all of this data is public, it feels weird to leave in people's names and make it all available in one central source. If we want to hash the names prior to release, I assume we shouldn't impute gender and race prior to hashing and add them as two new variables? As in, that's not a good option, probably because it would be seen as not very transparent... is this correct? The more I think about it, the more I'm leaning towards hashing the names, though I'm not an expert in data science ethics.
We are on the same page. In the final 'clean' data we package in R, we won't have actual names. As is the norm, we will have two packages: one data package (downloadable from GitHub) and one that provides the API. Proposed order for starting on our effort:
The time has arrived to build the first draft of the final data.frames + dictionary that we will include in the R data package. It makes sense to pick the low-hanging fruit first, so let's start with California: it has the twin virtues of being relatively clean and big.
For CA, write a script that (see the sketch after this list):
a. Replaces each name with a random 10-character string
b. Does data integrity checks and flags or fixes issues as needed
c. Unzips and rbinds the years and tiers of government, adding useful information, such as the level of government or the year the data are from, where it is missing
d. Final outcome = tidy data
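A rough sketch of how a-d might fit together, assuming hypothetical file paths, a hypothetical filename convention, and assumed column names, and using a truncated hash for the 10-character string so identical names still map to identical strings, per the discussion above:

```r
library(digest)

# c. Unzip and stack the years/tiers, tagging each file with metadata
#    parsed from an assumed filename convention like "ca_state_2015.csv".
unzip("ca_raw.zip", exdir = "ca_raw")
files <- list.files("ca_raw", pattern = "\\.csv$", full.names = TRUE)

read_one <- function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  parts <- strsplit(basename(f), "[_.]")[[1]]
  df$level <- parts[2]                 # level of government
  df$year  <- as.integer(parts[3])     # year the data are from
  df
}
ca <- do.call(rbind, lapply(files, read_one))

# a. Replace each name with a deterministic 10-character string
#    (a truncated SHA-256 hash, so repeats of a name stay linkable).
ca$name <- substr(vapply(ca$name, digest, character(1), algo = "sha256"), 1, 10)

# b. Basic integrity checks: flag problems rather than silently drop rows.
stopifnot(!any(is.na(ca$year)))
if (any(duplicated(ca))) warning("duplicate rows found")

# d. Final outcome: write out the tidy data.
write.csv(ca, "ca_tidy.csv", row.names = FALSE)
```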
After that, write an Rmd that presents some basic summaries of the data and a data dictionary.
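The Rmd's summary chunks could start from something as simple as this (column names carried over from the sketch above):

```r
# Quick summaries for the Rmd: rows per year and level of government,
# plus a per-column overview.
ca <- read.csv("ca_tidy.csv", stringsAsFactors = FALSE)
table(ca$year, ca$level)
summary(ca)
```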
Note: if you think you can improve the description of the issue, please do. And don't let the description keep you from doing sensible things.