Where do animint2's datasets come from, and where are their codebooks? #100

Open
ampurr opened this issue Jul 2, 2023 · 17 comments

@ampurr
Contributor

ampurr commented Jul 2, 2023

This is vaguely related to issue #97. I'm trying to generate a very simple example for the basic usage section and decided to use a default dataset. I noticed that animint2 contains a lot of datasets—33 by my count. Some of them come from ggplot2. But I'm not sure where the rest are from.

Where are they from? (For example, where is the WorldBank dataset from?) And where can I find their corresponding codebooks?

As always, no rush in responding. Thanks in advance. 🐈

@tdhock
Collaborator

tdhock commented Jul 3, 2023

Where do the data sets come from? It should be documented on each data set's man page, under "Source"; otherwise, I don't know.
Codebooks? I don't know what you mean, but maybe I could help create one if you clarify?

@ampurr
Contributor Author

ampurr commented Jul 3, 2023

Got it. I've spotted the "Source" subsection in the manual—thanks! :>

You've probably written codebooks before. That word's just jargon for metadata about the datasets. Codebooks usually describe the dataset's variables and how the data were collected. They're great for reproducibility, since variable names themselves are usually insufficient for describing the data.

The diamonds dataset has one. WorldBank and montreal.bikes don't. After some of the other website stuff is set up, I'd be down for writing codebooks together. It'd have to be together for at least some of the datasets, since I don't know the data for e.g. montreal.bikes and you do.

Or you could just write them yourself. Up to you, obviously. I'm not your boss. 🐈🐈🐈

EDIT: Corrected lots of typos.

@ampurr
Contributor Author

ampurr commented Jul 3, 2023

I looked it up. "Codebook" is social sciences jargon. Sorry about that! I didn't realize the term wasn't universal in science.

@tdhock
Collaborator

tdhock commented Jul 17, 2023

Sure, please open a PR with some edits to the man pages, and put a TODO wherever you think I should add some info.

@ampurr
Contributor Author

ampurr commented Jul 17, 2023

Sure thing. :>

@ampurr
Contributor Author

ampurr commented Jul 27, 2023

Status update: At least one of the datasets has its source in the comments, which hopefully means that's the case for all of them. The dataset is animint2/data-raw/economics.R, and the source can be found here.

Note to self: Datasets can be found in animint2/data-raw.

@tdhock
Collaborator

tdhock commented Aug 7, 2023

hi again, if this is still an issue, can you please link a PR with the TODOs? Otherwise, can you please close?

@ampurr
Contributor Author

ampurr commented Aug 7, 2023

No problem. I've been preoccupied with the reference website, hence the delay. Unless you want me to prioritize this, I'll do it after I throw the website online. To-do for me:

  • Look through all the datasets and see if they have a source.
  • If they do, continue. If they don't, mark them with a TODO.
  • Look through all the datasets and see if they have a codebook.
  • If they do, continue. If they don't, mark them with a TODO.
  • Open a pull request with the edited man pages.
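
Parts of this checklist could be automated. The following is a minimal sketch, not the procedure actually used for this issue. It assumes that a man page counts as having a source when it contains a `\source{` section and as having a codebook when it contains a `\format{` section. Both criteria, and the names `audit_rd_text` and `audit_man_dir`, are hypothetical.

```python
import re
from pathlib import Path

# Assumption: these two Rd sections stand in for "has a source" and
# "has a codebook". A human still has to judge whether the text is adequate.
SOURCE_RE = re.compile(r"^\\source\{", re.MULTILINE)
CODEBOOK_RE = re.compile(r"^\\format\{", re.MULTILINE)


def audit_rd_text(text):
    """Report which of the two sections one Rd document contains."""
    return {
        "source": bool(SOURCE_RE.search(text)),
        "codebook": bool(CODEBOOK_RE.search(text)),
    }


def audit_man_dir(man_dir):
    """Audit every .Rd file in man_dir, keyed by file name without extension."""
    results = {}
    for path in sorted(Path(man_dir).glob("*.Rd")):
        results[path.stem] = audit_rd_text(path.read_text(encoding="utf-8"))
    return results
```

Calling `audit_man_dir("man")` from the package root would return one entry per `.Rd` file, including the pages for functions, so the result would still need to be filtered down to the datasets before any TODOs are added.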

@ampurr
Contributor Author

ampurr commented Aug 10, 2023

Okay, website has been thrown online. Do this now, @ampurr. 🐈

@ampurr
Contributor Author

ampurr commented Aug 10, 2023

Everything checked has a source attached (and therefore I won't need to attach a TODO to it):

  • breakpoints
  • change
  • ChromHMMiterations
  • climate
  • compare
  • diamonds (ggplot2)
  • economics (ggplot2)
  • economics_long (ggplot2)
  • faithfuld (ggplot2)
  • FluView
  • FunctionalPruning
  • generation.loci
  • intreg
  • luv_colours (ggplot2)
  • malaria
  • midwest (ggplot2)
  • mixtureKNN
  • montreal.bikes
  • mpg (ggplot2)
  • msleep (ggplot2)
  • PeakConsistency
  • pirates
  • presidential (ggplot2)
  • prior
  • prostateLasso
  • seals
  • TestROC
  • txhousing (ggplot2)
  • UStornadoes
  • VariantModels
  • vervet
  • WorldBank (is "copied from" sufficient?)
  • worldPop

@ampurr
Contributor Author

ampurr commented Aug 10, 2023

Everything checked has a codebook attached (and therefore I won't need to attach a codebook TODO to it):

  • breakpoints
  • change
  • ChromHMMiterations
  • climate
  • compare
  • diamonds
  • economics
  • economics_long
  • faithfuld
  • FluView
  • FunctionalPruning
  • generation.loci
  • intreg
  • luv_colours
  • malaria
  • midwest
  • mixtureKNN
  • montreal.bikes
  • mpg
  • msleep
  • PeakConsistency
  • pirates
  • presidential
  • prior
  • prostateLasso
  • seals
  • TestROC
  • txhousing
  • UStornadoes
  • VariantModels
  • vervet
  • WorldBank
  • worldPop

@ampurr
Contributor Author

ampurr commented Aug 10, 2023

Note to self: Not all .Rd files are generated by roxygen2. Some files were manually created.
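
The distinction matters because edits made directly to a generated .Rd file are overwritten the next time roxygen2 regenerates the documentation, so those pages have to be changed in the R source instead. Generated files start with a comment line beginning `% Generated by roxygen2`, which makes them easy to tell apart. A minimal sketch of that check follows; the function name `is_roxygen_generated` is hypothetical.

```python
ROXYGEN_MARKER = "% Generated by roxygen2"


def is_roxygen_generated(rd_text):
    """True if the first non-blank line carries the roxygen2 marker comment."""
    for line in rd_text.splitlines():
        if line.strip():
            # Only the first non-blank line is inspected.
            return line.startswith(ROXYGEN_MARKER)
    return False  # empty file: treat as manually created
```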

@ampurr
Contributor Author

ampurr commented Aug 11, 2023

Adding TODOs—a progress report:

  • ChromHMMiterations (edited .Rd file)
  • climate (edited .Rd file)
  • compare (edited .Rd file)
  • FluView (edited .Rd file)
  • FunctionalPruning (edited .Rd file)
  • generation.loci (edited .Rd file)
  • intreg (edited .Rd file)
  • malaria (edited .Rd file)
  • mixtureKNN (edited .Rd file)
  • montreal.bikes (edited .Rd file)
  • PeakConsistency (edited .Rd file)
  • pirates (edited .Rd file)
  • presidential (edited .Rd file)
  • prior (edited .Rd file)
  • prostateLasso (edited .Rd file)
  • seals (edited .Rd file)
  • TestROC (edited .Rd file)
  • UStornadoes (edited .Rd file)
  • VariantModels (edited .Rd file)
  • vervet (edited .Rd file)
  • WorldBank (edited .Rd file)

@tdhock
Collaborator

tdhock commented Aug 15, 2023

Thanks, this is useful. I will look at that PR and edit when I get a chance.

@ampurr
Contributor Author

ampurr commented Aug 15, 2023

Thank you! No rush. :>

@tdhock
Collaborator

tdhock commented Oct 27, 2023

Hi @ampurr, for another project I have SAS codebooks defined as below:

if K2Q01_D in (1,2) then TeethCond_21 = 1;
if K2Q01_D = 3 then TeethCond_21 = 2;
if K2Q01_D in (4,5) then TeethCond_21 = 3;
if K2Q01_D = .M then TeethCond_21 = .M;
if K2Q01_D = 6 then TeethCond_21 = .L;
if SC_AGE_YEARS

do you know if there is any existing package to parse such sas codebook data into R? I did a web search but did not find anything obvious.

@ampurr
Contributor Author

ampurr commented Oct 28, 2023

Hey, @tdhock. :>

Unfortunately, my department never used SAS, so I don't have any special insight into your problem. Looking it up...

If you just need to parse the output of a SAS program into something R can read, the haven package has a read_sas() function.

The SASmarkdown package will let you use SAS code with R Markdown.

A possible wacky chain solution:

  1. The SASPy Python package says that it lets you "exchange values between python variables and SAS macro variables," which seems promising.
  2. The reticulate R package lets you translate between R and Python objects.
  3. You might be able to use these two packages in conjunction.
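
If the codebook consists only of simple recode statements like the ones quoted above, a small hand-written parser might be enough, with no SAS installation involved. The following is a rough sketch under that assumption, not an existing package. The regular expression and the name `parse_rules` are hypothetical, and anything more complex than `if VAR = value` or `if VAR in (list)` is silently skipped. Values are kept as strings, so SAS special missing values such as `.M` and `.L` pass through unchanged.

```python
import re

# Matches: [else] if VAR in (a,b) then TARGET = value;
#      or: [else] if VAR = a then TARGET = value;
RULE_RE = re.compile(
    r"""^\s*(?:else\s+)?if\s+(?P<var>\w+)\s*
        (?:in\s*\((?P<many>[^)]*)\)|=\s*(?P<one>\S+))\s+
        then\s+(?P<target>\w+)\s*=\s*(?P<value>[^;\s]+)\s*;""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_rules(sas_text):
    """Return (input variable, input value, output variable, output value) tuples."""
    rules = []
    for line in sas_text.splitlines():
        m = RULE_RE.match(line)
        if m is None:
            continue  # not a simple recode statement: skip it
        if m["many"] is not None:
            inputs = [v.strip() for v in m["many"].split(",")]
        else:
            inputs = [m["one"]]
        for v in inputs:
            rules.append((m["var"], v, m["target"], m["value"]))
    return rules
```

The resulting tuples could be written to a CSV file and read into R as an ordinary lookup table, or passed across directly with reticulate.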

Hope this helps. Good luck with your project. 🐈

2 participants