Skip to content

Dataset info #1057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Dataset info #1057

wants to merge 2 commits into from

Conversation

SiQube
Copy link
Member

@SiQube SiQube commented Mar 25, 2025

add information about the dataset as a property.

additionally, resolves #987.

a final version of info property could be integrated into the download process, to make the user aware of the underlying information and citation.

@SiQube SiQube requested a review from saeub March 25, 2025 20:12
Copy link

codecov bot commented Mar 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (35f6b88) to head (5375871).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main     #1057   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           80        84    +4     
  Lines         3602      3679   +77     
  Branches       646       646           
=========================================
+ Hits          3602      3679   +77     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

def info(self) -> None:
"""The information about the dataset.

Print dataset information and citation key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the property should return a string instead of printing it. A user can easily call print(dataset.info) if necessary

@dkrako
Copy link
Contributor

dkrako commented Mar 25, 2025

Great! Let's discuss this PR tomorrow.

I guess some of the formatting should be edited into a more human-readable format.

Currently the publications are cited in two non-human-readable formats: first as the sphinx-citation and then a bibtex citation.

Furthermore, I would like to split the info field into info and citation, as I think printing the whole info on downloading would be just too much. Instead, a much shorter download disclaimer could be something like:

You are downloading the BSC dataset. Please be aware that pymovements does not
host or distribute any dataset resources and only provides a convenient interface to
download the public dataset resources that were published by their respective authors.

If you intend to use the dataset in your research, please cite the publication as:

Jinger Pan, Ming Yan, Eike M. Richter, Hua Shu, and Reinhold Kliegl. The Beijing Sentence Corpus: a Chinese sentence corpus with eye movement data and predictability norms. Behavior Research Methods, 2022.

This way we can simply store the main part of the disclaimer somewhere else and just fill in the name and the citation of the dataset. I would be in favor of a human-readable citation format instead of bibtex, because something like this is just hell to parse visually:

    @inproceedings{CopCoL1Hollenstein,
      title = "The {C}openhagen Corpus of Eye Tracking Recordings from
      Natural Reading of {D}anish Texts",
      author = {Hollenstein, Nora  and
        Barrett, Maria  and
        Bj{\\"o}rnsd{\\'o}ttir, Marina},
      editor = "Calzolari, Nicoletta  and
        B{\\'e}chet, Fr{\\'e}d{\\'e}ric  and
        Blache, Philippe  and
        Choukri, Khalid  and
        Cieri, Christopher  and
        Declerck, Thierry  and
        Goggi, Sara  and
        Isahara, Hitoshi  and
        Maegaard, Bente  and
        Mariani, Joseph  and
        Mazo, H{\\'e}l{\\`e}ne  and
        Odijk, Jan  and
        Piperidis, Stelios",
      booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
      month = jun,
      year = "2022",
      address = "Marseille, France",
      publisher = "European Language Resources Association",
      url = "https://aclanthology.org/2022.lrec-1.182",
      pages = "1712--1720",
    }

Moreover, not everyone uses bibtex and we don't want to force our citation preferences on users. If you want to keep the bibtex in someway, then please add a bibtex field. We will then have to make sure that the bibliography.bib and the definition bibtex are consistent, but this could be handled in a followup, where we autopopulate the bibliography.bib with the bibtex field of each dataset definition (which would be great as everything would be specified neatly in each definition file). (maybe we could even autogenerate the citation field from the bibtex field in the future) (edit: we could also add something like use Dataset.definition.bibtex to get the citation in bibtex format at the end of the disclaimer message)

Of course we could additionally link to our documentation page in the disclaimer, but I don't know if this is necessary.

Planning further ahead, the info field could then be used to autogenerate dataset pages and should include the sphinx citation. EDIT: I reconsidered and now I think the sphinx citation should be completely left out of the info string. It can be easily added to the dataset docpages if necessary via string formatting.

@dkrako
Copy link
Contributor

dkrako commented Mar 25, 2025

Maybe instead of info it would be better to call the field description, as that name indicates a text description of the dataset, while info is more ambiguous about its content (as all data is information).

We already have the fileinfo field in Dataset and info is unrelated so I'm afraid the naming could be even more confusing.

I like info as it's nice and short, but the ambiguities weigh more I guess.

@@ -47,6 +47,9 @@ class DatasetDefinition:
----------
name: str
The name of the dataset. (default: '.')
info: str
Information about the dataset including but not limited to original citation,
general information. (default: '.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is an empty string isn't it?

@@ -105,6 +109,31 @@ class BSCII(DatasetDefinition):

name: str = 'BSCII'

info: str = """\
BSCII dataset :cite:p:`BSCII`.
Copy link
Contributor

@dkrako dkrako Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor to remove the first line from all of the description strings, as the name of the dataset is already known to the user and the sphinx cite directive is not very useful when calling the property.

Moreover, if we use the string as a basis for autogenerating dataset docpages, the first line can be easily recreated by something like f'{dataset.name} dataset :cite:p:`{_get_bibtex_id(dataset.bibtex)}`' (within the autogenerator script, and not included in the description string or any definition file)

Nevertheless, one thing that we could add to the description is the verbose name of the dataset.
For example instead of writing:

This dataset includes monocular eye tracking data from several ...

It would be nicer to write:

The Beijing Sentence Corpus II (BSCII) includes monocular eye tracking data from several ...

@dkrako dkrako marked this pull request as draft March 26, 2025 13:36
@dkrako
Copy link
Contributor

dkrako commented Apr 1, 2025

regarding the citation: str field: in case a dataset has multiple citations, just add a line break to the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix erroneous docstrings in dataset definitions
2 participants