Dataset info #1057

SiQube · 2025-03-25T20:12:11Z

Add information about the dataset as a property.

Additionally, resolves #987.

A final version of the info property could be integrated into the download process to make the user aware of the underlying information and citation.

codecov · 2025-03-25T20:15:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (35f6b88) to head (5375871).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main     #1057   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           80        84    +4     
  Lines         3602      3679   +77     
  Branches       646       646           
=========================================
+ Hits          3602      3679   +77

dkrako · 2025-03-25T20:20:33Z

src/pymovements/dataset/dataset.py

+    def info(self) -> None:
+        """The information about the dataset.
+
+        Print dataset information and citation key.


The property should return a string instead of printing it. A user can easily call print(dataset.info) if necessary.
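A minimal sketch of what a string-returning property could look like. The two classes here are hypothetical, stripped-down stand-ins and not the actual pymovements classes; only the `info` name and the `print(dataset.info)` call come from this thread.

```python
from dataclasses import dataclass


@dataclass
class DatasetDefinition:
    """Hypothetical stand-in for the real definition class."""

    name: str = '.'
    info: str = ''


class Dataset:
    """Hypothetical stand-in for the real dataset class."""

    def __init__(self, definition: DatasetDefinition) -> None:
        self.definition = definition

    @property
    def info(self) -> str:
        """Return the dataset information instead of printing it."""
        return self.definition.info


dataset = Dataset(DatasetDefinition(name='BSC', info='Example description.'))
print(dataset.info)  # the caller decides whether and where to print
```

Returning the string keeps the property free of side effects, so it can also be logged, tested, or reused when generating documentation.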

dkrako · 2025-03-25T21:06:35Z

Great! Let's discuss this PR tomorrow.

I guess some of the formatting should be edited into a more human-readable format.

Currently the publications are cited in two non-human-readable formats: first as the sphinx-citation and then a bibtex citation.

Furthermore, I would like to split the info field into info and citation, as I think printing the whole info on downloading would be just too much. Instead, a much shorter download disclaimer could be something like:

You are downloading the BSC dataset. Please be aware that pymovements does not
host or distribute any dataset resources and only provides a convenient interface to
download the public dataset resources that were published by their respective authors.

If you intend to use the dataset in your research, please cite the publication as:

Jinger Pan, Ming Yan, Eike M. Richter, Hua Shu, and Reinhold Kliegl. The Beijing Sentence Corpus: a Chinese sentence corpus with eye movement data and predictability norms. Behavior Research Methods, 2022.

This way we can simply store the main part of the disclaimer somewhere else and just fill in the name and the citation of the dataset. I would be in favor of a human-readable citation format instead of bibtex, because something like this is just hell to parse visually:

    @inproceedings{CopCoL1Hollenstein,
      title = "The {C}openhagen Corpus of Eye Tracking Recordings from
      Natural Reading of {D}anish Texts",
      author = {Hollenstein, Nora  and
        Barrett, Maria  and
        Bj{\\"o}rnsd{\\'o}ttir, Marina},
      editor = "Calzolari, Nicoletta  and
        B{\\'e}chet, Fr{\\'e}d{\\'e}ric  and
        Blache, Philippe  and
        Choukri, Khalid  and
        Cieri, Christopher  and
        Declerck, Thierry  and
        Goggi, Sara  and
        Isahara, Hitoshi  and
        Maegaard, Bente  and
        Mariani, Joseph  and
        Mazo, H{\\'e}l{\\`e}ne  and
        Odijk, Jan  and
        Piperidis, Stelios",
      booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
      month = jun,
      year = "2022",
      address = "Marseille, France",
      publisher = "European Language Resources Association",
      url = "https://aclanthology.org/2022.lrec-1.182",
      pages = "1712--1720",
    }

Moreover, not everyone uses bibtex and we don't want to force our citation preferences on users. If you want to keep the bibtex in some way, then please add a bibtex field. We will then have to make sure that bibliography.bib and the bibtex in the definition are consistent, but this could be handled in a follow-up, where we auto-populate bibliography.bib with the bibtex field of each dataset definition (which would be great, as everything would be specified neatly in each definition file). Maybe we could even autogenerate the citation field from the bibtex field in the future. (Edit: we could also add something like "use Dataset.definition.bibtex to get the citation in bibtex format" at the end of the disclaimer message.)
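The idea of storing the disclaimer once and only filling in the name and citation might look roughly like this. It is only a sketch: `DOWNLOAD_DISCLAIMER` and `format_disclaimer` are hypothetical names, and the wording is copied from the draft disclaimer above.

```python
# Hypothetical template constant; the text follows the disclaimer drafted above.
DOWNLOAD_DISCLAIMER = (
    'You are downloading the {name} dataset. Please be aware that pymovements does not\n'
    'host or distribute any dataset resources and only provides a convenient interface to\n'
    'download the public dataset resources that were published by their respective authors.\n'
    '\n'
    'If you intend to use the dataset in your research, please cite the publication as:\n'
    '\n'
    '{citation}'
)


def format_disclaimer(name: str, citation: str) -> str:
    """Fill the dataset name and the human-readable citation into the template."""
    return DOWNLOAD_DISCLAIMER.format(name=name, citation=citation)


message = format_disclaimer(
    name='BSC',
    citation=(
        'Jinger Pan, Ming Yan, Eike M. Richter, Hua Shu, and Reinhold Kliegl. '
        'The Beijing Sentence Corpus: a Chinese sentence corpus with eye movement data '
        'and predictability norms. Behavior Research Methods, 2022.'
    ),
)
print(message)
```

With this split, each definition file would only need to carry its own name and citation string, while the shared wording lives in one place.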

Of course we could additionally link to our documentation page in the disclaimer, but I don't know if this is necessary.

Planning further ahead, the info field could then be used to autogenerate dataset pages and should include the sphinx citation. EDIT: I reconsidered and now I think the sphinx citation should be completely left out of the info string. It can be easily added to the dataset docpages if necessary via string formatting.

dkrako · 2025-03-25T22:58:34Z

Maybe instead of info it would be better to call the field description, as that name indicates a text description of the dataset, while info is more ambiguous about its content (as all data is information).

We already have the fileinfo field in Dataset and info is unrelated so I'm afraid the naming could be even more confusing.

I like info as it's nice and short, but the ambiguities weigh more I guess.

dkrako · 2025-03-25T22:59:49Z

src/pymovements/dataset/dataset_definition.py

@@ -47,6 +47,9 @@ class DatasetDefinition:
    ----------
    name: str
        The name of the dataset. (default: '.')
+    info: str
+        Information about the dataset including but not limited to original citation,
+        general information. (default: '.')


The default is an empty string, isn't it?

dkrako · 2025-03-25T23:12:01Z

src/pymovements/datasets/bsc2.py

@@ -105,6 +109,31 @@ class BSCII(DatasetDefinition):

    name: str = 'BSCII'

+    info: str = """\
+BSCII dataset :cite:p:`BSCII`.


I would be in favor of removing the first line from all of the description strings, as the name of the dataset is already known to the user and the sphinx cite directive is not very useful when calling the property.

Moreover, if we use the string as a basis for autogenerating dataset docpages, the first line can be easily recreated by something like f'{dataset.name} dataset :cite:p:`{_get_bibtex_id(dataset.bibtex)}`' (within the autogenerator script, and not included in the description string or any definition file)

Nevertheless, one thing that we could add to the description is the verbose name of the dataset.
For example instead of writing:

This dataset includes monocular eye tracking data from several ...

It would be nicer to write:

The Beijing Sentence Corpus II (BSCII) includes monocular eye tracking data from several ...

dkrako · 2025-04-01T07:41:52Z

Regarding the `citation: str` field: in case a dataset has multiple citations, just add a line break to the string.

SiQube added 2 commits March 25, 2025 20:59

feat: add info to public datasets

69624a5

fix erroneous docstrings in dataset definitions (#987)

5375871

SiQube requested review from dkrako and prassepaul as code owners March 25, 2025 20:12

SiQube requested a review from saeub March 25, 2025 20:12

dkrako reviewed Mar 25, 2025

dkrako marked this pull request as draft March 26, 2025 13:36

dkrako mentioned this pull request Apr 1, 2025

add disclaimer when downloading a dataset #1075

Open

8 tasks

SiQube mentioned this pull request Apr 16, 2025

feat: print disclaimer when downloading a public dataset #1097

Merged

1 task
