Skip to content

dataset: add MECO datasets #1054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

dataset: add MECO datasets #1054

wants to merge 15 commits into from

Conversation

SiQube
Copy link
Member

@SiQube SiQube commented Mar 22, 2025

add meco datasets, there are some precomputed_reading_measures missing but we can add those later

resolves #947

Type of change

  • New feature (non-breaking change which adds functionality)
  • New dataset

✨ Enhancements

  • add feature to read precomputed event R files

📀 Datasets

  • add MECO first wave native reader (MECOL1W1)
  • add MECO first wave second language reader (MECOL2W1)
  • add MECO second wave second language reader (MECOL2W2)

How Has This Been Tested?

  • passing tests/unit/datasets/datasets_test.py::test_public_dataset_registered for MECOL1W1
  • passing tests/unit/datasets/datasets_test.py::test_public_dataset_registered for MECOL2W1
  • passing tests/unit/datasets/datasets_test.py::test_public_dataset_registered for MECOL2W2
  • passing adjusted test for loading precomputed event files tests/unit/dataset/dataset_files_test.py::test_load_precomputed_file_unsupported_file_format
  • add and passing test for tests/unit/dataset/dataset_files_test.py::test_load_precomputed_file_rda_raise_value_error, which fails because the 'r_dataframe_key': 'joint.fix' is not defined.
  • load precomputed events for R file tests/unit/dataset/dataset_files_test.py::test_load_precomputed_file_rda

Copy link

codecov bot commented Mar 22, 2025

Codecov Report

Attention: Patch coverage is 93.15068% with 5 lines in your changes missing coverage. Please review.

Project coverage is 99.87%. Comparing base (8768ff9) to head (6eab3d2).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/pymovements/dataset/dataset_files.py 73.68% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main    #1054      +/-   ##
===========================================
- Coverage   100.00%   99.87%   -0.13%     
===========================================
  Files           87       91       +4     
  Lines         3818     3908      +90     
  Branches       679      683       +4     
===========================================
+ Hits          3818     3903      +85     
- Misses           0        4       +4     
- Partials         0        1       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@SiQube
Copy link
Member Author

SiQube commented Mar 22, 2025

@dkrako don't know what is going on here...maybe you have an idea?

@SiQube
Copy link
Member Author

SiQube commented Mar 24, 2025

possibly no wheels for pyreadr on windows and python 3.9?

@dkrako
Copy link
Contributor

dkrako commented Mar 25, 2025

possibly no wheels for pyreadr on windows and python 3.9?

bummer. how do we want to proceed?

Copy link
Contributor

@dkrako dkrako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The descirption and the citations of the datasets are missing. Please add the python class implementation of the dataset definitions, as the datasets won't be visible in the documentation otherwise.

@dkrako
Copy link
Contributor

dkrako commented Mar 25, 2025

This error is very strange. You specified the pyreadr dependency as:

"pyreadr>=0.5.2,<0.6",

There is a windows wheel for v0.5.2: https://pypi.org/project/pyreadr/0.5.2/#files
but there is none for v0.5.3: https://pypi.org/project/pyreadr/0.5.3/#files

Why does pip try to build v0.5.3 instead of downloading v0.5.2?

Maybe there's a command-line option for pip to resolve this in the github workflow?

@dkrako
Copy link
Contributor

dkrako commented Mar 25, 2025

Found it: --prefer-binary

Maybe it has to be combined with --no-binary :pymovements: to make sure that pymovements is build from source.

Here's the line to update in the github workflow:

run: tox -vv --notest -e ${{ matrix.tox_env }}

@dkrako dkrako marked this pull request as draft March 26, 2025 13:35
@dkrako dkrako changed the title add meco data dataset: add MECO datasets Mar 26, 2025
@dkrako dkrako added the dataset label Mar 26, 2025
@SiQube SiQube force-pushed the dataset-mecol1w1 branch 2 times, most recently from 0367c54 to a735986 Compare March 30, 2025 06:35
@SiQube SiQube force-pushed the dataset-mecol1w1 branch from a735986 to 0c6d876 Compare March 30, 2025 07:07
@SiQube SiQube marked this pull request as ready for review March 30, 2025 07:15
@SiQube
Copy link
Member Author

SiQube commented Mar 30, 2025

this seems to work now. I'm pretty unsure about the line I've added to tox.ini. @dkrako can you quadruple check it? I'll add the py files later

@dkrako
Copy link
Contributor

dkrako commented Mar 30, 2025

this seems to work now. I'm pretty unsure about the line I've added to tox.ini. @dkrako can you quadruple check it? I'll add the py files later

Ah great! even better to include it in tox.ini than in the github workflow.

@SiQube SiQube enabled auto-merge (squash) April 17, 2025 03:58
@SiQube
Copy link
Member Author

SiQube commented Apr 17, 2025

@dkrako in this PR we have two contributions, one are the three datasets. second is the logic to read R files, e.g. .rda

@SiQube SiQube disabled auto-merge April 28, 2025 06:38
Copy link
Contributor

@dkrako dkrako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a full documentation on the new functionality.

@SiQube
Copy link
Member Author

SiQube commented Apr 28, 2025

Please review the rest, I want to merge it at some point, and not account for 200 prs that happened in the meantime

@dkrako
Copy link
Contributor

dkrako commented Apr 28, 2025

Alright, then first please incorporate the changes from:

@SiQube SiQube force-pushed the dataset-mecol1w1 branch from 5a6ba35 to 80e51db Compare April 28, 2025 14:30
@SiQube
Copy link
Member Author

SiQube commented Apr 28, 2025

done, forced push to retrigger readthedocs which failed

@SiQube
Copy link
Member Author

SiQube commented Apr 28, 2025

added changes as well

@SiQube SiQube requested a review from dkrako April 28, 2025 15:31
@SiQube SiQube force-pushed the dataset-mecol1w1 branch from d51de2a to 6eab3d2 Compare May 6, 2025 13:09
precomputed_reading_measure_df = pl.read_csv(data_path, **custom_read_kwargs)
elif data_path.suffix in r_extensions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please document this functionality in Dataset.load_precomputed_events()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add it after #1099 is merged due to merge conflicts

precomputed_event_df = pl.read_csv(data_path, **custom_read_kwargs)
elif data_path.suffix in r_extensions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please document this functionality in Dataset.load_precomputed_reading_measures()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add it after #1099 is merged due to merge conflicts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this file is a dataset related test file this should be visible in the filename.

please also make sure to not use any data from the datasets but fill it with own data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add meco dataset
2 participants