Commit
[CHORE] Add tests for parquet size estimations (#3405)
Unskipping the tests produces the failures below; only 2 cases pass:

```
FAILED tests/test_size_estimations.py::test_estimations_unique_strings[dict-snappy] - AssertionError: Expected 8.16MB file estimated vs actual to be within 30.0%: 24.47MB vs 113.89MB (-365.40%)
FAILED tests/test_size_estimations.py::test_estimations_unique_strings[dict-no_compression] - AssertionError: Expected 109.93MB file estimated vs actual to be within 30.0%: 329.80MB vs 113.89MB (65.47%)
FAILED tests/test_size_estimations.py::test_estimations_unique_strings[no_dict-snappy] - AssertionError: Expected 8.14MB file estimated vs actual to be within 30.0%: 24.42MB vs 113.89MB (-366.43%)
FAILED tests/test_size_estimations.py::test_estimations_unique_strings[no_dict-no_compression] - AssertionError: Expected 109.91MB file estimated vs actual to be within 30.0%: 329.74MB vs 113.89MB (65.46%)
FAILED tests/test_size_estimations.py::test_estimations_dup_strings[dict-snappy] - AssertionError: Expected 0.00MB file estimated vs actual to be within 30.0%: 0.00MB vs 108.00MB (-3345625.15%)
FAILED tests/test_size_estimations.py::test_estimations_dup_strings[dict-no_compression] - AssertionError: Expected 0.00MB file estimated vs actual to be within 30.0%: 0.00MB vs 108.00MB (-3082092.01%)
FAILED tests/test_size_estimations.py::test_estimations_dup_strings[no_dict-snappy] - AssertionError: Expected 4.92MB file estimated vs actual to be within 30.0%: 14.76MB vs 108.00MB (-631.79%)
FAILED tests/test_size_estimations.py::test_estimations_dup_strings[no_dict-no_compression] - AssertionError: Expected 104.02MB file estimated vs actual to be within 30.0%: 312.07MB vs 108.00MB (65.39%)
FAILED tests/test_size_estimations.py::test_estimations_unique_ints[dict-snappy] - AssertionError: Expected 4.28MB file estimated vs actual to be within 30.0%: 12.85MB vs 8.00MB (37.72%)
FAILED tests/test_size_estimations.py::test_estimations_unique_ints[dict-no_compression] - AssertionError: Expected 8.28MB file estimated vs actual to be within 30.0%: 24.84MB vs 8.00MB (67.79%)
FAILED tests/test_size_estimations.py::test_estimations_unique_ints[no_dict-snappy] - AssertionError: Expected 4.00MB file estimated vs actual to be within 30.0%: 12.01MB vs 8.00MB (33.40%)
FAILED tests/test_size_estimations.py::test_estimations_unique_ints[no_dict-no_compression] - AssertionError: Expected 8.00MB file estimated vs actual to be within 30.0%: 24.00MB vs 8.00MB (66.67%)
FAILED tests/test_size_estimations.py::test_estimations_dup_ints[dict-snappy] - AssertionError: Expected 0.00MB file estimated vs actual to be within 30.0%: 0.00MB vs 8.00MB (-454962.57%)
FAILED tests/test_size_estimations.py::test_estimations_dup_ints[dict-no_compression] - AssertionError: Expected 0.00MB file estimated vs actual to be within 30.0%: 0.00MB vs 8.00MB (-458090.15%)
FAILED tests/test_size_estimations.py::test_estimations_dup_ints[no_dict-snappy] - AssertionError: Expected 0.38MB file estimated vs actual to be within 30.0%: 1.13MB vs 8.00MB (-607.74%)
FAILED tests/test_size_estimations.py::test_estimations_dup_ints[no_dict-no_compression] - AssertionError: Expected 8.00MB file estimated vs actual to be within 30.0%: 24.00MB vs 8.00MB (66.67%)
FAILED tests/test_size_estimations.py::test_canonical_files_in_hf[smoktalk] - AssertionError: Expected 105.42MB file estimated vs actual to be within 30.0%: 316.26MB vs 213.66MB (32.44%)
FAILED tests/test_size_estimations.py::test_canonical_files_in_hf[cifar10] - AssertionError: Expected 23.94MB file estimated vs actual to be within 30.0%: 71.82MB vs 22.85MB (68.18%)
FAILED tests/test_size_estimations.py::test_canonical_files_in_hf[standfordnlp-imdb] - AssertionError: Expected 42.00MB file estimated vs actual to be within 30.0%: 125.99MB vs 67.31MB (46.58%)
FAILED tests/test_size_estimations.py::test_canonical_files_in_hf[jat-dataset-tokenized] - AssertionError: Expected 7.34MB file estimated vs actual to be within 30.0%: 22.01MB vs 499.57MB (-2169.79%)
```

Follow-on fixes are needed for the worst cases:

1. Duplicate strings/ints with dictionary encoding (`-3345625.15%`)
2. Duplicate strings/ints without dictionary encoding, but with snappy compression (`-631%`)
3. `jat-dataset-tokenized`: a tokenization dataset with many lists-of-ints columns (`-2169%`), likely highly compressed

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
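
For readers interpreting the percentages in the log above, the following is a minimal sketch of how they appear to be computed. The formula is inferred from the reported numbers only: the error looks like the difference between the two sizes divided by the first ("estimated") size. The function names `relative_error_pct` and `within_tolerance` are hypothetical and are not taken from `tests/test_size_estimations.py`.

```python
def relative_error_pct(estimated_mb: float, actual_mb: float) -> float:
    """Signed error relative to the estimate, in percent.

    Assumption: inferred from the log, e.g. (24.00 - 8.00) / 24.00 = 66.67%.
    A large negative value means the estimate is far below the actual size.
    """
    if estimated_mb == 0:
        # Estimates printed as 0.00MB in the log are rounded, not exactly zero;
        # an exact zero would make the ratio undefined.
        raise ValueError("estimated size must be non-zero")
    return (estimated_mb - actual_mb) / estimated_mb * 100.0


def within_tolerance(estimated_mb: float, actual_mb: float,
                     tolerance_pct: float = 30.0) -> bool:
    """True if the absolute relative error is within the tolerance (30% in the log)."""
    return abs(relative_error_pct(estimated_mb, actual_mb)) <= tolerance_pct


if __name__ == "__main__":
    # Size pairs copied from the log: (estimated MB, actual MB).
    for estimated, actual in [(24.00, 8.00), (329.80, 113.89), (14.76, 108.00)]:
        pct = relative_error_pct(estimated, actual)
        print(f"{estimated:.2f}MB vs {actual:.2f}MB: {pct:.2f}% "
              f"(within 30%: {within_tolerance(estimated, actual)})")
```

Because the denominator is the estimate, underestimates produce unbounded negative percentages, which is why the near-zero estimates for dictionary-encoded duplicate data show errors in the millions of percent.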