Flights 3m only has 200k rows #607

Open
domoritz opened this issue Sep 19, 2024 · 1 comment
Comments

Member

domoritz commented Sep 19, 2024

https://github.com/vega/vega-datasets/blob/main/data/flights-3m.csv appears to contain only about 200k rows rather than 3 million:

wc -l flights-3m.csv
  231084 flights-3m.csv
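
Note that `wc -l` counts every line, including the CSV header, so the number of data rows is one less than the figure above. A minimal Python sketch of the same check that skips the header; the helper name `count_data_rows` and the in-memory sample are made up for illustration:

```python
import csv
import io


def count_data_rows(csv_file):
    """Count the rows of an open CSV file, not counting the header line."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header; tolerate an empty file
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for the real file (hypothetical columns and values).
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
print(count_data_rows(sample))  # 2

# For the real file, assuming a local copy named flights-3m.csv:
# with open("flights-3m.csv", newline="") as f:
#     print(count_data_rows(f))
```

Using `csv.reader` rather than counting newlines also keeps the count right if a quoted field ever contains a line break.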

Added in 1e70098 by @arvind

Contributor

dsmedia commented Sep 20, 2024

Looks like the count in flights_200k may also be off.

from vega_datasets import data

datasets = ['flights_2k', 'flights_5k', 'flights_10k', 'flights_20k', 'flights_200k', 'flights_3m']

# Load each flights dataset and report how many rows it contains.
for dataset_name in datasets:
    dataset = getattr(data, dataset_name)()
    row_count = len(dataset)
    print(f"{dataset_name}: {row_count} rows")

Results:

flights_2k: 2000 rows
flights_5k: 5000 rows
flights_10k: 10000 rows
flights_20k: 20000 rows
flights_200k: 231083 rows
flights_3m: 231083 rows
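
To make the mismatch explicit, here is a rough sketch that compares the counts reported above against the size implied by each dataset name. It assumes the `k`/`m` suffix is meant to be the exact row count, which may not be the maintainers' intent for `flights_200k`; `expected_rows` is a hypothetical helper, not part of `vega_datasets`.

```python
import re

# Multipliers for the size suffix in names such as flights_200k or flights_3m.
_SUFFIX = {"k": 1_000, "m": 1_000_000}


def expected_rows(dataset_name):
    """Parse the nominal row count from the trailing size in a dataset name."""
    match = re.search(r"(\d+)([km])$", dataset_name)
    if match is None:
        raise ValueError(f"no size suffix in {dataset_name!r}")
    return int(match.group(1)) * _SUFFIX[match.group(2)]


# Row counts copied from the results above.
reported = {
    "flights_2k": 2000,
    "flights_5k": 5000,
    "flights_10k": 10000,
    "flights_20k": 20000,
    "flights_200k": 231083,
    "flights_3m": 231083,
}

# Names whose actual count differs from the size implied by the name.
mismatches = [name for name, actual in reported.items()
              if actual != expected_rows(name)]

for name in mismatches:
    print(f"{name}: {reported[name]} rows, name implies {expected_rows(name)}")
```

Under that assumption only `flights_200k` and `flights_3m` are flagged, and both have the same count, which suggests the two files may hold the same data.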

Should we regenerate the 3m rows using this script, create a CSV from the 3m parquet file here, or do something else?
