Update to vega datasets v2 #40

Open · wants to merge 16 commits into base: master
15 changes: 15 additions & 0 deletions CHANGES.md
@@ -3,6 +3,21 @@ Change Log

Release v0.9 (unreleased)
-------------------------
+- Add `football.json`. Thanks to @eitanlees!
+- Add `penguins.json`.
+- Add `seattle-weather-hourly-normals.csv`.
+- Update `weather.csv` and `seattle-weather.csv` with a better-encoded weather condition, indicating more rain. Thanks to @visnup!
+- Update `co2-concentration` data and add a seasonally adjusted CO2 field.
+- Switch to ISO 8601 dates in `seattle-weather.csv`.
+- Rename `weball26.json` to `political-contributions.json`.
+- Convert `birdstrikes.json` to `birdstrikes.csv` and use ISO 8601 dates.
+- Convert `movies.json` to use column names with spaces and ISO 8601 dates.
+- Remove `climate.json`.
+- Replace `seattle-temps.csv` with the more general `seattle-weather-hourly-normals.csv`.
+- Remove `sf-temps.csv`.
+- Remove `graticule.json`. Use the graticule generator instead (see the sketch after this file's diff).
+- Remove `points.json`.
+- Remove `iris.json`. Use `penguins.json` instead.
+- Change URLs to use jsDelivr (a fast CDN) with a fixed version number, instead of GitHub.

Release v0.8 (Dec 14, 2019)
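
The `graticule.json` removal above leans on Vega-Lite's built-in graticule generator. A minimal sketch in Altair (assuming Altair 4+; the `step` spacing and the orthographic projection are illustrative choices, not part of this PR):

```python
import altair as alt

# Generate graticule lines on the fly instead of loading graticule.json.
# step=[15, 15] draws meridians and parallels every 15 degrees.
chart = (
    alt.Chart(alt.graticule(step=[15, 15]))
    .mark_geoshape(stroke="lightgray", filled=False)
    .project("orthographic")
)
chart.save("graticule.html")
```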
35 changes: 17 additions & 18 deletions README.md
@@ -30,31 +30,31 @@ The main object in this library is ``data``:
```

It contains attributes that access all available datasets, locally if
-available. For example, here is the well-known iris dataset:
+available. For example, here is the [Palmer penguins](https://github.com/allisonhorst/palmerpenguins) dataset:

```python
->>> df = data.iris()
+>>> df = data.penguins()
>>> df.head()
-   petalLength  petalWidth  sepalLength  sepalWidth species
-0          1.4         0.2          5.1         3.5  setosa
-1          1.4         0.2          4.9         3.0  setosa
-2          1.3         0.2          4.7         3.2  setosa
-3          1.5         0.2          4.6         3.1  setosa
-4          1.4         0.2          5.0         3.6  setosa
+  Species     Island  Beak Length (mm)  Beak Depth (mm)  Flipper Length (mm)  Body Mass (g)     Sex
+0  Adelie  Torgersen              39.1             18.7                181.0         3750.0    MALE
+1  Adelie  Torgersen              39.5             17.4                186.0         3800.0  FEMALE
+2  Adelie  Torgersen              40.3             18.0                195.0         3250.0  FEMALE
+3  Adelie  Torgersen               NaN              NaN                  NaN            NaN    None
+4  Adelie  Torgersen              36.7             19.3                193.0         3450.0  FEMALE
```

If you're curious about the source data, you can access the URL for any of the available datasets:

```python
->>> data.iris.url
-'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/iris.json'
+>>> data.penguins.url
+'https://cdn.jsdelivr.net/npm/vega-datasets@2.1.0/data/penguins.json'
```

For datasets bundled with the package, you can also find their location on disk:

```python
->>> data.iris.filepath
-'/lib/python3.6/site-packages/vega_datasets/data/iris.json'
+>>> data.penguins.filepath
+'/lib/python3.8/site-packages/vega_datasets/data/penguins.json'
```

## Available Datasets
@@ -63,16 +63,15 @@ To list all the available datsets, use ``list_datasets``:

```python
>>> data.list_datasets()
-['7zip', 'airports', 'anscombe', 'barley', 'birdstrikes', 'budget', 'budgets', 'burtin', 'cars', 'climate', 'co2-concentration', 'countries', 'crimea', 'disasters', 'driving', 'earthquakes', 'ffox', 'flare', 'flare-dependencies', 'flights-10k', 'flights-200k', 'flights-20k', 'flights-2k', 'flights-3m', 'flights-5k', 'flights-airport', 'gapminder', 'gapminder-health-income', 'gimp', 'github', 'graticule', 'income', 'iris', 'jobs', 'londonBoroughs', 'londonCentroids', 'londonTubeLines', 'lookup_groups', 'lookup_people', 'miserables', 'monarchs', 'movies', 'normal-2d', 'obesity', 'points', 'population', 'population_engineers_hurricanes', 'seattle-temps', 'seattle-weather', 'sf-temps', 'sp500', 'stocks', 'udistrict', 'unemployment', 'unemployment-across-industries', 'us-10m', 'us-employment', 'us-state-capitals', 'weather', 'weball26', 'wheat', 'world-110m', 'zipcodes']
+['7zip', 'airports', 'annual-precip', 'anscombe', 'barley', 'birdstrikes', 'budget', 'budgets', 'burtin', 'cars', 'co2-concentration', 'countries', 'crimea', 'disasters', 'driving', 'earthquakes', 'ffox', 'flare', 'flare-dependencies', 'flights-10k', 'flights-200k', 'flights-20k', 'flights-2k', 'flights-3m', 'flights-5k', 'flights-airport', 'football', 'gapminder', 'gapminder-health-income', 'gimp', 'github', 'income', 'iowa-electricity', 'jobs', 'la-riots', 'londonBoroughs', 'londonCentroids', 'londonTubeLines', 'lookup_groups', 'lookup_people', 'miserables', 'monarchs', 'movies', 'normal-2d', 'obesity', 'ohlc', 'penguins', 'points', 'political-contributions', 'population', 'population_engineers_hurricanes', 'seattle-weather', 'seattle-weather-hourly-normals', 'sp500', 'stocks', 'udistrict', 'unemployment', 'unemployment-across-industries', 'uniform-2d', 'us-10m', 'us-employment', 'us-state-capitals', 'volcano', 'weather', 'wheat', 'windvectors', 'world-110m', 'zipcodes']
```

To list local datasets (i.e. those that are bundled with the package and can be used without a web connection), use the ``local_data`` object instead:

```python
>>> from vega_datasets import local_data
>>> local_data.list_datasets()

-['airports', 'anscombe', 'barley', 'burtin', 'cars', 'crimea', 'driving', 'iowa-electricity', 'iris', 'seattle-temps', 'seattle-weather', 'sf-temps', 'stocks', 'us-employment', "wheat"]
+['airports', 'anscombe', 'barley', 'burtin', 'cars', 'crimea', 'driving', 'iowa-electricity', 'la-riots', 'ohlc', 'penguins', 'seattle-weather', 'seattle-weather-hourly-normals', 'stocks', 'us-employment', 'wheat']
```

We plan to add more local datasets in the future, subject to size and licensing constraints. See the [local datasets issue](https://github.com/altair-viz/vega_datasets/issues/1) if you would like to help with this.
@@ -82,9 +81,9 @@ We plan to add more local datasets in the future, subject to size and licensing
If you want more information about any dataset, you can use the ``description`` property:

```python
->>> data.iris.description
-'This classic dataset contains lengths and widths of petals and sepals for 150 iris flowers, drawn from three species. It was introduced by R.A. Fisher in 1936 [1]_.'
+>>> data.penguins.description
+'Palmer Archipelago (Antarctica) penguin data collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. For more information visit https://github.com/allisonhorst/penguins.'
```

-This information is also part of the ``data.iris`` doc string.
+This information is also part of the ``data.penguins`` doc string.
Descriptions are not yet included for all the datasets in the package; we hope to add more information on this in the future.
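
Because dataset URLs are now pinned to a fixed jsDelivr version, the raw files can also be fetched directly, e.g. with pandas. A minimal sketch (not part of the PR itself; it simply exercises the `.url` attribute shown above):

```python
import pandas as pd
from vega_datasets import data

# .url points at a version-pinned jsDelivr copy of the raw JSON,
# so it can be handed straight to pandas.
df = pd.read_json(data.penguins.url)
print(df.head())
```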
5 changes: 2 additions & 3 deletions tools/download_datasets.py
@@ -24,12 +24,11 @@
"crimea",
"driving",
"iowa-electricity",
"iris",
"la-riots",
"ohlc",
"seattle-temps",
"penguins",
"seattle-weather",
"sf-temps",
"seattle-weather-hourly-normals",
"stocks",
"us-employment",
"wheat",
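
Everything in this list ships inside the package and loads without a network connection. A quick offline check (a sketch; it relies on the package's existing convention that dashes in dataset names become underscores in accessor names):

```python
from vega_datasets import local_data

# Bundled datasets load from disk; no web connection needed.
penguins = local_data.penguins()
normals = local_data.seattle_weather_hourly_normals()
print(len(penguins), len(normals))
```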
4 changes: 2 additions & 2 deletions tools/generate_datasets_json.py
@@ -14,7 +14,7 @@


def main(tag):
-    cwd = os.path.dirname(__file__)
+    cwd = os.path.dirname(os.path.abspath(__file__))
    datasets_src = os.path.join(cwd, "vega-datasets")
    if not os.path.exists(datasets_src):
        print("Cloning vega-datasets...")
@@ -42,7 +42,7 @@ def main(tag):

print("Updating SOURCE_TAG in core file")
subprocess.check_call(
["sed", "-i", ".bak", f"s/SOURCE_TAG.*/SOURCE_TAG = {tag!r}/g", core_file]
["sed", "-i", ".bak", f"s/SOURCE_TAG\ =\ .*/SOURCE_TAG = {tag!r}/g", core_file]
)
subprocess.check_call(["rm", f"{core_file}.bak"])

Expand Down