Add pandas helper functions for get_stat_* #144

Merged
merged 37 commits into from
Aug 26, 2020
Changes from 18 commits
Commits
e05e404
Add pandas wrapper function for time series data frame.
tjann Aug 23, 2020
2f070fd
Save work so far on pandas.
tjann Aug 24, 2020
b1feeca
Minor edits.
tjann Aug 24, 2020
516033b
Add function for creating covariate pandas df.
tjann Aug 24, 2020
529aeb3
Add latest date sorting to covariate as well. Add test for covariate …
tjann Aug 24, 2020
a7868e2
stat_vars_test: make response and expected response strings consisten…
tjann Aug 24, 2020
ea3c2ff
Add an example for covariate_pd_input
tjann Aug 24, 2020
ab3f755
Make stat_var examples quoting consistent.
tjann Aug 24, 2020
e72ae4a
Create dcpandas module that uses pandas natively.
tjann Aug 24, 2020
32a0284
Do the python release in another PR.
tjann Aug 24, 2020
160eee6
Remove stale refs in datacommons library to pandas features.
tjann Aug 24, 2020
771b0a5
Update pandas readme.
tjann Aug 24, 2020
5a86466
Cleanup format.
tjann Aug 24, 2020
5780970
Remove pd-related mocks from python testing.
tjann Aug 24, 2020
cb83487
Cosmetics.
tjann Aug 24, 2020
85b3a9b
Update docstring
tjann Aug 24, 2020
4044c03
Fix import statement for pip. Always sort time series df columns.
tjann Aug 24, 2020
d6290be
Restore pandas setup to prepare for release.
tjann Aug 24, 2020
4bba808
change _group_stat_all_by_obs_options mode parameter to time_series b…
tjann Aug 24, 2020
c81eaa6
Address some documentation suggestions from cyin.
tjann Aug 24, 2020
d3a618d
Fix bug from reassigning parameter time_series value in _group_stat_a…
tjann Aug 24, 2020
a4bcf4e
Make df_builder examples more readable.
tjann Aug 24, 2020
a2202c0
Update the docstrings for both PyPI release setup*.py files. Change d…
tjann Aug 24, 2020
0ebb20f
Rename time_series parameter to keep_series for _group_stat_all_by_ob…
tjann Aug 24, 2020
efd2e0c
dcpandas to datacommons_pandas, including all datacommons functions
tjann Aug 25, 2020
f645f3f
Fix various docstrings.
tjann Aug 25, 2020
1ea347d
Merge branch 'master' into pandas-funcs
tjann Aug 25, 2020
5454a93
Merge branch 'pandas-funcs' of github.com:tjann/api-python into panda…
tjann Aug 25, 2020
18cb93e
Add optional args to pandas lib build_time_series to pass onto python…
tjann Aug 25, 2020
51582c8
Update docstrings for time series funcs.
tjann Aug 25, 2020
a49a095
Remove will from CHANGELOG.
tjann Aug 25, 2020
7f46fdd
Reference TODO for cloudbuild pandas-python sync check. Update change…
tjann Aug 25, 2020
87b13ec
Rename covariate* to multivariate*, address cyin's comments on df_bui…
tjann Aug 25, 2020
f116e42
Update docstring for _group_stat_all_by_obs_options.
tjann Aug 25, 2020
975c956
Make err msg for _group_stat_all_by_obs_options no data more general.
tjann Aug 25, 2020
9865e09
Parameterize some pandas lib example functions.
tjann Aug 26, 2020
14dea40
Released pandas.
tjann Aug 26, 2020
20 changes: 10 additions & 10 deletions datacommons/examples/stat_vars.py
@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Basic examples for StatisticalVariable-based param_set Commons API functions."""
"""Basic examples for StatisticalVariable-based param_set Data Commons API functions."""

from __future__ import absolute_import
from __future__ import division
@@ -25,16 +25,16 @@ def main():
param_sets = [
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
},
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
'date': '2018',
},
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
'date': '2018',
'measurement_method': 'CensusACS5yrSurvey',
},
@@ -111,20 +111,20 @@ def call_str(pvs):

pp = pprint.PrettyPrinter(indent=4)
print(
"\nget_stat_all(['geoId/06085', 'country/FRA'], ['Median_Age_Person', 'Count_Person'])"
'\nget_stat_all(["geoId/06085", "country/FRA"], ["Median_Age_Person", "Count_Person"])'
)
print('>>> ')
pp.pprint(
dc.get_stat_all(['geoId/06085', 'country/FRA'],
['Median_Age_Person', 'Count_Person']))
dc.get_stat_all(["geoId/06085", "country/FRA"],
["Median_Age_Person", "Count_Person"]))

print(
"\nget_stat_all(['badPlaceId', 'country/FRA'], ['Median_Age_Person', 'Count_Person'])"
'\nget_stat_all(["badPlaceId", "country/FRA"], ["Median_Age_Person", "Count_Person"])'
)
print('>>> ')
pp.pprint(
dc.get_stat_all(['badPlaceId', 'country/FRA'],
['Median_Age_Person', 'Count_Person']))
dc.get_stat_all(["badPlaceId", "country/FRA"],
["Median_Age_Person", "Count_Person"]))


if __name__ == '__main__':
102 changes: 52 additions & 50 deletions datacommons/stat_vars.py
@@ -20,13 +20,8 @@
from __future__ import division
from __future__ import print_function

from datacommons.utils import _API_ROOT, _API_ENDPOINTS, _ENV_VAR_API_KEY

import collections
import json
import os
import six.moves.urllib.error
import six.moves.urllib.request
import six

import datacommons.utils as utils

@@ -148,55 +143,62 @@ def get_stat_all(places, stat_vars):
>>> get_stat_all(["geoId/05", "geoId/06"], ["Count_Person", "Count_Person_Male"])
{
"geoId/05": {
"Count_Person": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "Wikidata",
"provenanceDomain": "wikidata.org"
},
{
"val": {
"2010": 1333,
"2011": 1309,
"2012": 131,
"Count_Person": {
"sourceSeries": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "Wikidata",
"provenanceDomain": "wikidata.org"
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
{
"val": {
"2010": 1333,
"2011": 1309,
"2012": 131,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
}
],
"Count_Person_Male": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
},
"Count_Person_Male": {
"sourceSeries": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
}
},
"geoId/02": {
"Count_Person": [],
"Count_Person_Male": [
{
"val": {
"2010": 13,
"2011": 13,
"2012": 322,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
"Count_Person": {},
"Count_Person_Male": {
"sourceSeries": [
{
"val": {
"2010": 13,
"2011": 13,
"2012": 322,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
]
}
],
}
}
"""
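The updated docstring nests each StatisticalVariable's data under a `sourceSeries` key and represents missing data as an empty dict rather than an empty list. The following is a minimal sketch of walking a response of that shape. The sample data is abbreviated from the docstring, and the helper name `iter_source_series` is hypothetical, not part of the library.

```python
# Sample shaped like the updated docstring:
# place -> stat var -> {"sourceSeries": [...]}.
response = {
    "geoId/05": {
        "Count_Person": {
            "sourceSeries": [
                {
                    "val": {"2010": 1633, "2011": 1509, "2012": 1581},
                    "observationPeriod": "P1Y",
                    "importName": "Wikidata",
                    "provenanceDomain": "wikidata.org",
                },
                {
                    "val": {"2010": 1333, "2011": 1309, "2012": 131},
                    "observationPeriod": "P1Y",
                    "importName": "CensusPEPSurvey",
                    "provenanceDomain": "census.gov",
                },
            ],
        },
    },
    "geoId/02": {
        "Count_Person": {},  # no data: an empty dict, not an empty list
    },
}


def iter_source_series(stat_all):
    """Yield (place, stat_var, series) for every source series present.

    Hypothetical helper for illustration only.
    """
    for place, stat_vars in stat_all.items():
        for stat_var, data in stat_vars.items():
            # .get() covers the empty-dict case used for missing data.
            for series in data.get("sourceSeries", []):
                yield place, stat_var, series


rows = [(place, stat_var, series["importName"], len(series["val"]))
        for place, stat_var, series in iter_source_series(response)]
```

Callers that previously iterated over a list per StatisticalVariable would need a lookup like the `.get("sourceSeries", [])` above to handle both populated and empty entries.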
70 changes: 39 additions & 31 deletions datacommons/test/stat_vars_test.py
@@ -29,34 +29,42 @@
import datacommons.utils as utils
import json
import unittest
import six
import six.moves.urllib as urllib

# Reusable parts of REST API /stat/all response.
CA_COUNT_PERSON = {
"isDcAggregate":
"true",
"sourceSeries": [
{
"val": {
"1990": 23640,
"1991": 24100,
"1992": 25090,
},
"observationPeriod": "P1Y",
"importName": "WorldDevelopmentIndicators",
"provenanceDomain": "worldbank.org"
"sourceSeries": [{
"val": {
"1990": 23640,
"1991": 24100,
"1993": 25090,
},
{
"val": {
"1790": 3929214,
"1800": 5308483,
"1810": 7239881,
},
"measurementMethod": "WikidataPopulation",
"importName": "WikidataPopulation",
"provenanceDomain": "wikidata.org"
"observationPeriod": "P1Y",
"importName": "WorldDevelopmentIndicators",
"provenanceDomain": "worldbank.org"
}, {
"val": {
"1790": 3929214,
"1800": 5308483,
"1810": 7239881,
},
"measurementMethod": "WikidataPopulation",
"importName": "WikidataPopulation",
"provenanceDomain": "wikidata.org"
}, {
"val": {
"1890": 28360,
"1891": 24910,
"1892": 25070,
},
]
"measurementMethod": "OECDRegionalStatistics",
"observationPeriod": "P1Y",
"importName": "OECDRegionalDemography",
"provenanceDomain": "oecd.org"
}]
}

CA_COUNT_PERSON_MALE = {
@@ -100,7 +108,7 @@
}]
}

HU22_MEDIAN_AGE_PERSON = {
CA_MEDIAN_AGE_PERSON = {
"sourceSeries": [{
"val": {
"1990": 12,
@@ -138,37 +146,37 @@ def read(self):
if req.get_full_url(
) == stat_value_url_base + '?place=geoId/06&stat_var=Count_Person':
# Response returned when querying with basic args.
return MockResponse(json.dumps({'value': 123}))
return MockResponse(json.dumps({"value": 123}))
if req.get_full_url(
) == stat_value_url_base + '?place=geoId/06&stat_var=Count_Person&date=2010':
# Response returned when querying with observationDate.
return MockResponse(json.dumps({'value': 133}))
return MockResponse(json.dumps({"value": 133}))
if (req.get_full_url() == stat_value_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'date=2010&measurement_method=CensusPEPSurvey&' +
'observation_period=P1Y&unit=RealPeople&scaling_factor=100'):
# Response returned when querying with above optional params.
return MockResponse(json.dumps({'value': 103}))
return MockResponse(json.dumps({"value": 103}))

# Mock responses for urlopen requests to get_stat_series.
if req.get_full_url(
) == stat_series_url_base + '?place=geoId/06&stat_var=Count_Person':
# Response returned when querying with basic args.
return MockResponse(json.dumps({'series': {'2000': 1, '2001': 2}}))
return MockResponse(json.dumps({"series": {"2000": 1, "2001": 2}}))
if (req.get_full_url() == stat_series_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'measurement_method=CensusPEPSurvey&observation_period=P1Y&' +
'unit=RealPeople&scaling_factor=100'):

# Response returned when querying with above optional params.
return MockResponse(json.dumps({'series': {'2000': 3, '2001': 42}}))
return MockResponse(json.dumps({"series": {"2000": 3, "2001": 42}}))
if (req.get_full_url() == stat_series_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'measurement_method=DNE'):

# Response returned when data not available for optional parameters.
# /stat/series?place=geoId/06&stat_var=Count_Person&measurement_method=DNE
return MockResponse(json.dumps({'series': {}}))
return MockResponse(json.dumps({"series": {}}))

# Mock responses for urlopen requests to get_stat_all.
if req.get_full_url() == stat_all_url_base:
@@ -204,7 +212,7 @@ def read(self):
"geoId/06": {
"statVarData": {
"Count_Person": CA_COUNT_PERSON,
"Median_Age_Person": HU22_MEDIAN_AGE_PERSON
"Median_Age_Person": CA_MEDIAN_AGE_PERSON
}
},
"nuts/HU22": {
@@ -274,7 +282,7 @@ def test_basic(self, urlopen):
"""Calling get_stat_series with minimal and proper args."""
# Call get_stat_series
stats = dc.get_stat_series('geoId/06', 'Count_Person')
self.assertEqual(stats, {'2000': 1, '2001': 2})
self.assertEqual(stats, {"2000": 1, "2001": 2})

@patch('six.moves.urllib.request.urlopen', side_effect=request_mock)
def test_opt_args(self, urlopen):
@@ -283,7 +291,7 @@ def test_opt_args(self, urlopen):
# Call get_stat_series with all optional args
stats = dc.get_stat_series('geoId/06', 'Count_Person',
'CensusPEPSurvey', 'P1Y', 'RealPeople', 100)
self.assertEqual(stats, {'2000': 3, '2001': 42})
self.assertEqual(stats, {"2000": 3, "2001": 42})

# Call get_stat_series with non-satisfiable optional args
stats = dc.get_stat_series('geoId/06', 'Count_Person', 'DNE')
@@ -316,7 +324,7 @@ def test_basic(self, urlopen):
exp = {
"geoId/06": {
"Count_Person": CA_COUNT_PERSON,
"Median_Age_Person": HU22_MEDIAN_AGE_PERSON
"Median_Age_Person": CA_MEDIAN_AGE_PERSON
},
"nuts/HU22": {
"Count_Person": HU22_COUNT_PERSON,
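The tests above replace `urlopen` with a `side_effect` function that returns canned JSON based on the full request URL. The following is a self-contained sketch of that pattern. It uses Python 3's `urllib.request` directly instead of `six.moves`, and the endpoint `SERIES_URL` and the client helper `fetch_series` are hypothetical stand-ins for the library code under test.

```python
import json
import urllib.request
from unittest.mock import patch

# Hypothetical endpoint, used only for this illustration.
SERIES_URL = "https://api.example.test/stat/series"


class MockResponse:
    """Minimal stand-in for the object returned by urlopen."""

    def __init__(self, json_data):
        self.json_data = json_data

    def read(self):
        return self.json_data


def request_mock(req, *args, **kwargs):
    """Return a canned payload chosen by the full request URL."""
    if req.get_full_url() == (SERIES_URL +
                              "?place=geoId/06&stat_var=Count_Person"):
        return MockResponse(json.dumps({"series": {"2000": 1, "2001": 2}}))
    # Any other URL gets an empty series, as in the "DNE" case.
    return MockResponse(json.dumps({"series": {}}))


def fetch_series(place, stat_var):
    """Hypothetical client helper: GET the series and decode the JSON body."""
    url = "{}?place={}&stat_var={}".format(SERIES_URL, place, stat_var)
    req = urllib.request.Request(url)
    return json.loads(urllib.request.urlopen(req).read())["series"]


# No network traffic happens inside the patch context.
with patch("urllib.request.urlopen", side_effect=request_mock):
    stats = fetch_series("geoId/06", "Count_Person")
    missing = fetch_series("geoId/06", "Does_Not_Exist")
```

Because the mock matches on the exact URL string, a change in query parameter order in the client would fall through to the default branch, which is worth keeping in mind when a test unexpectedly sees an empty series.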
21 changes: 21 additions & 0 deletions dcpandas/CHANGELOG.md
@@ -0,0 +1,21 @@
# Changelog

## 0.0.1

**Date** - 08/24/2020

**Release Tag** - [pd.0.0.1](https://github.com/datacommonsorg/api-python/releases/tag/pd0.0.1)

**Release Status** - Current head of branch [`master`](https://github.com/datacommonsorg/api-python/tree/master)

Added pandas wrapper functions.

- `build_time_series` constructs a pd.Series for a given StatisticalVariable and Place, with dates as the index.
- `build_time_series_dataframe` constructs a pd.DataFrame for a given StatisticalVariable and a set of Places, with Places as the index and dates as the columns.
- `build_covariate_dataframe` constructs a covariate pd.DataFrame for a set of StatisticalVariables and a set of Places, with Places as the index and StatisticalVariables as the columns. The values are the most recent values for the chosen StatVarObservation options.
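
The layouts described above can be illustrated without calling the library. This is a dependency-free sketch under stated assumptions: the sample values and the helper names `to_place_by_date_table` and `latest_values` are hypothetical, and the real functions return pandas objects rather than plain dicts and lists.

```python
# Hypothetical sample: one StatisticalVariable's values, keyed by place, then date.
values_by_place = {
    "geoId/06": {"2011": 20, "2010": 10},
    "geoId/05": {"2010": 5},
}


def to_place_by_date_table(values_by_place):
    """Return (sorted date columns, {place: row}), using None for missing cells."""
    dates = sorted({d for series in values_by_place.values() for d in series})
    table = {
        place: [series.get(d) for d in dates]
        for place, series in values_by_place.items()
    }
    return dates, table


def latest_values(values_by_place):
    """Return {place: most recent value}.

    Assumes ISO-style date strings, which sort chronologically as text.
    """
    return {
        place: series[max(series)]
        for place, series in values_by_place.items()
        if series
    }


columns, table = to_place_by_date_table(values_by_place)
latest = latest_values(values_by_place)
```

Here `columns` and `table` mirror the Places-by-dates layout of the time series data frame, and `latest` corresponds to a single StatisticalVariable column of the covariate data frame.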

For multi-place functions, when a StatisticalVariable has multiple StatVarObservation options,
Data Commons chooses the set of StatVarObservation options that covers the most places. This
ensures that the data fetched for a StatisticalVariable is comparable across places.
When there is a tie, Data Commons selects the set of StatVarObservation options with the
most recent date for which data is available for any place.
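
The selection rule above can be sketched as a ranking by (number of places covered, latest available date). This is an illustrative sketch only, not the library's implementation: the function name `choose_obs_options`, the input structure, and the sample values are all assumptions.

```python
def choose_obs_options(latest_date_by_place_by_options):
    """Pick the options key covering the most places; break ties by latest date.

    The input is a hypothetical structure:
    {options_key: {place: latest_available_date}}.
    Raises ValueError if there are no candidate options at all.
    """

    def rank(item):
        _, latest_by_place = item
        coverage = len(latest_by_place)
        # ISO-style date strings compare chronologically as text.
        latest = max(latest_by_place.values(), default="")
        return (coverage, latest)

    best_key, _ = max(latest_date_by_place_by_options.items(), key=rank)
    return best_key


# Hypothetical candidates keyed by (measurement method, observation period).
candidates = {
    ("CensusPEPSurvey", "P1Y"): {"geoId/05": "2012", "geoId/06": "2012"},
    ("WikidataPopulation", None): {"geoId/05": "2019", "geoId/06": "2018"},
    ("OECDRegionalStatistics", "P1Y"): {"geoId/06": "2020"},
}

chosen = choose_obs_options(candidates)
```

In this sample the first two candidates both cover two places, so the tie is broken by the latest date; the third has the newest data but covers only one place and is not chosen.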
47 changes: 47 additions & 0 deletions dcpandas/README.md
@@ -0,0 +1,47 @@
# Data Commons Pandas API

This is a Python library for creating pandas objects with data from the
Data Commons Graph.
To get started, install this package with pip.

pip install dcpandas

Once the package is installed, import `dcpandas`.

import dcpandas as dcpd

For more detail on getting started with the API, please visit our
[API Overview](http://docs.datacommons.org/api/).

Once you're ready to use the API, refer to `dcpandas/examples` for
examples of how to use this package to perform various tasks. More tutorials and
documentation can be found on the [tutorials](https://datacommons.org/colab) page.

## About Data Commons

[Data Commons](https://datacommons.org/) is an open knowledge repository that
provides a unified view across multiple public data sets and statistics. You can
view what [datasets](https://datacommons.org/datasets) are currently ingested
and browse the graph using our [browser](https://browser.datacommons.org/).

## License

Apache 2.0

## Development

Please follow the Development instructions from the root directory.

## Release to PyPI

- Update "VERSION" in setup.py
- Update CHANGELOG.md for a new version
- Upload a new package using steps for [generating distribution archives](https://packaging.python.org/tutorials/packaging-projects/#generating-distribution-archives) and [uploading the distribution archives](https://packaging.python.org/tutorials/packaging-projects/#uploading-the-distribution-archives)

## Support

For general questions or issues about the API, please open an issue on our
[issues](https://github.com/datacommonsorg/api-python/issues) page. For all other
questions, please send an email to `[email protected]`.

**Note** - This is not an officially supported Google product.