[BUG] Two series' in an import can only differ by dcAggregate/ #862

Open
pradh opened this issue Jun 7, 2022 · 1 comment

pradh commented Jun 7, 2022

When two series in an import differ only by dcAggregate/, it seems the Mixer picks only one of them, because the metadata hash does not include is_dc_aggregate.
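
A minimal sketch of how such a collision could happen, assuming series are keyed by a hash of their metadata. The field names, placeholder values, and hashing scheme are hypothetical and are not taken from the Mixer code:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesMetadata:
    # Hypothetical metadata fields, for illustration only.
    import_name: str
    measurement_method: str
    observation_period: str
    is_dc_aggregate: bool


def metadata_hash(meta: SeriesMetadata, include_aggregate_flag: bool) -> str:
    """Hash the metadata, optionally folding in is_dc_aggregate."""
    parts = [meta.import_name, meta.measurement_method, meta.observation_period]
    if include_aggregate_flag:
        parts.append(str(meta.is_dc_aggregate))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def index_by_hash(series_list, include_aggregate_flag: bool) -> dict:
    """Key each series by its hash; a later series overwrites an equal key."""
    indexed = {}
    for meta in series_list:
        indexed[metadata_hash(meta, include_aggregate_flag)] = meta
    return indexed


# Two series that differ only in the aggregate flag (placeholder values).
raw = SeriesMetadata("USCensusPEP_By_Sex_Race", "ExampleMethod", "P1Y", False)
aggregated = SeriesMetadata("USCensusPEP_By_Sex_Race", "ExampleMethod", "P1Y", True)

without_flag = index_by_hash([raw, aggregated], include_aggregate_flag=False)
with_flag = index_by_hash([raw, aggregated], include_aggregate_flag=True)

print(len(without_flag), len(with_flag))  # prints "1 2"
```

Under these assumptions, leaving the flag out of the hash collapses the two series into one entry, which would match the symptom described here.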

This happens with the Census PEP imports because they stitch together multiple historical CSVs into one import, and for some year ranges the data isn't available and needs to be aggregated (dcAggregate/). Currently, they set dcAggregate/ only for those aggregated years.

Validation:

./check_bt d/3/country/USA^Count_Person_Male frequent | ./cache_parse returns two series with import name USCensusPEP_By_Sex_Race

curl -X POST 'https://autopush.api.datacommons.org/stat/all' -d '{ "places": ["country/USA"], "stat_vars": ["Count_Person_Male"]}' | jq returns only one series
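
To make the comparison less manual, the saved JSON output could be run through a small counter like the one below. This is only a sketch: the `importName` key is an assumption about the response, and the sample payload is a hand-written stand-in, not real API output.

```python
import json


def count_series_by_import(node, key="importName"):
    """Walk parsed JSON and count objects that carry the given key.

    The key name is an assumption; adjust it to whatever the response uses.
    """
    counts = {}
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            name = current.get(key)
            if isinstance(name, str):
                counts[name] = counts.get(name, 0) + 1
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
    return counts


# Hand-written stand-in payload, not an actual /stat/all response.
sample = json.loads(
    '{"series": [{"importName": "USCensusPEP_By_Sex_Race", "val": {}},'
    ' {"importName": "SomeOtherImport", "val": {}}]}'
)
counts = count_series_by_import(sample)
print(counts.get("USCensusPEP_By_Sex_Race", 0))  # prints "1"
```

A count of 1 from the API next to two series in the cache would confirm the mismatch.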

shifucun commented Jun 7, 2022

A few general questions and thoughts on aggregated data:

  1. When do we want to expose "is_aggregate" for an observation?
  2. If we claim an observation is aggregated, should it be in the import_name or in the measurement_method?
  3. If one import contains both aggregated and non-aggregated data, should we present them uniformly, or is it necessary to differentiate them for users?

Right now, we handle aggregated data as a separate import (series). In some cases this is not user-friendly; e.g., city-level data (raw) and county-level data (aggregated) are presented as two distinct series, which from the user's perspective is unnecessary.

In the case of this bug, the aggregation mechanism is even more subtle, and I doubt we should expose that complexity in the final data presentation.

A non-intrusive way would be to add a metadata property to an import indicating which place types and variables are aggregated. If users do need to figure out the subtlety, they can look it up from this metadata.
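
One possible shape for such a metadata property, as a rough sketch. The class name, field names, and example values are all hypothetical, and nothing here reflects an existing schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ImportAggregationInfo:
    """Hypothetical per-import record of which data is aggregated."""

    import_name: str
    # Maps a place type to the set of variables aggregated at that level.
    aggregated: Dict[str, Set[str]] = field(default_factory=dict)

    def is_aggregated(self, place_type: str, stat_var: str) -> bool:
        """Return True if this variable is aggregated for this place type."""
        return stat_var in self.aggregated.get(place_type, set())


# Example: city-level data is raw, county-level data is aggregated.
info = ImportAggregationInfo(
    import_name="ExampleImport",
    aggregated={"County": {"Count_Person"}},
)

print(info.is_aggregated("County", "Count_Person"))  # prints "True"
print(info.is_aggregated("City", "Count_Person"))    # prints "False"
```

With something like this, the series itself could stay uniform for users, and the aggregation detail would only be consulted on demand.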
