-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return list of lists instead of series #73
Comments
Results of discussion with @ACscooter @shifucun. These are the possible options.
Trade off and Differences between 2. and 3. In option 2. the API has one path, it will always return a list of lists. If the user wants to get output as a dictionary the user can utilize the state_dcids = dc.get_places_in(['country/USA'], 'State')[0]
state_names = dc.get_property_values(state_dcids, 'name')
state_dict = {k:v for k,v in zip(state_dcid, state_names)} # dcid to name The transformation to a dictionary may not be straightforward for beginners or non python programmers. Secondly, the list of lists has implied structure based on ordering. The user may prefer to have the more explicit structure of a dictionary. As an example consider that a user gets state dcids in an IPython environment and wants to determine which dcid is associated with California. If we return a dictionary then the user can determine the dcid by scanning the output. >>> state_dcids = dc.get_places_in(['country/USA'], 'State')[0]
>>> dc.get_property_values(state_dcids, 'name')
{ 'geoId/06': ['California'],
...
} We created option 3. to support dictionary output with minimal data transformation from the user. In option 3. an optional parameter Feedback from anyone is appreciated. |
@lukesmurray , what do you expect the intended call pattern to be for Option 2? Re: Option 3, it seems unnecessary for the API to implement the dictionary conversion to support random lookups (would one use the python API for that type of browsing vs the browser?) What other use cases did you have in mind that would require a dictionary response? |
For option 2 the call pattern is essentially the same but less complicated and lower overhead. Current Call Pattern # create pandas series to get series as output. feels unnecessary
# get the first value since its a series of lists
counties = dc.get_places_in(pd.Series(['geoId/06']), 'County')[0]
# but the first value is a list so we need to make it a series
counties = pd.Series(counties)
# rest is ok
names = dc.get_property_values(counties, 'name')
populations = dc.get_populations(counties, "Person", {})
observations = dc.get_observations(populations, "count", "measured_value", "2017", measurement_method='CenusACS5yrSurvey')
dc_data = pd.DataFrame({'dcid': counties, 'name': names, 'population': observations}) Proposed Call Pattern # no unnecessary pandas objects. no reliance on pandas. code is shorter
counties = dc.get_places_in(['geoId/06'], 'County')[0]
names = dc.get_property_values(counties, 'name')
populations = dc.get_populations(counties, "Person", {})
observations = dc.get_observations(populations, "count", "measured_value", "2017", measurement_method='CenusACS5yrSurvey')
dc_data = pd.DataFrame({'dcid': counties, 'name': names, 'population': observations}) Option 3 could also be used for rapid fire saving json objects with the input mapping encoded. For example get the names of all the states and save a mapping to a file. I personally have never used a dictionary output so I am not the best advocate for it. I have yet to come across an example where I needed one. |
Still trying to understand the proposal. What's the difference between the proposed call pattern and the following code snippet? # This snippet is currently supported by version 1.0.0
data = pd.DataFrame({'state': ['geoId/06']})
# Get places in California, and flatten the results out.
data['counties'] = dc.get_places_in(data['state'], 'County')
data = dc.flatten_data(data)
# Perform the rest of the calls
data['pop'] = dc.get_populations(data['counties'], 'Person')
data['obs'] = dc.get_observations(data['pop'], 'count', 'measuredValue', '2017', measurement_method='CenusACS5yrSurvey') |
@ACscooter the proposed call pattern will look exactly the same. The proposed call pattern is for the input to be an iterable. The example snippet passes Pandas Series to the API which are iterables. In the example, passing a different iterable is odd and requires an unnecessary cast but changing the type of iterable does not change the output. The proposed call pattern is for the output to be a list of lists. In the example snippet we assign a series of lists to a Pandas column using Hopefully the fact that option 2 leads to similar code is seen as beneficial. If we view the snippet as a common example then we find that there is no need for series to be returned. The series will just be flattened and is simply acting as a bucket to store lists. While writing this I realized we may want to amend the original proposal to allow for any |
Returning a series of lists from the api does not add much benefit to the user.
Example
Pandas series are designed to be a 1-d array so using them to store multidimensional data may throw some people off. A user can transform a list of lists to a pandas series easily if they want. The intended use case of returning pandas series was to support column assignment in dataframes. But lists of lists can be used to create dataframe columns without any conversion to a pandas series.
I suspect the majority of operations performed on the returned series will be iterating and indexing which lists support.
The text was updated successfully, but these errors were encountered: