-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring of DataSource
component for easier, future, extensibility
#26
Comments
DataSource
component for easier, future, extensibility & introduction of SIPPDataSourceDataSource
component for easier, future, extensibility
Hi Ricardo, this is a hugely valuable effort. Thanks so much! I just had a look at the zip file that you link. This generally all looks good. There are a few higher level points to discuss and then some suggestions and minor comments. Higher-level points: There's a balance to strike between the generality of the code and the needs of the typical machine learning researcher. When changing the API, we should keep this in mind. A typical ML grad student should be able to create their own data sources quickly without too much of a learning curve. Otherwise instead of using folktables, researchers will go and write their own ad-hoc code. Concrete suggestions:
Minor comments:
Paging @francesding @millerjohnp @ludwigschmidt since this touches on some major design choices. Please let us know if you have any comments or concerns. |
Hi Ricardo, Thanks for integrating SIPP, it will be a great addition to folktables! A few points below:
|
Hi, thanks for the proposal and initial changes! Beyond what others have mentioned, just want to understand if this is targeting a new dataset or collection of datasets? Or is there a particular part of the current interface you find confusing or annoying? I think it'd be easier to reason about the costs/benefits of interface changes if there's actually a concrete target. Additionally, I like the lightweight interface, and I think it only really makes sense to add a lot of additional complexity and spend dev time implementing and maintaining something it if we're actually getting a new resource or improving quality of life for users. |
Hi all, Thanks for all the feedback and comments. I'll address all of them here. FeedbackPurpose of the proposed changesThe main idea behind the changes I've made is to keep the user facing API intact while changing the way in which the implementation is done. This is in order to make it easier to integrate other datasets in the future. Hence, the example you all list in the README would still be valid: data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True) The way to use the I started to work on these changes when I started implementing the functionality to (down)load the SIPP dataset. I found a lot of the existing code to be useful but couldn't access it for my purposes as it was tailored for the ACS dataset. Hence, I went ahead and refactored the existing code to make it more modular and reusable.
|
SGTM - Thanks so much! |
Sounds great to me! Re: asynchronous downloads, it might be nice to maintain some flexibility with python versions. |
Regarding Python versions: apart from the |
Also just to clarify my comment regarding If we have a use case for extending the Thank you for making the changes! |
@ludwigschmidt You are right, my idea for the I'll also go ahead and revert back to using I'll go ahead and send the first pull request shortly. |
Thanks Ricardo for this work! The revised changes sound good to me.
Thanks,
Frances
…On Mon, Aug 1, 2022 at 2:17 PM Ricardo Sandoval ***@***.***> wrote:
@ludwigschmidt <https://github.com/ludwigschmidt> You are right, my idea
for the _load_data and _download_data methods is for them to only be
called internally by DataSource objects. I'll go ahead and remove them
from the externally visible interface.
I'll also go ahead and revert back to using ABC instead of protocol in
order to still be able to support Python 3.7, and will switch from using
asyncio to concurrent.futures.
I'll go ahead and send the first pull request shortly.
—
Reply to this email directly, view it on GitHub
<#26 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACLSMAFFVDYDHONZZKHJEE3VXA5HVANCNFSM543MMQ3Q>
.
You are receiving this because you were assigned.Message ID:
***@***.***>
|
Hi! I'm proposing a refactoring of folktables's
DataSource
component (refer to this ZIP to see the changes already implemented).Proposed changes
In this section I describe the changes I'm proposing that will affect the structure of the
DataSource
component.DataSource
from anABC
to aProtocol
: as I see it, the role ofDataSource
is to define an interface for classes that help with the (down)loading of different datasets. By havingDataSource
be aProtocol
we'll be taking advantage of Python's duck typing and will make it more straight forward that this is the interface that should be followed (plus, we have the added benefit that we do not have to directly inherit fromDataSource
whenever defining this component for a different dataset).DownloadResource
,LoadResource
, andFilesResource
: these aredataclasses
that contain information pertaining to file paths and URLs for the different datasets the user wants to (down)load. These objects (note thatFilesResource
just contains instances ofDownloadResource
andLoadResource
) makes it easier to keep track of the resources needed to (down)load files.load_acs.py
and theACSDataSource
class inacs.py
into one file: in this refactoring I'm following the re-definedDataSource
interface, I'm using the utility functions I created or refactored, and I'm taking advantage of the*Resource
objects defined. One additional thing I did was to turn the hard coded URL into a "constant" variable.Proposed file structure
The text was updated successfully, but these errors were encountered: