-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
South Korea dataset module #45
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
audit.md
and datapackage.json
are missing.
Also, on your PR template, there was no checklist?
### PyCharm ### | ||
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm | ||
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally fine with you updating the .gitignore
with required files, but adding your personal IDE files is considered a bad practice, you can see how to move this contents into a global .gitignore
file for your local installation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.gitignore
has been updated.
also, audit.md
and datapackage.json
will be added too.
also, didnt notice the PULL_REQUEST_TEMPLATE.md
, so ill take a look into it! thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, I didn't know you can set a global gitignore thank you @ManuelAlvarezC !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ManuelAlvarezC Could you elaborate on the audit.md
and the datapackage.json
? just this two left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although is not as complete as it should be, thedocumentation will help you get a good grab on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ManuelAlvarezC Thank you, ill take a look at it, finalize the edit and make the push.
@@ -0,0 +1,3 @@ | |||
from task_geo.data_sources.covid.south_korea.south_korea_patients import south_korea_patients | |||
|
|||
__all__ = ['south_korea_patients'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename your data source to south_korea
, or even better, the ISO code, something like kr_covid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
'infected_by', 'contact_number' | ||
] | ||
df = df.reindex(columns=cols_ordered) | ||
df['confirmed_date'] = pd.to_datetime(df.confirmed_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast can be done in two lines with:
date_columns = [...]
df[date_columns] = df[date_columns].apply(pd.to_datetime)
# This was written originally as pd.to_datetime(df[columns]) which crashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ManuelAlvarezC Could you elaborate on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing.
On the lines between 37-41(I only selected the line 37 to comment, a blunder of mine) you are casting columns to datetime
and reassigning them multiple times. This approach has two drawbacks:
- More lines of code to read and write, making it easier to miss details and introduce errors.
- It's in fact much faster making the casting and assigning of all the columns at once. This is what is called vectorization and pandas ( and numpy too) are designed to have vectorized operations run much faster than regular iteration in python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, passing the date format to to_datetime
will further improve its performance.
import requests | ||
|
||
|
||
def south_korea_patients_connector(*args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why the signature of the functions is with *args, and **kwargs if you are only expecting one argument.
Also, this argument is a url, does this data source work with any url? I think this connector should take no arguments. Can you please update the signature for both connector and data source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, takes no argument now and the url has been set
Update: audit.md is done, some changes are made to comply with the lint for flake8 (checks during commit). Just have the datapackage.json left to do. once this is done, will be making another commit |
Extraction and formatting of the dataset of confirmed cases, deaths and recovered per patient in South Korea #20