progress #1
base: main
Conversation
Good job overall. I will merge this PR as soon as you update this notebook as explained in this review's comments.
df1 = pd.read_csv("data.csv", low_memory=False, skiprows=4, parse_dates=True, index_col=["Country","Date"])
"data.csv" refers to what period (2020Q1, 2020Q2, 2020Q3, 2020Q4, 2019Q1, 2019Q2, ...)? Please specify this in a comment above this line or in a markdown cell above this one.
df_ug = pd.read_csv("Ugandacovid.csv", low_memory=False, parse_dates=True)
Did you filter out Uganda's data manually or using code? In the latter case, please add that code to better document the process.
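For reference, the filtering step could be documented with a few lines like the following sketch. The column name `Country` and the value `UG` are assumptions here (they depend on how the worldwide file encodes countries), and the toy data stands in for the real download:

```python
import pandas as pd

# Sketch (assumed file/column names): filter the worldwide dataset down to
# Uganda and keep the subset, so the filtering step is reproducible.
df = pd.DataFrame({
    "Country": ["UG", "FR", "UG"],
    "Specie": ["temperature", "pm25", "humidity"],
    "median": [24.0, 12.5, 70.0],
})
df_ug = df[df["Country"] == "UG"].reset_index(drop=True)
# df_ug.to_csv("Ugandacovid.csv", index=False)  # persist the subset
```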
plt.figure(figsize=[12,8])
plt.plot(df_ug[df_ug['Specie']=='temperature'].loc[:30,'Date'], df_ug[df_ug['Specie']=='temperature'].loc[:30,'median'])
You already started visualizing the data, great job. Starting with the median is a good choice; I would recommend visualizing more statistical measures (on the same plot) next. You can, for example, use a box plot to visualize the min, max, median, and variance values altogether. Here's an example:
https://stackoverflow.com/questions/33328774/box-plot-with-min-max-average-and-standard-deviation/33330997
EDA.ipynb
I believe you forgot to change column_name to a more meaningful name in the following code:
missing_value_df = pd.DataFrame({'column_name': ind1.columns,
'percent_missing': percent_missing})
You can, for example, change it to numerical measure:
https://www.britannica.com/science/statistics/Numerical-measures
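Concretely, the rename is a one-word change in the dict key (toy `ind1` used here as a stand-in for the real frame):

```python
import pandas as pd

# Sketch (toy data): same summary as above, with 'column_name' renamed to
# the more meaningful 'numerical measure'.
ind1 = pd.DataFrame({"min": [1.0, None], "max": [2.0, 3.0]})
percent_missing = ind1.isnull().sum() * 100 / len(ind1)
missing_value_df = pd.DataFrame({"numerical measure": ind1.columns,
                                 "percent_missing": percent_missing.values})
```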
In this cell:
import seaborn as sns
correlation_mat = france[["min","max","median","variance","type"]].corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()
I can see that you are computing the correlation between the air quality metrics distributions and the lockdown type. However, the code as it is right now considers all the species as a whole, which is not what we are looking for here; we should separate each species (pm10, humidity, o3, co, no2, so2, wind-speed, wind-gust, dew, ...) to know which ones to include in the training process and which ones to ignore. By the way, this is what you tried to do by plotting the data, but you didn't compute the separate correlations.
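One way to get per-species correlations is a `groupby` on `Specie`. This is a sketch on made-up data, and it assumes the lockdown `type` has been numerically encoded (e.g. 0 = none, 1 = partial, 2 = full), which the notebook would need to do first:

```python
import pandas as pd

# Sketch (toy data): compute the correlation between 'median' and the
# lockdown 'type' separately for each Specie, instead of pooling them.
france = pd.DataFrame({
    "Specie": ["pm10", "pm10", "pm10", "no2", "no2", "no2"],
    "median": [40.0, 30.0, 20.0, 25.0, 18.0, 12.0],
    "type":   [0, 1, 2, 0, 1, 2],  # assumed numeric encoding of lockdown type
})
per_species_corr = (
    france.groupby("Specie")
          .apply(lambda g: g["median"].corr(g["type"]))
)
print(per_species_corr)
```

The result is one correlation coefficient per species, which is exactly what is needed to decide which species to keep for training.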
This remark applies to other plots besides this one:
# Plot time series dataset
ax = humidity.plot(linewidth=2, fontsize=12);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
The labels of the x-axis are hard to read; find a way to format the Date column so it looks like a date. Also, I know this notebook was pushed before our discussion about the distribution, but I would prefer that you plot the Probability Density Function (PDF) of the distribution instead of superimposing the statistical measures; that way, the plot will be cleaner and easier to read.
I'm not sure why you wanted to read pm10 and no2 median values side by side; it would be great if you documented your reasoning a bit to explain your thought process for the readers of your work (and for you in the future, in case you need to look back at this notebook).
pm10_1.merge(no2_1[['no2', 'Date']], on = 'Date', how = 'left')
Same for the pivot table: why did you choose those particular columns, and what are you trying to do here? Even a one-line comment can be enough to describe your idea(s).
france_clean = france.pivot_table(index=['Date', 'Country', 'City','type'],columns='Specie',values='variance').reset_index().sort_values(['Country','City'])
Here, the date in the x-axis looks a lot better than earlier. However, I recommend turning the y-axis to a logarithmic scale (instead of the linear scale used by default), just to see both the big and the small values:
france_clean[france_clean.columns.drop(['Country', 'City', 'type'])].plot(figsize=(15, 6))
plt.show()
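The switch to a log scale is a single keyword argument in pandas plotting (`logy=True`); a minimal sketch on toy data with very different magnitudes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt
import pandas as pd

# Sketch (toy data): logy=True switches the y-axis to a logarithmic scale
# so that species with large and small values stay visible on one plot.
france_clean = pd.DataFrame({
    "pm25": [5.0, 8.0, 6.0],
    "co":   [500.0, 900.0, 700.0],
})
ax = france_clean.plot(figsize=(15, 6), logy=True)
plt.savefig("log_scale.png")
```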
I agree, a scatter plot might be a better alternative to the line plots used earlier:
px.scatter(data_frame=france_clean[france_clean['City']=='Paris'],x=france_clean[france_clean['City']=='Paris'].index,y='pm25',color='type')
And you already tried to plot a Kernel Density Estimate (KDE)? Nice to see that:
df_ozone.plot(kind='kde',figsize=[14,12]);
And then you trained an fb_prophet model to forecast the evolution of o3 in the French air. Now, focus on the dates and train two versions: one on data from before a lockdown and another on data from after it, and see if there are any differences between the two.
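The before/after split could look like the sketch below. The lockdown date and the toy series are assumptions, and the Prophet calls are left as comments since the point here is the split (Prophet expects a frame with `ds`/`y` columns):

```python
import pandas as pd

# Sketch (assumed lockdown date, toy o3 series): split the series at the
# first lockdown and prepare one training frame per period; each would then
# be fitted with its own fb_prophet model.
lockdown_start = pd.Timestamp("2020-03-17")  # assumed French lockdown start
df_ozone = pd.DataFrame({
    "ds": pd.to_datetime(["2020-03-01", "2020-03-10", "2020-03-20", "2020-04-01"]),
    "y":  [30.0, 28.0, 35.0, 40.0],
})
before = df_ozone[df_ozone["ds"] < lockdown_start]
after = df_ozone[df_ozone["ds"] >= lockdown_start]
# from prophet import Prophet
# m_before, m_after = Prophet().fit(before), Prophet().fit(after)
# ...then compare the two forecasts over the same horizon.
```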
Duplicate air_lockdown notebooks
It looks like air lockdown.ipynb, air_lockdown.ipynb, and notebooks/air lockdown.ipynb are duplicates of the same notebook, which I already reviewed before. Only keep one of them and remove the other copies.
air_lockdown_filtering.ipynb
The air_lockdown_filtering.ipynb notebook downloads data and filters out all countries except France, but I don't see where in the code you exported/saved the france_lockdown.csv file. Documenting the processing is more important than the final result: when we have the processing script, we can run it again and regenerate the result (if it's not there already), but the other way around is hard to predict (given the data, what processing has it been through?).
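The missing export is a one-liner with `to_csv`; a sketch with assumed column names and a toy frame standing in for the filtered data:

```python
import pandas as pd

# Sketch (assumed frame/filename): after filtering the worldwide data down
# to France, persist the result so the notebook documents the full pipeline.
france_lockdown = pd.DataFrame({
    "Date": ["2020-03-17"], "Country": ["FR"], "Specie": ["pm25"], "median": [12.5],
})
france_lockdown.to_csv("france_lockdown.csv", index=False)
reloaded = pd.read_csv("france_lockdown.csv")  # sanity check the round trip
```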
web_scraping.ipynb
The web_scraping.ipynb notebook aims to scrape data from the COVID-19_lockdowns Wikipedia article. I saw that you started with BeautifulSoup and then switched to pd.read_html; I agree that in this particular case, using pandas is the best (and easiest) option.
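For readers unfamiliar with the approach, `pd.read_html` returns one DataFrame per `<table>` it finds, which is why it beats hand-rolled BeautifulSoup parsing here. This sketch runs on an inline HTML snippet rather than the live Wikipedia page (and requires an HTML parser such as lxml installed):

```python
import io
import pandas as pd

# Sketch (inline HTML instead of the live article): read_html parses every
# <table> into a DataFrame, headers taken from the <th> row.
html = """
<table>
  <tr><th>Country</th><th>Start date</th></tr>
  <tr><td>France</td><td>2020-03-17[1]</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))  # list with one DataFrame per table
df_nat = tables[0]
```

Note the `[1]` reference bracket in the scraped value, which is exactly what the regex cleanup below removes.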
I liked how you used a regular expression (regex) to remove the reference brackets from the scraped data:
df_nat_clean3 = df_nat_clean3.replace(to_replace ='\[.*', value = '', regex = True)
df_nat_clean3
Regarding the data/processed/french_data.csv file, make sure to avoid uploading it with the code, by including its path in the .gitignore file so that you can keep working on it locally on your machine without removing it manually with each commit. A better approach is to document the process of acquiring and processing that file; instead of uploading it, I suggest sharing it on Dropbox or Google Drive and including the download link.
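For reference, the corresponding .gitignore entry would simply be the file's path (or a pattern covering the whole processed-data folder):

```
# keep generated datasets out of version control
data/processed/french_data.csv
```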
This is our progress so far.