
progress #1

Open · wants to merge 12 commits into main
Conversation

AimjGuytidy commented:

This is our progress so far.

@stoufa (Owner) left a comment:

Good job overall. I will merge this PR as soon as you update this notebook as explained in this review's comments.

In this cell:

df1 = pd.read_csv("data.csv", low_memory=False, skiprows=4, parse_dates=True, index_col=["Country", "Date"])

"data.csv" refers to what period? (2020Q1, 2020Q2, 2020Q3, 2020Q4, 2019Q1, 2019Q2, ...)? please specify this in a comment above this line or in a markdown cell above this one

In this cell:

df_ug = pd.read_csv("Ugandacovid.csv", low_memory=False, parse_dates=True)

Did you filter out Uganda's data manually or using code? In the latter case, please add that code to better document the process.
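For reference, a minimal sketch of what that filtering step could look like, assuming the global dataset has a "Country" column; the source file name waqi_data.csv is hypothetical:

import pandas as pd

df_all = pd.read_csv("waqi_data.csv", low_memory=False, parse_dates=True)  # hypothetical source file
df_ug = df_all[df_all["Country"] == "UG"]     # assumption: Uganda is encoded as "UG"
df_ug.to_csv("Ugandacovid.csv", index=False)  # reproduces the file read above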

In this cell:

plt.figure(figsize=[12, 8])
plt.plot(df_ug[df_ug['Specie'] == 'temperature'].loc[:30, 'Date'],
         df_ug[df_ug['Specie'] == 'temperature'].loc[:30, 'median'])

You already started visualizing the data, great job. Starting with the median is a good choice; next, I would recommend visualizing more statistical measures on the same plot. You can, for example, use a box plot to visualize the min, max, median, and variance values altogether, as in the sketch below. Here's an example:
https://stackoverflow.com/questions/33328774/box-plot-with-min-max-average-and-standard-deviation/33330997
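A minimal sketch along those lines, using matplotlib's Axes.bxp to draw a box plot from precomputed statistics; the whisker and quartile choices below are assumptions for illustration:

import matplotlib.pyplot as plt

temp = df_ug[df_ug['Specie'] == 'temperature']
stats = [{
    'label': 'temperature',
    'med': temp['median'].median(),
    'q1': temp['median'].quantile(0.25),  # assumption: quartiles of the daily medians
    'q3': temp['median'].quantile(0.75),
    'whislo': temp['min'].min(),          # whiskers at the overall min/max
    'whishi': temp['max'].max(),
    'fliers': [],
}]

fig, ax = plt.subplots(figsize=(8, 6))
ax.bxp(stats)  # draws a box plot from the precomputed statistics
plt.show()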

@AimjGuytidy reopened this Nov 14, 2021
@stoufa (Owner) left a comment:

EDA.ipynb

I believe you forgot to change column_name to a more meaningful name in the following code:

missing_value_df = pd.DataFrame({'column_name': ind1.columns,
                                 'percent_missing': percent_missing})

You can, for example, change it to numerical_measure (see https://www.britannica.com/science/statistics/Numerical-measures).
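For instance, the renamed cell might read (the exact name is just a suggestion):

missing_value_df = pd.DataFrame({'numerical_measure': ind1.columns,
                                 'percent_missing': percent_missing})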

In this cell:

import seaborn as sns
correlation_mat = france[["min","max","median","variance","type"]].corr()

sns.heatmap(correlation_mat, annot = True)

plt.show()

I can see that you are computing the correlation between the air quality metrics' distributions and the lockdown type. However, the code as it is right now considers all the species as a whole, which is not what we are looking for here: we should separate each species (pm10, humidity, o3, co, no2, so2, wind-speed, wind-gust, dew, ...) to know which ones to include in the training process and which ones to ignore. By the way, this is what you tried to do by plotting the data, but you didn't compute the separate correlations; a sketch follows below.
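A minimal sketch of per-species correlations, assuming france still holds one row per (Date, Specie) and that the lockdown type column is categorical:

import pandas as pd

# Encode the lockdown type as integers so it can enter a correlation matrix
france['type_code'] = pd.factorize(france['type'])[0]

for specie, group in france.groupby('Specie'):
    corr = group[['min', 'max', 'median', 'variance', 'type_code']].corr()
    # Correlation of each statistical measure with the lockdown type, per species
    print(specie, corr['type_code'].drop('type_code').round(2).to_dict())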

This remark applies to other plots besides this one:

# Plot time series dataset
ax = humidity.plot(linewidth=2, fontsize=12);

# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);

The labels of the x-axis are hard to read; find a way to format the Date column so it looks like a date. Also, I know that this notebook was pushed before our discussion about the distribution, but I would prefer that you plot the Probability Density Function (PDF) of the distribution instead of superimposing the statistical measures; that way, the plot will be cleaner and easier to read. See the sketch below.
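A minimal sketch of both suggestions, assuming humidity has a parseable Date column and a numeric median column:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

humidity['Date'] = pd.to_datetime(humidity['Date'])
# x_compat=True makes pandas use matplotlib's date ticks, so the formatter applies
ax = humidity.set_index('Date').plot(linewidth=2, fontsize=12, x_compat=True)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))  # readable date labels
ax.set_xlabel('Date')

# PDF of the distribution instead of superimposed summary statistics
humidity['median'].plot(kind='kde')
plt.show()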

I'm not sure why you wanted to read pm10 and no2 median values side by side; it would be great if you documented your reasoning a bit, to explain your thought process to the readers of your work (and to your future self, in case you need to look back at this notebook).

pm10_1.merge(no2_1[['no2', 'Date']], on = 'Date', how = 'left')

Same for the pivot table: why did you choose those particular columns, and what are you trying to do here? Even a one-line comment can be enough to describe your idea(s).

france_clean = france.pivot_table(index=['Date', 'Country', 'City','type'],columns='Specie',values='variance').reset_index().sort_values(['Country','City'])

Here, the date on the x-axis looks a lot better than earlier; however, I recommend turning the y-axis into a logarithmic scale (instead of the linear scale used by default), just to see both the big and small values, as in the sketch after this cell.

france_clean[france_clean.columns.drop(['Country', 'City', 'type'])].plot(figsize=(15, 6))
plt.show()
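A minimal sketch: pandas' plot accepts logy=True to switch the y-axis to a log scale:

france_clean[france_clean.columns.drop(['Country', 'City', 'type'])].plot(figsize=(15, 6), logy=True)
plt.show()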

I agree, a scatter plot might be a better alternative to the line plots used earlier.

px.scatter(data_frame=france_clean[france_clean['City']=='Paris'],x=france_clean[france_clean['City']=='Paris'].index,y='pm25',color='type')

And you already tried to plot a Kernel Density Estimate (KDE)? Nice to see that.

df_ozone.plot(kind='kde',figsize=[14,12]);

And then you trained an fb_prophet model to forecast the evolution of o3 in the French air. Now, focus on the dates and train two versions, one on the data before a lockdown and another one on the data after it, and see if there are any differences between the two; a sketch follows below.
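A minimal sketch, assuming df_ozone has been reshaped into Prophet's expected ds (date) and y (value) columns; the split date is an assumption (the first French national lockdown started on 2020-03-17):

from prophet import Prophet  # older releases: from fbprophet import Prophet

lockdown_start = '2020-03-17'  # assumption: first French national lockdown

before = df_ozone[df_ozone['ds'] < lockdown_start]
after = df_ozone[df_ozone['ds'] >= lockdown_start]

m_before = Prophet().fit(before)
m_after = Prophet().fit(after)

# Forecast the same horizon with both models and compare
future = m_after.make_future_dataframe(periods=30)
forecast_before = m_before.predict(future)
forecast_after = m_after.predict(future)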

Duplicate air_lockdown Notebooks

It looks like air lockdown.ipynb, air_lockdown.ipynb, and notebooks/air lockdown.ipynb are duplicates of the same notebook, which I already reviewed before. Keep only one of them and remove the other copies.

air_lockdown_filtering.ipynb

The air_lockdown_filtering.ipynb notebook downloads data and filters out all countries except France, but I don't see where in the code you exported/saved the france_lockdown.csv file. Documenting the processing is more important than the final result: when we have the processing script, we can run it again and get the result back (if it's not there already), but the other way around is hard to reconstruct (given the data, what processing has it been through?). See the sketch below.
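A minimal sketch of the missing export step, assuming the filtered frame is named france:

france.to_csv('france_lockdown.csv', index=False)  # persist the France-only subset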

web_scraping.ipynb

The web_scraping.ipynb notebook aims to scrape data from the COVID-19_lockdowns Wikipedia article. I saw that you started trying with BeautifulSoup and then switched to pd.read_html; I agree, in this particular case, using pandas is the best (and easiest) option.
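For the record, a minimal sketch of the pandas route; the table index below is an assumption:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/COVID-19_lockdowns'
tables = pd.read_html(url)  # one DataFrame per <table> on the page
df_nat = tables[1]          # assumption: pick the table holding national lockdowns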

I liked how you used a regular expression (regex) to remove the reference brackets from the scraped data:

df_nat_clean3 = df_nat_clean3.replace(to_replace=r'\[.*', value='', regex=True)
df_nat_clean3

@stoufa (Owner) left a comment:

Regarding data/processed/french_data.csv: make sure to avoid uploading it with the code, by including its path in the .gitignore file, so that you can keep working on it locally on your machine without removing it manually with each commit. A better approach is to document the process of acquiring and processing that file and, instead of uploading it, share it on Dropbox or Google Drive and include the download link. See the sketch below.
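A minimal sketch of the .gitignore entry (the path is taken from the comment above):

# keep processed data out of version control
data/processed/french_data.csv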
