progress #1
base: main
Conversation
Good job overall. I will merge this PR as soon as you update this notebook as explained in this review's comments.
df1 = pd.read_csv("data.csv", low_memory=False, skiprows=4, parse_dates=True, index_col=["Country","Date"])
"data.csv" refers to what period (2020Q1, 2020Q2, 2020Q3, 2020Q4, 2019Q1, 2019Q2, ...)? Please specify this in a comment above this line or in a markdown cell above this one.
df_ug = pd.read_csv("Ugandacovid.csv", low_memory=False, parse_dates=True)
Did you filter out Uganda's data manually or using code? In the latter case, please add that code to better document the process.
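For reference, the filtering step could be documented with a few lines like the following sketch. The column name `Country` and the value `UG` are assumptions here (they depend on how the worldwide file encodes countries), and the toy data stands in for the real download:

```python
import pandas as pd

# Sketch (assumed file/column names): filter the worldwide dataset down to
# Uganda and keep the subset, so the filtering step is reproducible.
df = pd.DataFrame({
    "Country": ["UG", "FR", "UG"],
    "Specie": ["temperature", "pm25", "humidity"],
    "median": [24.0, 12.5, 70.0],
})
df_ug = df[df["Country"] == "UG"].reset_index(drop=True)
# df_ug.to_csv("Ugandacovid.csv", index=False)  # persist the subset
```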
plt.figure(figsize=[12,8])
plt.plot(df_ug[df_ug['Specie']=='temperature'].loc[:30,'Date'], df_ug[df_ug['Specie']=='temperature'].loc[:30,'median'])
You already started visualizing the data, great job. Starting with the median is a good choice; I would recommend visualizing more statistical measures (on the same plot) next. You can, for example, use a box plot to visualize the min, max, median, and variance values altogether. Here's an example:
https://stackoverflow.com/questions/33328774/box-plot-with-min-max-average-and-standard-deviation/33330997
EDA.ipynb
I believe you forgot to change column_name to a more meaningful name in the following code:
missing_value_df = pd.DataFrame({'column_name': ind1.columns,
'percent_missing': percent_missing})
You can, for example, change it to numerical measure:
https://www.britannica.com/science/statistics/Numerical-measures
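Concretely, the rename is a one-word change in the dict key (toy `ind1` used here as a stand-in for the real frame):

```python
import pandas as pd

# Sketch (toy data): same summary as above, with 'column_name' renamed to
# the more meaningful 'numerical measure'.
ind1 = pd.DataFrame({"min": [1.0, None], "max": [2.0, 3.0]})
percent_missing = ind1.isnull().sum() * 100 / len(ind1)
missing_value_df = pd.DataFrame({"numerical measure": ind1.columns,
                                 "percent_missing": percent_missing.values})
```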
In this cell:
import seaborn as sns
correlation_mat = france[["min","max","median","variance","type"]].corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()
I can see that you are computing the correlation between the air quality metrics distributions and the lockdown type. However, the code as it is right now considers all the species as a whole, which is not what we are looking for here; we should separate each species (pm10, humidity, o3, co, no2, so2, wind-speed, wind-gust, dew, ...) to know which ones to include in the training process and which ones to ignore. By the way, this is what you tried to do by plotting the data, but you didn't compute the separate correlations.
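One way to get per-species correlations is a `groupby` on `Specie`. This is a sketch on made-up data, and it assumes the lockdown `type` has been numerically encoded (e.g. 0 = none, 1 = partial, 2 = full), which the notebook would need to do first:

```python
import pandas as pd

# Sketch (toy data): compute the correlation between 'median' and the
# lockdown 'type' separately for each Specie, instead of pooling them.
france = pd.DataFrame({
    "Specie": ["pm10", "pm10", "pm10", "no2", "no2", "no2"],
    "median": [40.0, 30.0, 20.0, 25.0, 18.0, 12.0],
    "type":   [0, 1, 2, 0, 1, 2],  # assumed numeric encoding of lockdown type
})
per_species_corr = (
    france.groupby("Specie")
          .apply(lambda g: g["median"].corr(g["type"]))
)
print(per_species_corr)
```

The result is one correlation coefficient per species, which is exactly what is needed to decide which species to keep for training.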
This remark applies to other plots besides this one:
# Plot time series dataset
ax = humidity.plot(linewidth=2, fontsize=12);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
The labels of the x-axis are hard to read; find a way to format the Date column so it looks like a date. Also, I know this notebook was pushed before our discussion about the distribution, but I would prefer that you plot the Probability Density Function (PDF) of the distribution instead of superimposing the statistical measures; that way, the plot will be cleaner and easier to read.
I'm not sure why you wanted to read pm10 and no2 median values side by side; it would be great if you documented your reasoning a bit to explain your thought process for the readers of your work (and for you in the future, in case you need to look back at this notebook).
pm10_1.merge(no2_1[['no2', 'Date']], on = 'Date', how = 'left')
Same for the pivot table: why did you choose those particular columns, and what are you trying to do here? Even a one-line comment can be enough to describe your idea(s).
france_clean = france.pivot_table(index=['Date', 'Country', 'City','type'],columns='Specie',values='variance').reset_index().sort_values(['Country','City'])
Here, the date in the x-axis looks a lot better than earlier. However, I recommend turning the y-axis to a logarithmic scale (instead of the linear scale used by default), just to see both the big and the small values:
france_clean[france_clean.columns.drop(['Country', 'City', 'type'])].plot(figsize=(15, 6))
plt.show()
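The switch to a log scale is a single keyword argument in pandas plotting (`logy=True`); a minimal sketch on toy data with very different magnitudes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt
import pandas as pd

# Sketch (toy data): logy=True switches the y-axis to a logarithmic scale
# so that species with large and small values stay visible on one plot.
france_clean = pd.DataFrame({
    "pm25": [5.0, 8.0, 6.0],
    "co":   [500.0, 900.0, 700.0],
})
ax = france_clean.plot(figsize=(15, 6), logy=True)
plt.savefig("log_scale.png")
```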
I agree, a scatter plot might be a better alternative to the line plots used earlier:
px.scatter(data_frame=france_clean[france_clean['City']=='Paris'],x=france_clean[france_clean['City']=='Paris'].index,y='pm25',color='type')
And you already tried to plot a Kernel Density Estimate (KDE)? Nice to see that:
df_ozone.plot(kind='kde',figsize=[14,12]);
And then you trained an fb_prophet model to forecast the evolution of o3 in the French air. Now, focus on the dates and train two versions: one on data from before a lockdown and another on data from after it, and see if there are any differences between the two.
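The before/after split could look like the sketch below. The lockdown date and the toy series are assumptions, and the Prophet calls are left as comments since the point here is the split (Prophet expects a frame with `ds`/`y` columns):

```python
import pandas as pd

# Sketch (assumed lockdown date, toy o3 series): split the series at the
# first lockdown and prepare one training frame per period; each would then
# be fitted with its own fb_prophet model.
lockdown_start = pd.Timestamp("2020-03-17")  # assumed French lockdown start
df_ozone = pd.DataFrame({
    "ds": pd.to_datetime(["2020-03-01", "2020-03-10", "2020-03-20", "2020-04-01"]),
    "y":  [30.0, 28.0, 35.0, 40.0],
})
before = df_ozone[df_ozone["ds"] < lockdown_start]
after = df_ozone[df_ozone["ds"] >= lockdown_start]
# from prophet import Prophet
# m_before, m_after = Prophet().fit(before), Prophet().fit(after)
# ...then compare the two forecasts over the same horizon.
```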
Duplicate air_lockdown notebooks
It looks like air lockdown.ipynb, air_lockdown.ipynb, and notebooks/air lockdown.ipynb are duplicates of the same notebook, which I already reviewed before. Only keep one of them and remove the other copies.
air_lockdown_filtering.ipynb
The air_lockdown_filtering.ipynb notebook downloads data and filters out all countries except France, but I don't see where in the code you exported/saved the france_lockdown.csv file. Documenting the processing is more important than the final result: when we have the processing script, we can run it again and regenerate the result (if it's not there already), but the other way around is hard to predict (given the data, what processing has it been through?).
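The missing export is a one-liner with `to_csv`; a sketch with assumed column names and a toy frame standing in for the filtered data:

```python
import pandas as pd

# Sketch (assumed frame/filename): after filtering the worldwide data down
# to France, persist the result so the notebook documents the full pipeline.
france_lockdown = pd.DataFrame({
    "Date": ["2020-03-17"], "Country": ["FR"], "Specie": ["pm25"], "median": [12.5],
})
france_lockdown.to_csv("france_lockdown.csv", index=False)
reloaded = pd.read_csv("france_lockdown.csv")  # sanity check the round trip
```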
web_scraping.ipynb
The web_scraping.ipynb notebook aims to scrape data from the COVID-19_lockdowns Wikipedia article. I saw that you started with BeautifulSoup and then switched to pd.read_html; I agree that in this particular case, using pandas is the best (and easiest) option.
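For readers unfamiliar with the approach, `pd.read_html` returns one DataFrame per `<table>` it finds, which is why it beats hand-rolled BeautifulSoup parsing here. This sketch runs on an inline HTML snippet rather than the live Wikipedia page (and requires an HTML parser such as lxml installed):

```python
import io
import pandas as pd

# Sketch (inline HTML instead of the live article): read_html parses every
# <table> into a DataFrame, headers taken from the <th> row.
html = """
<table>
  <tr><th>Country</th><th>Start date</th></tr>
  <tr><td>France</td><td>2020-03-17[1]</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))  # list with one DataFrame per table
df_nat = tables[0]
```

Note the `[1]` reference bracket in the scraped value, which is exactly what the regex cleanup below removes.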
I liked how you used a regular expression (regex) to remove the reference brackets from the scraped data:
df_nat_clean3 = df_nat_clean3.replace(to_replace ='\[.*', value = '', regex = True)
df_nat_clean3
Regarding the data/processed/french_data.csv file, make sure to avoid uploading it with the code, by including its path in the .gitignore file so that you can keep working on it locally on your machine without removing it manually with each commit. A better approach is to document the process of acquiring and processing that file; instead of uploading it, I suggest sharing it on Dropbox or Google Drive and including the download link.
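For reference, the corresponding .gitignore entry would simply be the file's path (or a pattern covering the whole processed-data folder):

```
# keep generated datasets out of version control
data/processed/french_data.csv
```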
This is our progress so far.