The main goal of this workshop is to explore some of the Spyder IDE's core functionality for scientific programming. We will work on data visualization, analysis, and prediction using Python libraries like Pandas, Matplotlib, and Scikit-Learn on a dataset of historical weather observations from 2006 to 2016.
To start with this workshop, you will need to have Spyder installed in a Python 3 environment that contains at least the Numpy, Matplotlib, Pandas and Scikit-Learn libraries. We recommend that you download and install the Anaconda Python distribution, which contains all these libraries and more in a single place.
- If you are familiar with git, clone the Spyder-Workshop repository:
git clone https://github.com/juanis2112/Spyder-Workshop
Otherwise, you can download the workshop's contents from its GitHub repository page.
Then, launch Spyder via the start menu shortcut on Windows, or from Anaconda Navigator on Linux or Mac.
Open the workshop in Spyder as a project by clicking `Open Project` in the `Project` menu and navigating to the `Spyder-Workshop` directory. Finally, open the file `workshop.py` by double-clicking it in the `Project Explorer` pane on the left of the Spyder main window.
The first thing we need to do before starting our work is import the libraries necessary for our analysis and load the data in a way that makes it easy to explore.
- Import the libraries Matplotlib and Pandas:
import matplotlib.pyplot as plt
import pandas as pd
- Load the data from the CSV file to a Pandas DataFrame:
weather_data = pd.read_csv('data/weatherHistory.csv')
Now that we have our data and libraries ready, let's start by taking a look at the data that we have.
- Open the `weather_data` variable in the Variable Explorer pane by double-clicking its name. The Variable Explorer is located in the top right of the Spyder main window; you may need to click its tab to make it visible.
- Verify that the `Size` of `weather_data` in the Variable Explorer corresponds to the result of `len(weather_data)` in the IPython Console:
len(weather_data)
- Print the first three rows of the DataFrame to the IPython Console:
weather_data.head(3)
- Now, try printing the last three rows of the DataFrame.
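If you want to check your answer, `tail` mirrors `head`. A minimal sketch on a tiny synthetic DataFrame (the values are hypothetical stand-ins for the weather data):

```python
import pandas as pd

# Tiny synthetic stand-in for the weather DataFrame (hypothetical values)
weather_data = pd.DataFrame({'Temperature (C)': [9.4, 9.8, 10.1, 8.7, 7.2]})

# tail(3) returns the last three rows, mirroring head(3)
last_three = weather_data.tail(3)
print(last_three)
```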
A useful way to explore the data we are going to work with is to plot it. This is easy to do with the Pandas library, which we imported previously.
The first thing we want to do before plotting our data is to order the rows by date. Use the Variable Explorer to verify that the data is not ordered by default.
- Parse the date and create a new DataFrame with our data ordered by it:
weather_data['Formatted Date'] = pd.to_datetime(
weather_data['Formatted Date'].str[:-6])
weather_data_ordered = weather_data.sort_values(by='Formatted Date')
- In the Variable Explorer, right-click the old DataFrame `weather_data` to open its context menu and select `Remove` to delete it. From now on, we will work with our new variable `weather_data_ordered`.
Notice in the Variable Explorer that the DataFrame's index (the `Index` column on the left) is not in the order of the date.
Reset the index so its order matches that of `Formatted Date`, assigning the result back to the variable (by default, `reset_index` returns a new DataFrame rather than modifying the existing one):
weather_data_ordered = weather_data_ordered.reset_index(drop=True)
We also see that there are some qualitative variables, which can make our analysis more difficult. For this reason, we want to keep only the columns that give us numerical information and drop the categorical ones. Like `reset_index`, `drop` returns a new DataFrame, so we assign the result back:
weather_data_ordered = weather_data_ordered.drop(
    columns=['Summary', 'Precip Type', 'Loud Cover', 'Daily Summary'])
- Plot `Temperature (C)` versus `Formatted Date` to see how temperature changes over time:
weather_data_ordered.plot(
x='Formatted Date', y='Temperature (C)', color='red', figsize=(15, 8))
- Switch to the Plots pane, in the same top-right section of the interface as the Variable Explorer, to view your figure.
- Now, try plotting the temperature versus the date using only the data from 2006.
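One possible approach, sketched on a small synthetic DataFrame (hypothetical values): filter the rows with a boolean mask on the year before plotting.

```python
import pandas as pd

# Synthetic ordered data spanning two years (hypothetical values)
weather_data_ordered = pd.DataFrame({
    'Formatted Date': pd.to_datetime(['2006-01-01', '2006-07-01', '2007-01-01']),
    'Temperature (C)': [1.2, 22.5, 0.4],
})

# A boolean mask on the year keeps only the 2006 rows
data_2006 = weather_data_ordered[
    weather_data_ordered['Formatted Date'].dt.year == 2006]
data_2006.plot(x='Formatted Date', y='Temperature (C)', color='red')
```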
- Plot temperature and humidity versus the date in the same plot to examine how both variables change over time:
weather_data_ordered.plot(
subplots=True, x='Formatted Date', y=['Temperature (C)', 'Humidity'],
figsize=(15, 8))
- Now, try plotting different variables in the same plot for different years.
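For this exercise, you can combine the year filter from before with a list of columns for `y`. A sketch on synthetic (hypothetical) data:

```python
import pandas as pd

# Synthetic ordered data (hypothetical values)
weather_data_ordered = pd.DataFrame({
    'Formatted Date': pd.to_datetime(
        ['2006-03-01', '2006-09-01', '2007-03-01', '2007-09-01']),
    'Temperature (C)': [8.0, 18.0, 7.5, 19.0],
    'Humidity': [0.80, 0.60, 0.85, 0.55],
})

# Combine a year filter with a list of columns for y
mask = weather_data_ordered['Formatted Date'].dt.year == 2007
weather_data_ordered[mask].plot(
    subplots=True, x='Formatted Date', y=['Temperature (C)', 'Humidity'])
```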
The previous plots contained a lot of data, which makes it difficult to understand how our variables evolve over time.
For this reason, one of the things that we can do is group the information we have by year and plot the yearly values.
To do this, we have written a function in the `utils.py` file, in the same folder as your workshop, that creates a new column in the DataFrame containing the year, then groups the values by year and computes each variable's average for that year.
- Import the function from the `utils` module so you can use it in your script:
from utils import aggregate_by_year
- Use it to aggregate by year and plot the data:
weather_data_by_year = aggregate_by_year(
weather_data_ordered, date_column='Formatted Date')
- Try writing a function in the `utils.py` file that computes the monthly averages of the weather data and plots them.
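A minimal sketch of such a helper, assuming it mirrors the structure described for `aggregate_by_year` (the function name and data values here are hypothetical, and the real exercise would also plot the result):

```python
import pandas as pd

def aggregate_by_month(data, date_column):
    """Average the numeric columns of `data` by month.

    Hypothetical helper mirroring the structure described for
    aggregate_by_year; plotting is left as part of the exercise.
    """
    data = data.copy()
    data['Month'] = data[date_column].dt.month
    return data.groupby('Month').mean(numeric_only=True)

# Quick check on synthetic data (hypothetical values)
df = pd.DataFrame({
    'Formatted Date': pd.to_datetime(['2006-01-05', '2006-01-20', '2006-02-10']),
    'Temperature (C)': [2.0, 4.0, 6.0],
})
monthly = aggregate_by_month(df, 'Formatted Date')
print(monthly)  # January averages to 3.0, February to 6.0
```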
Now, we want to evaluate the relationships between the variables in our data set.
For this, we have written another function in `utils.py`.
- Import the new function:
from utils import aggregate_by_year, plot_correlations
- Plot the correlations between the variables and view the figure in the Plots pane:
plot_correlations(weather_data_ordered, size=15)
- Import the `plot_color_gradients()` function, which plots the colormap gradient to help you interpret your correlation plot:
from utils import aggregate_by_year, plot_correlations, plot_color_gradients
- Plot the colormap gradient using the function you imported:
plot_color_gradients(
cmap_category='Plot gradients convention', cmap_list=['viridis', ])
- Calculate the Pearson correlations between the different variables in our data set:
weather_correlations = weather_data_ordered.corr()
- Open the variable `weather_correlations` in the Variable Explorer to see the results.
- Print the correlation between humidity and temperature in the IPython Console:
weather_data_ordered['Temperature (C)'].corr(weather_data_ordered['Humidity'])
Verify it has the same value as in the `weather_correlations` DataFrame.
- Try calculating correlations between different variables and comparing them with the ones in the DataFrame.
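For any pair of columns, the `Series.corr` result should match the corresponding entry of the full `DataFrame.corr` matrix. A sketch on synthetic (hypothetical) data:

```python
import pandas as pd

# Synthetic numeric data (hypothetical values)
df = pd.DataFrame({
    'Humidity': [0.9, 0.7, 0.5, 0.3],
    'Wind Speed (km/h)': [10.0, 12.0, 9.0, 14.0],
    'Temperature (C)': [5.0, 10.0, 15.0, 20.0],
})

# The pairwise Pearson correlation of two columns...
pair = df['Temperature (C)'].corr(df['Wind Speed (km/h)'])
# ...matches the corresponding entry of the full correlation matrix
matrix = df.corr()
print(pair, matrix.loc['Temperature (C)', 'Wind Speed (km/h)'])
```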
Finally, we want to use our data to build a model that allows us to predict values for some of our variables. In the previous section, we saw that humidity and temperature are two of the most correlated variables, so we are going to use these two first.
We are going to use Scikit-Learn, which is a Python library that contains tools to explore data and build different types of predictive models.
- Import the two necessary objects for our data modeling:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
- A classic way to make a predictive model is to subdivide the data into two sets: training and test. The training data will help us to fit our predictive model, while the test data will play the role of future observations and give us an idea of how good our predictions are.
Scikit-Learn contains a built-in function to split your data:
x_train, x_test, y_train, y_test = train_test_split(
weather_data_ordered['Humidity'], weather_data_ordered['Temperature (C)'],
test_size=0.25)
- We will use linear regression in Scikit-Learn to make a linear model of our data. Create the model, then fit it with the weather data:
regression = linear_model.LinearRegression()
regression.fit(x_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
- Place the text cursor over `LinearRegression()` and press the `Inspect` shortcut (`Ctrl+I` by default on Windows/Linux, or `Cmd+I` on macOS) to view the documentation for this class in the Help pane.
- Print the coefficients of our regression:
print(regression.intercept_, regression.coef_) # beta_0, beta_1
Note that this means our model is a linear function $$y = \beta_0 + \beta_1 x$$, where temperature is a function of humidity.
- Now, we want to plot our model predictions versus our test data, to see how good our predictions were:
y_predict = regression.predict(x_test.values.reshape(-1, 1))
plt.scatter(x_test, y_test, c='red', label='Observations', s=1)
plt.scatter(x_test, y_predict, c='blue', label='Model')
plt.xlabel('Humidity')
plt.ylabel('Temperature (C)')
plt.legend()
plt.show()
- Using the `.predict()` method of our model, predict the temperature for a given level of humidity.
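A minimal sketch of how `predict` is called, using a toy model fitted to synthetic (hypothetical) humidity/temperature pairs; note that it expects a 2-D array with one row per observation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted to synthetic data (hypothetical humidity/temperature pairs)
humidity = np.array([[0.2], [0.4], [0.6], [0.8]])
temperature = np.array([25.0, 20.0, 15.0, 10.0])
model = LinearRegression().fit(humidity, temperature)

# predict() expects a 2-D array: one row per observation, one column per feature
predicted = model.predict(np.array([[0.5]]))
print(predicted)  # the fitted line gives 17.5 for humidity 0.5
```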
- Finally, we can numerically evaluate how good our model's predictions are. For this, we will use `explained_variance_score`, available in `sklearn.metrics`. This metric is calculated as $$1 - \frac{\mathrm{Var}(y_{real} - y_{model})}{\mathrm{Var}(y_{real})}$$, which means that the closer the value is to 1, the better our model.
We need to import the function that evaluates our model:
from sklearn.metrics import explained_variance_score
- Calculate the explained variance score and print it:
explained_variance_score(y_test, y_predict)
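To see where the number comes from, you can compute the formula above by hand and compare it with Scikit-Learn's result, here on small synthetic (hypothetical) arrays:

```python
import numpy as np
from sklearn.metrics import explained_variance_score

# Synthetic observed and predicted values (hypothetical)
y_real = np.array([3.0, 5.0, 7.0, 9.0])
y_model = np.array([2.5, 5.5, 7.5, 8.5])

# Manual computation of 1 - Var(y_real - y_model) / Var(y_real)
manual = 1 - np.var(y_real - y_model) / np.var(y_real)
score = explained_variance_score(y_real, y_model)
print(manual, score)  # both give 0.95 here
```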