
# Spyder Workshop

The main goal of this workshop is to explore some of the Spyder IDE's core functionality for scientific programming. We will work on data visualization, analysis, and prediction using Python libraries like Pandas, Matplotlib and Scikit-learn, applied to a dataset of historical weather observations from 2006 to 2016.

## Prerequisites

To follow this workshop, you will need Spyder installed in a Python 3 environment that includes at least the NumPy, Matplotlib, Pandas and Scikit-learn libraries. We recommend downloading and installing the Anaconda Python distribution, which bundles all these libraries and more in a single place.

## Project Set-Up

1. If you are familiar with Git, clone the Spyder-Workshop repository:

   ```bash
   git clone https://github.com/juanis2112/Spyder-Workshop
   ```

   Otherwise, you can download the contents of the workshop from the repository page.

2. Launch Spyder via the Start menu shortcut on Windows, or from Anaconda Navigator on Linux or macOS.

3. Open the workshop in Spyder as a project by clicking Open Project in the Project menu, and navigating to the Spyder-Workshop directory.

4. Open the file workshop.py by double-clicking it in the Project Explorer pane on the left of the Spyder main window.

## Importing Libraries and Data

Before starting our analysis, we need to import the necessary libraries and load the data in a way that makes it easy to explore.

1. Import the Matplotlib and Pandas libraries:

   ```python
   import matplotlib.pyplot as plt
   import pandas as pd
   ```

2. Load the data from the CSV file into a Pandas DataFrame:

   ```python
   weather_data = pd.read_csv('data/weatherHistory.csv')
   ```
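If you want to see how `read_csv` maps a text file onto a DataFrame without the real dataset at hand, here is a minimal sketch using an in-memory CSV. The column names mimic the real file, but the values are made up:

```python
import io

import pandas as pd

# Hypothetical miniature CSV with the same layout as weatherHistory.csv
csv_text = (
    "Formatted Date,Temperature (C),Humidity\n"
    "2006-04-01 00:00:00.000 +0200,9.47,0.89\n"
    "2006-04-01 01:00:00.000 +0200,9.36,0.86\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (2, 3): two rows, three columns
print(list(df.columns))  # the header row becomes the column names
```

The same call works identically whether the source is a file path or a file-like object, which makes it easy to experiment before loading the full dataset.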

## Exploring the Data

Now that our data and libraries are ready, let's take a look at what we have.

1. Open the weather_data variable in the Variable Explorer pane by double-clicking its name. The Variable Explorer is located in the top right of the Spyder main window; you may need to click its tab to make it visible.

2. Verify that the Size of weather_data in the Variable Explorer matches the result of len(weather_data) in the IPython Console:

   ```python
   len(weather_data)
   ```

3. Print the first three rows of the DataFrame in the IPython Console:

   ```python
   weather_data.head(3)
   ```

4. Now, try printing the last three rows of the DataFrame.
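To see what head and tail return without the full dataset, here is a sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for weather_data
df = pd.DataFrame({'Temperature (C)': [9.4, 9.3, 9.8, 10.1, 11.0]})
print(df.head(3))  # first three rows (index 0-2)
print(df.tail(3))  # last three rows (index 2-4)
```

Both methods return a new DataFrame rather than printing directly, so you can also assign the result to a variable and inspect it in the Variable Explorer.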

## Visualization

A useful way to explore the data we are going to work with is to plot it. This is easy to do with Pandas, which we imported previously.

Before plotting our data, we want to order the rows by date. Use the Variable Explorer to verify that the data is not ordered by default.

1. Parse the date and create a new DataFrame with our data ordered by it:

   ```python
   weather_data['Formatted Date'] = pd.to_datetime(
       weather_data['Formatted Date'].str[:-6])
   weather_data_ordered = weather_data.sort_values(by='Formatted Date')
   ```

2. In the Variable Explorer, right-click the old DataFrame weather_data to open its context menu and select Remove to delete it. From now on, we will work with the new variable weather_data_ordered.

3. Notice in the Variable Explorer that the DataFrame's index (the Index column on the left) is no longer sequential after sorting. Reset the index so its order matches that of Formatted Date; since reset_index returns a new DataFrame, we assign the result back:

   ```python
   weather_data_ordered = weather_data_ordered.reset_index(drop=True)
   ```

4. We also see that there are some qualitative variables, which can make our analysis more difficult. For this reason, we want to stick to the columns that give us numerical information and drop the categorical ones ('Loud Cover' is the column's actual, misspelled name in this dataset). Like reset_index, drop returns a new DataFrame, so we assign the result back:

   ```python
   weather_data_ordered = weather_data_ordered.drop(
       columns=['Summary', 'Precip Type', 'Loud Cover', 'Daily Summary'])
   ```

5. Plot Temperature (C) versus Formatted Date to see how temperature changes over time:

   ```python
   weather_data_ordered.plot(
       x='Formatted Date', y='Temperature (C)', color='red', figsize=(15, 8))
   ```

6. Switch to the Plots pane, in the same top-right section of the interface as the Variable Explorer, to view your figure.

7. Now, try plotting the temperature versus the date using only the data from 2006.

8. Plot temperature and humidity versus the date in the same figure to examine how both variables change over time:

   ```python
   weather_data_ordered.plot(
       subplots=True, x='Formatted Date', y=['Temperature (C)', 'Humidity'],
       figsize=(15, 8))
   ```

9. Now, try plotting different variables in the same plot for different years.
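The parse-sort-reset-drop sequence above can be sketched end to end on a tiny, out-of-order DataFrame (made-up values, same column names):

```python
import pandas as pd

# Toy, out-of-order data standing in for the weather DataFrame
df = pd.DataFrame({
    'Formatted Date': ['2006-04-01 02:00:00.000 +0200',
                       '2006-04-01 00:00:00.000 +0200',
                       '2006-04-01 01:00:00.000 +0200'],
    'Temperature (C)': [9.8, 9.4, 9.3],
    'Summary': ['Cloudy', 'Rain', 'Rain'],
})
# Strip the trailing ' +0200' offset (last six characters), then parse
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'].str[:-6])
# Sort by date, renumber the index, and drop the categorical column
df = df.sort_values(by='Formatted Date').reset_index(drop=True)
df = df.drop(columns=['Summary'])
print(df['Temperature (C)'].tolist())  # [9.4, 9.3, 9.8], now in date order
```

Chaining the calls and assigning the result back keeps a single, consistently ordered DataFrame, which is the same effect as the step-by-step assignments above.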

## Data Summarization and Aggregation

The previous plots contained a lot of data, which makes it difficult to understand the evolution of our variables through time. One thing we can do about this is group the information by year and plot the yearly values. To do this, we have written a function in the utils.py file, in the same folder as your workshop, which creates a new column in the DataFrame containing the year, groups values by it, and computes the average of each variable per year.

1. Import the function from the utils module, so you can use it in your script:

   ```python
   from utils import aggregate_by_year
   ```

2. Use it to aggregate the data by year and plot it:

   ```python
   weather_data_by_year = aggregate_by_year(
       weather_data_ordered, date_column='Formatted Date')
   ```

3. Try writing a function in the utils.py file that gets the averages of the weather data by month and plots them.
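The workshop's aggregate_by_year lives in utils.py; to see roughly what such a function does (this is a sketch of the core idea, not the workshop's exact implementation), the grouping step can be reduced to:

```python
import pandas as pd

# Made-up data: two observations per year
df = pd.DataFrame({
    'Formatted Date': pd.to_datetime(
        ['2006-01-15', '2006-07-15', '2007-01-15', '2007-07-15']),
    'Temperature (C)': [0.0, 20.0, 2.0, 22.0],
})
# Extract the year into its own column, then average within each year
df['Year'] = df['Formatted Date'].dt.year
yearly_means = df.groupby('Year')['Temperature (C)'].mean()
print(yearly_means.to_dict())  # {2006: 10.0, 2007: 12.0}
```

The same pattern with `.dt.month` instead of `.dt.year` is a good starting point for the monthly-averages exercise.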

## Data Analysis and Interpretation

Now, we want to evaluate the relationships between the variables in our data set. For this, we have written another function in utils.py.

1. Import the new function:

   ```python
   from utils import aggregate_by_year, plot_correlations
   ```

2. Plot the correlations between the variables and view the figure in the Plots pane:

   ```python
   plot_correlations(weather_data_ordered, size=15)
   ```

3. Import the plot_color_gradients() function, which will help you plot the colormap gradient needed to interpret your correlation plot:

   ```python
   from utils import aggregate_by_year, plot_correlations, plot_color_gradients
   ```

4. Plot the colormap gradient using the function you imported:

   ```python
   plot_color_gradients(
       cmap_category='Plot gradients convention', cmap_list=['viridis', ])
   ```

5. Calculate the Pearson correlations between the different variables in our data set (on recent Pandas versions, you may need to pass numeric_only=True if any non-numeric columns remain):

   ```python
   weather_correlations = weather_data_ordered.corr()
   ```

6. Open the variable weather_correlations in the Variable Explorer to see the results.

7. Print the correlation between humidity and temperature in the IPython Console, and verify it has the same value as the corresponding entry in the weather_correlations DataFrame:

   ```python
   weather_data_ordered['Temperature (C)'].corr(weather_data_ordered['Humidity'])
   ```

8. Try calculating correlations between different variables and comparing them with the ones in the DataFrame.
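To see the relationship between the full correlation matrix and a single pairwise correlation, here is a sketch on made-up, perfectly anti-correlated columns:

```python
import pandas as pd

# Toy columns: temperature decreases linearly as humidity increases
df = pd.DataFrame({'Humidity': [0.2, 0.5, 0.8],
                   'Temperature (C)': [30.0, 20.0, 10.0]})
corr_matrix = df.corr()  # full Pearson correlation matrix
pair = df['Temperature (C)'].corr(df['Humidity'])  # one pairwise value
print(round(pair, 6))  # -1.0 for a perfect negative linear relation
# The pairwise value matches the corresponding matrix entry
print(round(corr_matrix.loc['Temperature (C)', 'Humidity'], 6))
```

This is why step 7's pairwise check against weather_correlations works: `Series.corr` and `DataFrame.corr` compute the same Pearson coefficient.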

## Data Modeling and Prediction

Finally, we want to use our data to construct a model that allows us to predict values for some of our variables. In the previous section, we saw that humidity and temperature are two of the most correlated variables, so we will start with these two.

We are going to use Scikit-learn, a Python library that contains tools to explore data and build different types of predictive models.

1. Import the two objects we need for our data modeling:

   ```python
   from sklearn import linear_model
   from sklearn.model_selection import train_test_split
   ```

2. A classic way to build a predictive model is to subdivide the data into two sets: training and test. The training data helps us fit our predictive model, while the test data plays the role of future observations and gives us an idea of how good our predictions are. Scikit-learn contains a built-in function to split your data:

   ```python
   x_train, x_test, y_train, y_test = train_test_split(
       weather_data_ordered['Humidity'], weather_data_ordered['Temperature (C)'],
       test_size=0.25)
   ```

3. We will use Scikit-learn's linear regression to make a linear model of our data. Create the model, then fit it with the weather data:

   ```python
   regression = linear_model.LinearRegression()
   regression.fit(x_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
   ```

4. Place the text cursor over LinearRegression() and press the Inspect shortcut (Ctrl+I by default on Windows/Linux, or Cmd-I on macOS) to get the documentation of this function in the Help pane.

5. Print the coefficients of our regression:

   ```python
   print(regression.intercept_, regression.coef_)  # beta_0, beta_1
   ```

   Note that this means our model is the linear function $$y = \beta_0 + \beta_1 x$$, in which temperature is a function of humidity.
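To convince yourself that fit really recovers $$\beta_0$$ and $$\beta_1$$, here is the same split-and-fit workflow on synthetic data generated from a known line, y = 2 + 3x (made-up data, not the weather dataset):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic data lying exactly on the line y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * x

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

regression = linear_model.LinearRegression()
regression.fit(x_train.reshape(-1, 1), y_train)
# The fitted coefficients recover the true line (up to floating point)
print(regression.intercept_, regression.coef_)  # ~2.0 and [~3.0]
```

With noise-free data the recovered coefficients match the true line almost exactly; the weather data, of course, is noisy, so the fit there only approximates the underlying relationship.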

## Predictive Model Testing and Evaluation

1. Now, we want to plot our model's predictions against our test data, to see how good our predictions are:

   ```python
   y_predict = regression.predict(x_test.values.reshape(-1, 1))
   plt.scatter(x_test, y_test, c='red', label='Observation', s=1)
   plt.scatter(x_test, y_predict, c='blue', label='Model')
   plt.xlabel('Humidity')
   plt.ylabel('Temperature (C)')
   plt.legend()
   plt.show()
   ```

2. Using the .predict() method of our model, predict the temperature for a given level of humidity.

3. Finally, we can numerically evaluate how good our model's predictions are. For this, we will use the explained_variance_score metric available in sklearn.metrics. It is calculated as $$1 - \mathrm{Var}(y_{\mathrm{real}} - y_{\mathrm{model}}) / \mathrm{Var}(y_{\mathrm{real}})$$, which means that the closer the value is to 1, the better our model. Import the function that evaluates our model:

   ```python
   from sklearn.metrics import explained_variance_score
   ```

4. Calculate the explained variance score and print it:

   ```python
   explained_variance_score(y_test, y_predict)
   ```
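The two extremes of the explained-variance formula can be checked directly on made-up values: perfect predictions score 1, and a model that only ever predicts the mean scores 0.

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_test = np.array([1.0, 2.0, 3.0, 4.0])
y_perfect = y_test.copy()                # perfect predictions
y_constant = np.full(4, y_test.mean())   # always predicts the mean

# Var(y - y_perfect) = 0, so the score is 1
print(explained_variance_score(y_test, y_perfect))
# Var(y - y_constant) = Var(y), so the score is 0
print(explained_variance_score(y_test, y_constant))
```

Your regression's score on the weather data should land between these two extremes; the closer to 1, the more of the temperature's variance the humidity-based model explains.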