incremental changes to ees project

HeardLibrary · Mar 11, 2024 · 3132765 · 3132765
1 parent bee90d5
commit 3132765
Showing 1 changed file with 38 additions and 71 deletions.
diff --git a/script/codegraf/ees_project/index.md b/script/codegraf/ees_project/index.md
@@ -16,7 +16,7 @@ The learner will:
 
 ## Overall goals
 
-We have monthly average climate data in tabular form acquired from the National Centers for Environmental Information <https://www.ncdc.noaa.gov/cdo-web/> for a number of locations around the U.S. We will be analyzing data for Mesa, Arizona from 1896 through 2017. The data have been extracted and stored in GitHub [here](https://github.com/HeardLibrary/digital-scholarship/blob/master/data/codegraf/mesa2880172.csv). The data look like this:
+Monthly average climate data in tabular form are available from the National Centers for Environmental Information <https://www.ncdc.noaa.gov/cdo-web/> for a number of locations around the U.S. We will be analyzing data for Mesa, Arizona from 1896 through 2017. The data have been extracted and stored in GitHub [here](https://github.com/HeardLibrary/digital-scholarship/blob/master/data/codegraf/mesa2880172.csv). The data look like this:
 
 ![climate data table example](input_table.png)
 
@@ -41,7 +41,7 @@ Much of the data wrangling code can be reused with modification after you comple
 
 # Tasks and subtasks
 
-1 Acquire data and wrangle date<br/>
+1 Acquire data and wrangle the dates<br/>
 1\.1 Use the `pd.read_csv()` function to load CSV from URL to a pandas DataFrame.<br/>
 1\.2 Split the YYYY-MM date strings into separate year and month columns<br/>
 1\.3 Create a list of intervals for the desired time range<br/>
@@ -56,122 +56,85 @@ Much of the data wrangling code can be reused with modification after you comple
 2\.3 Turn the table into a pandas DataFrame<br/>
 
 3 Visualize data<br/>
-3\.1 Set up subplot<br/>
-3\.1.1 Create subplot<br/>
-3\.1.2 Label axes<br/>
+3\.1 Create subplot<br/>
 3\.2 Plot data using appropriate style (scatterplot, bar, error bars)<br/>
 3\.3 Add trendline if appropriate<br/>
 3\.3.1 Fit linear polynomial to summary data<br/>
 3\.3.2 Add polynomial data to plot<br/>
+3\.4 Label axes<br/>
 
 ## Task details
 
-For particular tasks, it is best to start with the narrowest subtasks and work your way to the broader tasks.
+1 **Acquire data.**
+
+The URL for loading the CSV of raw data is: <https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/codegraf/mesa2880172.csv>
 
-**1 Acquire data**
+1\.1 Use the `pd.read_csv()` function to load the CSV from URL to a pandas DataFrame
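
As a rough sketch of this step, the call might look like the following. The sample rows are made up so that the example runs without network access, and only the `DATE` and `PRCP` columns are shown; the real file is assumed to have more columns.

```python
import io
import pandas as pd

url = 'https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/codegraf/mesa2880172.csv'

# With network access, the real data would be loaded like this:
# climate_data = pd.read_csv(url)

# Stand-in for the remote file (made-up values) so the sketch runs offline.
sample_csv = io.StringIO(
    'DATE,PRCP\n'
    '1896-01,0.52\n'
    '1896-02,\n'
    '1897-01,1.10\n'
)
climate_data = pd.read_csv(sample_csv)

# Empty cells are read in as NaN rather than as empty strings.
print(climate_data)
```

Note that `pd.read_csv()` accepts either a URL string or a file-like object, which is why the same call works for the stand-in data.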
 
-The raw data will be provided as a file available from GitHub using a URL given to you. 
-
-**1\.1 Use `pd.read_csv()()` to load the CSV from URL to a pandas DataFrame**
-
-
-
-**2 Calculate means for desired quantity (rainfall or temperature)**
-
-Start with one factor and interval (mean rainfall by year) and after completing the code for that combination, change the code to handle different intervals and factors. 
-
-The overall structure of the code in this part of the program will be a little complicated since it involves two nested `for` loops. It will look something like this:
+1\.2 In order to pull out the mean temperatures for a particular month or year, we need "grouping variable" columns. We can generate these by splitting the date strings in the `DATE` column into separate year and month columns. If we were using the procedural approach, we could step through each row and use the `.split()` method to split the date string into a list of strings. However, it is simpler to use the vectorized approach and use the `.str` attribute of the column to apply the `.slice()` method to the entire column at once. The expression `climate_data['DATE'].str.slice(0, 4)` gets the year, which consists of the first four characters of the string. To assign this expression to a new column in the DataFrame called "YEAR", we can use this code:
 
 ```
-# Create empty lists
-for timepoint in range_of_time: # do the inner loop for each timepoint
-    for month in climate_data: # go through every month to see if its datum needs to be included
-        # Perform a screen for each timepoint.
-        # Sum each datum that qualifies.
-    # Perform calculations using the sum of included data and append the calculated values to lists.
+climate_data['YEAR'] = climate_data['DATE'].str.slice(0, 4)
 ```
 
-We will build the inner loop first for a hard-coded timepoint value (a certain year or month), then build the outer loop to step through all timepoints. 
-
-**2\.1 Step through all data in column for the quantity, then sum for period to be averaged**
+The code for the month is similar. Note that since the date is a string, the slices will also be strings, even though they look like numbers.
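
Here is one way the year and month slices could look side by side, using a few made-up dates. The column name `MONTH` is our own choice for illustration, not something the project requires.

```python
import pandas as pd

# Made-up rows in the same YYYY-MM format as the DATE column.
climate_data = pd.DataFrame({'DATE': ['1896-01', '1896-12', '2017-05']})

climate_data['YEAR'] = climate_data['DATE'].str.slice(0, 4)   # first four characters
climate_data['MONTH'] = climate_data['DATE'].str.slice(5, 7)  # two characters after the hyphen

print(climate_data)
print(type(climate_data['YEAR'][0]))  # the values are strings, not integers
```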
 
-This will be a loop that includes all of the rows in the table. Because the data structure is a list of dictionaries, the indices of the cells you will check will be in the form: `data[row][columm]`, where the row index is a number and the column is specified by a key. However, if you are looping through the rows (as in `for datum in data:`), the variable you will be working with is `datum[column]`. For any given script, the column key will be fixed. For example, if you are using precipitation (which has the column header `PRCP`), you would write `datum['PRCP']`.
+1\.3 **Create a list of years.** It would be possible to calculate the mean precipitation for each year and generate an output table in a single vectorized expression. However, that involves more complicated pandas than we have learned. Instead, we will use a loop to step through each year and calculate the mean precipitation for that year.
 
-**2\.1.1 Extract year or month from date string if necessary**
+In order to do that, we need to create a list of years to loop through. We can use the `range()` function to generate a list of years from 1896 to 2017 by starting with an empty list, then looping through `range(1896, 2018)` and appending each year to the list.
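
A minimal sketch of building the list; the variable name `years` is our own choice.

```python
years = []  # start with an empty list
for year in range(1896, 2018):  # the stop value (2018) is not included
    years.append(year)

print(years[0], years[-1], len(years))  # 1896 2017 122
```

The items in the list are integers, which matters later when they are compared with the strings in the `YEAR` column.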
 
-The date strings are in ISO 8601 format: `2022-03` in the order of year-month. To get the year, you need to slice the first four characters. However, the loop variable is an integer, so to compare the loop variable to the extracted year you need to either make them both strings or both integers. For months, you also need to slice the date string, but if you are looping through a list of strings for the months, you can make the comparison directly.
+2 **Calculate the mean rainfall for each year.**
 
-**2\.1.2 Screen whether a particular datum is from the correct time interval**
+The strategy we will use is to create a list of dictionaries to represent a table. Each dictionary will represent a row for a year, with the keys being the column names "year" and "precipitation". We will then use the `pd.DataFrame()` function to turn the list of dictionaries into a DataFrame.
 
-Both of these tasks will require `if` statements. The `if` statements need to be nested so that both requirements must be met (the datum must be in the interval and the datum must not be missing) in order for the datum to be added to the sum. So you can't use `if ... elif...` to do the screen.
+2\.1 Create an empty list to use to build the table.
 
-For now, just hard code a particular year (like `'1950'`) or month (like `'05'`) to screen for. Later we will create another loop to do the screening for every time point in the visualization.
+2\.2 **Step through all of the years and calculate the mean precipitation for each year.**
 
-**2\.1.3 Skip missing data**
+Set up a `for` loop to step through each year in the list of years you created. 
 
-The way that the data are read in by the `read_dicts_from_github_csv()` function, missing data (empty cells) have a value of empty string (`''`), so that's what you need to use in your boolean comparison.
+2\.2.1 **Slice the DataFrame to include only the iterated year**
 
-**2\.1.4 Add screened data to sum**
-
-Recall that you can add a number to a sum using this type of code:
+We can slice a DataFrame by imposing a boolean condition on a particular column. For example, to slice the DataFrame to include only the rows where the year is 1896, we can use this expression:
 
 ```
-sum = sum + new_number
+climate_data[climate_data['YEAR'] == '1896']
 ```
 
-Prior to adding the first number you need to assign a value of zero to `sum`. This can be shortened to:
+where `climate_data['YEAR'] == '1896'` is a vectorized boolean condition that is applied to the entire column. The result is a slice of the DataFrame that includes only the rows where the condition is true.
+
+We will have to be a little careful here because the year in the YEAR column that we created in step 1.2 is a string, while the list of years we are looping through is a list of integers. We can convert the year to a string in the condition by using the `str()` function:
 
 ```
-sum += new_number
+climate_data[climate_data['YEAR'] == str(year)]
 ```
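
To see what the condition does, it can help to look at the boolean values separately from the slice. This sketch uses made-up rows; the names `mask` and `year_slice` are our own.

```python
import pandas as pd

# Made-up rows; YEAR holds strings, as created in step 1.2.
climate_data = pd.DataFrame({
    'YEAR': ['1896', '1896', '1897'],
    'PRCP': [0.5, 1.5, 2.0],
})

year = 1896  # an integer, like the items in the list of years
mask = climate_data['YEAR'] == str(year)  # one True/False value per row
year_slice = climate_data[mask]           # keeps only the rows where mask is True

print(mask.tolist())    # [True, True, False]
print(len(year_slice))  # 2
```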
 
-which achieves exactly the same result.  Note that both `sum` and `new_number` must be numbers (in this case `float`). If both are strings, the code above will concatenate (join end-to-end) the strings rather than add. If only one is a number, the code will throw an error.
-
-**2\.1.5 Count data that were summarized (exclude missing data)**
+2\.2.2 **Calculate the mean for the slice.**
 
-As we add monthly values to the sum, we also need to keep track of how many data we have summed. For months, we can't just say that there were 12 because if there were a missing month's data we would only have 11. Recall that to count, we can use code similar to the examples above, except that instead of adding any number to the sum, we add one:
-
-```
-count = count + 1
-```
+We learned in the lesson on pandas DataFrames that the `.mean()` method could be used to calculate the mean of a column as a vectorized operation. So we can use the `.mean()` method on the slice of the DataFrame to calculate the mean precipitation for that year.
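
A small sketch of the calculation, assuming the precipitation column is named `PRCP` and using made-up values. By default `.mean()` skips missing values, and it returns `NaN` when there is nothing left to average.

```python
import numpy as np
import pandas as pd

# Made-up slice for one year with one missing month.
year_slice = pd.DataFrame({'PRCP': [0.5, np.nan, 1.5]})
mean_precip = year_slice['PRCP'].mean()  # the NaN is skipped: (0.5 + 1.5) / 2

# A slice with no rows at all has no data to average, so the mean is NaN.
empty_slice = year_slice[year_slice['PRCP'] > 100]
empty_mean = empty_slice['PRCP'].mean()

print(mean_precip)
print(empty_mean)
```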
 
-or equivalently:
+2\.2.3 We are now ready to create the dictionary for the year row. However, we don't want to add a row to the table if there is no data for that year. We can use an `if` clause to skip the year if there is no data, which will be indicated if the calculated mean produced a NumPy `NaN` (Not a Number) value. The test for `NaN` is `np.isnan()`, so the `if` expression will look like this:
 
 ```
-count += 1
+if not np.isnan(mean_precip):
 ```
 
-**2\.2 Calculate the mean from the sum**
+2\.2.4 **Add the calculated data to the table**
 
-The screening code we wrote above will be inside a loop that steps through each of the months in the dataset. After that loop finishes, we need to calculate the average for the time interval over which we did the summing, using the sum and count we were building inside the loop.
+If the mean precipitation is not `NaN`, we can create a dictionary with the key `year` and the `year` loop value, and the key `precipitation` and the calculated mean precipitation. We can then append this dictionary to the list of dictionaries.
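
The test and the append might be combined as in this sketch. The (year, mean) pairs are hypothetical stand-ins for values that would be calculated inside the loop, and the list name `table` is our own choice.

```python
import numpy as np

table = []  # list of dictionaries, one dictionary per row

# Hypothetical (year, mean) pairs standing in for values computed in the loop.
for year, mean_precip in [(1896, 0.75), (1897, np.nan), (1898, 1.2)]:
    if not np.isnan(mean_precip):  # skip years with no data
        table.append({'year': year, 'precipitation': mean_precip})

print(table)
```

In this example the 1897 row is skipped, so the list ends up with two dictionaries.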
 
-**2\.3 Repeat the screening for every time point to be graphed (year or month)**
+2\.3 **Turn the table into a pandas DataFrame**
 
-In order to repeat the screening for every time point that you are going to graph, you need to enclose the screening loop you made above inside another loop that does the screen for every point you want to plot.
-
-**2\.3.1 Determine the limits of the period for that analysis**
-
-The limits you choose will either be a range of months or years. The easiest way to loop through years is to treat them as numbers and use a `range()` object to generate them. The months are tricker, since they need to have leading zeros. So it's easier to create a list of strings like: `['01', '02', '03', ...'12']`. 
-
-**2\.3.2 Append time point and mean to growing lists of summary data**
-
-To do the visualization, you will need two lists containing the same number of items. One list will include the timepoint (typed as a number, so `'02'` would need to be turned into `2`.) and the other list will contain the mean value you calculated. After completing the screening and summing and calculation the average, you need to append these two numbers (timepoint and average) to the lists. Obviously, you need to create empty lists (`[]`) before you start the outer loop. 
-
-For the errorbar plot of maximum and minumum temperatures by month, you will need to have four lists: a list of the months, a list of mean temperatures, a list of the upper ends of the deviations, and a list of the lower ends of the deviations. You can calculate the latter two by taking the mean maximum temperature for that month and subtracting the mean average temperature, and taking the mean average temperature for that month ans subtracting the mean minimum temperature for that month.
+The list of dictionaries can be passed directly into the `pd.DataFrame()` function to create a DataFrame.
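
For example, with a made-up two-row list (the name `summary` is our own choice):

```python
import pandas as pd

# Made-up rows in the form built in step 2.2.4.
table = [
    {'year': 1896, 'precipitation': 0.75},
    {'year': 1898, 'precipitation': 1.2},
]

summary = pd.DataFrame(table)  # dictionary keys become the column headers
print(summary)
```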
 
 **3 Visualize data**
 
 The plot setup should be fairly straightforward and be similar to examples we did in class.
 
-**3\.1 Set up subplot<br/>
-3\.1.1 Create subplot**
+3\.1 **Create subplot**
 
 We should only need a single subplot (`ax`) within each figure.
 
-**3\.1.2 Label axes**
-
-Each axis should be labeled with both the quantity represented on that axis and the units of that quantity.
-
 **3\.2 Plot data using appropriate style (scatterplot, bar, error bars)**
 
 Each plot will differ in details, but the first argument (x) of the plot will be the timepoints and the second argument (y) will be the quantity averaged (precipitation or temperature).
@@ -185,6 +148,10 @@ A trendline is generally appropriate for scatterplots. A linear trendline (best-
 
 Follow the examples from the lesson. It may be clearer to make the trendline a different color than the scatterplot points. You can control this several ways.
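
As a rough sketch of the fitting step only, here is one way to fit a linear polynomial with NumPy. We are assuming `np.polyfit()` and `np.polyval()` here, which may differ from the functions used in the lesson, and the data values are made up.

```python
import numpy as np

# Made-up summary data: years and mean values that fall exactly on a line.
x = np.array([1896, 1897, 1898, 1899])
y = np.array([1.0, 1.5, 2.0, 2.5])

# A degree 1 fit returns the coefficients, highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line at each x value to get the trendline points.
trend_y = np.polyval([slope, intercept], x)

print(slope)
print(trend_y)
```

The `x` and `trend_y` values could then be passed to the same kind of plotting call used for the data, with a different color.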
 
+**3\.4 Label axes**
+
+Each axis should be labeled with both the quantity represented on that axis and the units of that quantity.
+
 ## Project rubric:
 
 | possible points | feature |