Collection of quick pandas, python, R, and other coding examples based on real consulting requests.
Scenario: This is an example of webscraping a website that containts 10 years of historical user-generated data that used OnStar to collect data about the performance of Chevy Volts driving in the real world.
Scenario: This is an example consulting request from a Discovery project by an Undergraduate Student from Econ and Data Science requesting help with Data Visualization as a Debugging or Tech support request saying: "I would like to create a geopandas heat map of India (with coordinates and a legend of certain levels of GDP per capita), but I've never used geopandas before so a little unsure on how to create this mapping. Also unsure if I need to convert to a shape file."
JoinMulltipleCSV - How to join multiple CSV files into a single Pandas DataFrame based on a join key.
Scenario: Recording student scores for each class lecture, where the student email address and score is stored in a separate CSV file for each lecture.
Scenario: Loop over a directory tree containing Stata .dta files. Read the files into a pandas DataFrame and search for files that contain matching variable names. The result is a dictionary with the Stata filename as the key and the value is the variable names as a list (either full or narrowed just to the matches we're interested in).
See also the related non-notebook scripts. The finddta.py script essentially is a script-based copy of the notebook version above, and the scrubdta.py script takes the output of finddta.py
as the input for producing a stata file that contains only the columns that match the variables we want to keep, which is useful to de-identify sensitive data.
Wrangle Psych Survey Data - How to manipulate survey data outputs to evaluate distributive qualities of text responses and create matrix "dummy variables".
Scenario: You are presented with survey data containing text string responses to questions. These responses are represent combinations (mulitple elements per observation), but are separated with a clear structure. Consult is looking for a way to evaluate the many combinations of user responses, and structure data in a form that would allow for regression analysis. The script presents how to access and organize string data, exploring frequency of responses, (and combinations of responses) on a sample set of observations. The script goes on to pivot
cleaned string data to create dummy variables out of categorical symptom responses.
Scenario Students would like to visualize a metric they have created with a chloropleth map. Use the following script to join
data they have assembled with a shapefile of California at the County
level. Use tmap
package to plot map for presentation with a few options enabled.
Crop-spatial-points-with-shapefile - take a raw dataset of spatial points and initialize the CRS, and then crop with a shapefile.
Scenario: You are presented with a large spatial dataset of floral species in the continental United States. The researcher is only concerned with data mapped into the boundaries of the state of Florida. Dataset is presented without a Coordinate Reference System. Format the raw spatial data with a CRS, and use a shapefile of the state of Florida to crop only the points that land within its boundary.
Lasso-Variable-Importance - use tidymodels framework to structure, preprocess, and tune hyperparameters for a lasso regression analysis
Scenario: A student is hoping to run a lasso regression analysis on some data for their final project in a class. They have been working with the glmnet
package but have encountered errors when formatting data for model preparation. Walk through the process of splitting and model preparation of data, along with bootstrapping and tuning grid approaches to hyperparameter optimization. use vip
package to visualize feature importance from tidymodels
object.
Scenario: Take a dataset recording "relationships" between cases and contacts during a COVID19 outbreak, implement complex join
functions to wrangle and format dataframes to be handled by the VisNetwork
package. Use the formatted dataframes to create an interactive html object that illustrates acyclic exposure events from cases to contacts.