Collection of quick pandas, python, and other coding examples based on real consulting requests.

dlab-consulting/quick-consulting-examples

Quick Consulting Examples

Collection of quick pandas, python, R, and other coding examples based on real consulting requests.

VoltStats Data Archive - Webscraping example

Scenario: This example scrapes a website holding 10 years of historical, user-generated data, collected through OnStar, on the real-world driving performance of Chevy Volts. Datahub

Geopandas Discovery Project Example - How to create a heat map using geopandas

Scenario: This is an example consulting request from a Discovery project. An undergraduate student in Econ and Data Science asked for help with data visualization, filed as a debugging or tech support request: "I would like to create a geopandas heat map of India (with coordinates and a legend of certain levels of GDP per capita), but I've never used geopandas before so a little unsure on how to create this mapping. Also unsure if I need to convert to a shape file." Datahub

JoinMulltipleCSV - How to join multiple CSV files into a single Pandas DataFrame based on a join key.

Scenario: Student scores are recorded for each class lecture, with the student email addresses and scores stored in a separate CSV file per lecture. Binder
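
As a minimal sketch of this kind of join (not the notebook's own code), the snippet below reads one CSV per lecture, renames each score column after its lecture, and outer-merges everything on the email address. The column names `email` and `score`, the lecture labels, and the in-memory CSV text are all assumptions made for illustration.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical per-lecture CSVs; in practice these would be files on disk.
lecture_csvs = {
    "lecture1": "email,score\na@example.edu,9\nb@example.edu,7\n",
    "lecture2": "email,score\nb@example.edu,8\nc@example.edu,10\n",
}


def join_scores(csv_sources, key="email", value="score"):
    """Outer-join one score column per lecture on the shared key column."""
    frames = []
    for name, source in csv_sources.items():
        df = pd.read_csv(source)
        # Rename the score column so each lecture keeps its own column.
        frames.append(df[[key, value]].rename(columns={value: name}))
    # An outer merge keeps students who missed some lectures (NaN scores).
    return reduce(lambda left, right: left.merge(right, on=key, how="outer"), frames)


scores = join_scores({k: io.StringIO(v) for k, v in lecture_csvs.items()})
print(scores.sort_values("email").reset_index(drop=True))
```

An outer join is used here so that a student absent from one lecture still appears in the result; switching `how` to `"inner"` would keep only students present in every file.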

StataFileVariableSearch - How to search Stata files that contain matching variable names.

Scenario: Loop over a directory tree containing Stata .dta files. Read the files into a pandas DataFrame and search for files that contain matching variable names. The result is a dictionary with the Stata filename as the key and the value is the variable names as a list (either full or narrowed just to the matches we're interested in). Binder

See also the related non-notebook scripts. The finddta.py script is essentially a script-based copy of the notebook version above. The scrubdta.py script takes the output of finddta.py as its input and produces a Stata file containing only the columns that match the variables we want to keep, which is useful for de-identifying sensitive data.
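
The search and the scrub step could look roughly like the sketch below. It is not the code in finddta.py or scrubdta.py; the function names `find_dta_variables` and `scrub_dta`, their parameters, and the choice of a case-insensitive regular expression for matching are assumptions made for illustration.

```python
import os
import re

import pandas as pd


def find_dta_variables(root, pattern, matches_only=True):
    """Map each .dta file under root to its matching variable names.

    Files with no matching variable are left out. With matches_only=False
    the value is the file's full variable list instead of just the matches.
    """
    regex = re.compile(pattern, flags=re.IGNORECASE)
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.lower().endswith(".dta"):
                continue
            path = os.path.join(dirpath, filename)
            # Read the file into a DataFrame; its columns are the Stata variables.
            columns = list(pd.read_stata(path).columns)
            matched = [c for c in columns if regex.search(c)]
            if matched:
                results[path] = matched if matches_only else columns
    return results


def scrub_dta(src_path, dest_path, keep):
    """Write a copy of a Stata file that contains only the columns in keep."""
    df = pd.read_stata(src_path)
    df[keep].to_stata(dest_path, write_index=False)
```

A call such as `find_dta_variables("data", r"^income")` (both arguments hypothetical) would return a dictionary keyed by file path, and each entry's list can be passed as `keep` to `scrub_dta`.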

Wrangle Psych Survey Data - How to manipulate survey data outputs to evaluate distributive qualities of text responses and create matrix "dummy variables".

Scenario: You are presented with survey data containing text string responses to questions. Each response represents a combination (multiple elements per observation), with the elements separated by a clear structure. The requester is looking for a way to evaluate the many combinations of user responses and to structure the data in a form that allows for regression analysis. The script shows how to access and organize string data, exploring the frequency of responses (and of combinations of responses) on a sample set of observations. It then pivots the cleaned string data to create dummy variables out of categorical symptom responses.
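
A small pandas sketch of the same idea follows. It is an illustration under stated assumptions, not the script itself: the column names, the `;` separator, and the sample responses are hypothetical, and `Series.str.get_dummies` is used here in place of whatever pivot the script performs.

```python
import pandas as pd

# Hypothetical survey export: each response lists symptoms separated by ";".
survey = pd.DataFrame(
    {
        "respondent": [1, 2, 3, 4],
        "symptoms": ["fatigue;headache", "headache", "fatigue;headache", "nausea"],
    }
)

# Frequency of each full combination of responses.
combo_counts = survey["symptoms"].value_counts()

# Frequency of each individual response, after splitting the combinations.
item_counts = survey["symptoms"].str.split(";").explode().value_counts()

# One indicator column per distinct response, usable as regression inputs.
dummies = survey["symptoms"].str.get_dummies(sep=";")
wide = pd.concat([survey[["respondent"]], dummies], axis=1)
print(wide)
```

Keeping the respondent identifier next to the indicator columns makes it straightforward to merge the dummies back onto other survey variables before fitting a model.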

County-Level-Chloropleth-Map

Scenario: Students would like to visualize a metric they have created with a choropleth map. The script joins the data they have assembled with a county-level shapefile of California, then uses the tmap package to plot the map for presentation with a few options enabled.

Crop-spatial-points-with-shapefile - take a raw dataset of spatial points and initialize the CRS, and then crop with a shapefile.

Scenario: You are presented with a large spatial dataset of floral species in the continental United States. The researcher is only concerned with data that falls within the boundaries of the state of Florida. The dataset is presented without a Coordinate Reference System (CRS). Assign a CRS to the raw spatial data, then use a shapefile of the state of Florida to keep only the points that land within its boundary.

Lasso-Variable-Importance - use tidymodels framework to structure, preprocess, and tune hyperparameters for a lasso regression analysis

Scenario: A student is hoping to run a lasso regression analysis on some data for their final project in a class. They have been working with the glmnet package but have encountered errors when formatting data for model preparation. Walk through splitting the data and preparing it for modeling, along with bootstrapping and tuning-grid approaches to hyperparameter optimization. Use the vip package to visualize feature importance from the tidymodels object.

Network-Analysis-Visualization - How to visualize a social network with contact tracing data.

Scenario: Take a dataset recording "relationships" between cases and contacts during a COVID-19 outbreak, and implement complex join functions to wrangle and format dataframes for the visNetwork package. Use the formatted dataframes to create an interactive HTML object that illustrates acyclic exposure events from cases to contacts.
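
The original example is written in R. The sketch below only illustrates the data-shaping step in pandas, assuming the usual visNetwork layout of a node table with an `id` column and an edge table with `from` and `to` columns. The input column names `case_id` and `contact_id`, the sample records, and the `group` labels are hypothetical.

```python
import pandas as pd

# Hypothetical contact-tracing records: each row links a case to one contact.
relationships = pd.DataFrame(
    {
        "case_id": ["A", "A", "B", "C"],
        "contact_id": ["B", "C", "D", "E"],
    }
)

# Edge table: one directed exposure per row, with duplicates removed.
edges = (
    relationships.rename(columns={"case_id": "from", "contact_id": "to"})
    .drop_duplicates()
    .reset_index(drop=True)
)

# Node table: every ID that appears as a case or a contact, listed once.
ids = pd.concat([edges["from"], edges["to"]]).drop_duplicates().sort_values()
nodes = pd.DataFrame({"id": ids.to_list()})
nodes["label"] = nodes["id"]
# Anyone who exposed someone else is tagged "case"; everyone else is "contact".
nodes["group"] = nodes["id"].isin(edges["from"]).map({True: "case", False: "contact"})

print(nodes)
print(edges)
```

Both tables could then be written out with `to_csv` and read into R for plotting, since the plotting itself stays with the visNetwork package.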

Next Example Goes Here...
