3 credits, Spring 2023
Dr Sophie Clayton, [email protected]
Office hours: 13:00 - 15:00 M, OCNPS 423
Class times: 9:30 - 10:45 T/Th, OCNPS 403
Link to syllabus
Students will require a laptop. All computational tools used in the course are available for free.
The Ocean, Earth and Environmental Sciences are quickly moving from being data-poor to data-rich disciplines, with many scientific and industry-related applications enabled by the analysis, synthesis and statistical modeling of large and diverse environmental data sets.
This is an advanced computational analysis course designed to introduce students to data management and analysis methods commonly used in data science applications. The data analysis portion of the course will be primarily based on machine learning methods. The course will also give an overview of a selection of scientific databases which host freely available oceanographic data and output from numerical model simulations. This course is not discipline-specific and will be useful for any student who wants to work with data efficiently and gain experience in data management, in developing sound analytical pipelines and in applying machine learning to their research.
The class will meet two days a week, on Tuesday and Thursday. Classes will combine lectures, discussions and practical coding exercises, with collaboration and teamwork encouraged. The course culminates in an individual capstone project in which each student applies the techniques learned during the course to a data analysis project based on their own research interests. The project must use at least two different data sources from open scientific databases and may also include data that students have generated themselves. Students will be expected to publish the code and results of their project in a public GitHub repository.
A pdf of the schedule can be found here
Week | Topic | Notes and code | Homework |
---|---|---|---|
1 (1/10) | Open Science and FAIR data | Lecture slides | |
1 (1/12) | Version control, git, GitHub | Version control overview, Intro to git | HW1 (due January 25th 5pm) |
2 (1/17) | Data science workflow and project organization | Data science workflow overview, Project organization | |
2 (1/19) | Intro to environments, Exploratory Data Analysis | conda notes, conda cheatsheet, EDA with pandas jupyter notebook | |
3 (1/24) | Plotting with seaborn, more pandas EDA | | HW2 (due February 3rd 5pm) |
3 (1/26) | Environmental databases and toolboxes, mapping with cartopy | List of databases, making a map with cartopy, plotting data on a map | |
4 | NO CLASSES | -- | -- |
5 (2/7) | Machine learning overview | Notes TBA | |
5 (2/9) | Intro to scikit-learn and Supervised Regression | Nitrate linear regression example | HW3 (due February 22nd 5pm) |
6 (2/14) | Scaling, Neural Network Regressors, Nearest Neighbor Regressors, in class practice | Examples of different regression estimators | |
6 (2/16) | Supervised learning - classification, evaluation and error metrics | Notes TBA | |
7 (2/21) | In class practice | | HW4 (due March 8th 5pm) |
7 (2/23) | Supervised learning - KNN and MLPClassifier | iris dataset examples: KNN classifier, MLPClassifier and feature scaling | |
8 (2/28) | Unsupervised learning - KMeans | KMeans example using the seeds dataset | |
8 (3/2) | Unsupervised learning and Capstone Project Development | | HW5 Capstone Proposal (due March 17th 5pm) |
9 | NO CLASSES | SPRING BREAK | -- |
10 (3/14) | Dimensionality Reduction, Feature Extraction with PCA | PCA feature extraction example using the wine dataset | |
11 (3/21) | Feature Selection methods | ||
11 (3/23) | Thursday: Project work | ||
12 (3/28) | Cross-validation for training models on small datasets | ||
12 (3/30) | Thursday: Project work | ||
13 (4/4) | Paper discussion | ||
13 (4/6) | Thursday: Project work | ||
14 (4/11) | TBD | ||
14 (4/13) | Thursday: Project work | ||
15 (4/18) | Project Presentations | Capstone Presentation Instructions | |
15 (4/20) | Project Presentations | | |
- Understand FAIR data principles and how to apply them when generating, sharing and accessing data.
- Develop a working knowledge of existing ocean and earth science databases and how to efficiently access data from them, including via APIs.
- Develop a personal data analysis toolbox using, but not limited to, Python and shell scripts.
- Understand and use version control (e.g. git), environments (e.g. conda) and code repositories (e.g. GitHub) to manage and share code.
- Understand the underlying principles of machine learning techniques for regression and classification, including supervised and unsupervised learning, and apply them to a targeted research question.
- Understand the process of model evaluation and optimization and commonly used metrics for reporting model performance.
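As a small illustration of the model evaluation objective above, the following is a minimal sketch of how a few common classification metrics (accuracy, precision, recall and F1) are computed from true and predicted labels. It uses only the Python standard library. The function names and the label lists are hypothetical, made up for illustration, and are not course material or course data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels, invented for this example only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In practice, the `sklearn.metrics` module in scikit-learn provides equivalents such as `accuracy_score`, `precision_score`, `recall_score` and `f1_score`; the hand-written version here is only meant to show what those numbers represent.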
The goal of the final capstone project is to assess students' ability to combine and apply the skills learned in class in the context of a real-world research problem. The class will mostly focus on tools for data analysis, visualization and developing and evaluating machine learning models, so this will be the focus of the capstone project. Students must have the dataset(s) and general scope of their capstone project approved by the instructor the week after spring break.
Detailed information on the capstone project is posted here.