Code and lecture materials for the "Data Science for Physicists" short course held at the 2025 APS Global Physics Summit.
This short course is co-sponsored by the Topical Group on Data Science (GDS), the Division of Computational Physics (DCOMP), the Division of Soft Matter (DSOFT), the Division of Particles and Fields (DPF), and the Division of Biological Physics (DBIO).
The schedule for the short course is available here.
Data science is playing an ever-increasing role in physics. In this two-day tutorial, we will introduce data science as it applies to a variety of fields in physics.

The first day of the course is an introduction to the fields of data science and machine learning (ML) as they apply to physics data. We will then provide an introduction to machine learning, including both regression and classification algorithms. This session will explain why neural networks work and describe the practical steps needed to train a model, such as feature engineering, hyperparameter tuning, and validation. We will conclude the first day of the tutorial with an introduction to unsupervised learning techniques (including clustering and random forests), as well as a session that will introduce both neural networks (NNs) and convolutional neural networks (CNNs).

The second day of this course will provide sessions on advanced topics in data science and machine learning. The first three sessions will cover graph neural networks (GNNs) and large language models (LLMs), introducing the topics and then focusing on their applications to physics.

The final four sessions of the tutorial will cover a range of applications of both machine learning and data science. The session “Assessing Training Data: Material Data APIs” will cover accessing large, online databases of materials data to use as training data for machine learning algorithms. The session “Introduction to neural-network quantum states (NQS)” aims to provide a clear understanding of NQS and their broader applications in quantum many-body physics by introducing the theoretical and computational background necessary for constructing NQS, focusing on the quantum harmonic oscillator. The third session of the afternoon, “Using Data Science to Understand Complexity in Soft Matter Systems”, will discuss recent applications of data science and machine learning to understanding the complexity in soft matter systems. Finally, the session “Applications of Machine Learning to Biology” will focus on using AI to build “mechanistic foundation models” capable of physics simulations of the brain and the body of the fruit fly.
- Data visualization and exploratory data analysis
- Regression and classification models
- Unsupervised machine learning
- Neural networks
- Convolutional neural networks
- Graph neural networks
- Large language models
- Databases and APIs
- Julie Butler (University of Mount Union), Data Exploration and Visualization
- Jim Pivarski (Princeton University), Introduction to Machine Learning
- Trevor Rhone (Rensselaer Polytechnic Institute), Introduction to Unsupervised Learning
- William Ratcliff (National Institute of Standards and Technology), Introduction to Neural Networks and Convolutional Neural Networks
- Savannah Thais (Columbia University), Introduction to Graph Neural Networks
- John McNally (Wolfram Research), Introduction to Large-Language Models and Retrieval Augmented Generation
- Benjamin Nachman (Lawrence Berkeley National Laboratory), Physics Applications of Large Language Models
- Cormac Toher (The University of Texas at Dallas), Assessing Training Data: Material Data APIs
- Jane Kim (Ohio University), Introduction to neural-network quantum states
- Sean A. Ridout (Emory University), Using Data Science to Understand Complexity in Soft Matter Systems
- Srini Turaga (HHMI Janelia Research Campus), Applications of Machine Learning to Biology