This course builds on what students have already learned about structuring and manipulating data. By introducing core statistical techniques, it gives students the tools that data scientists use to extract insights from data. Students will be asked to apply the course content to real-life scenarios and to think both creatively and critically about the issues that arise. Although the class focuses mostly on statistics, major parts of it cover programming and the implementation of algorithms, so students will also develop more advanced programming skills.
By the end of the class, students will be able to apply these methods to data and to interpret and communicate their results. Course objectives include:
- Understand and implement various statistical procedures in R.
- Describe and interpret the results of such procedures and algorithms.
- Expand R programming skills to be able to write/test/log code from scratch.
Each course session will be a mixture of lecture and in-class exercises. Typically, the materials for each evening include presentation slides, one or more data sets, and R scripts with illustrations and exercises related to the material. There will be 8 weekly homework assignments, each combining programming and reading, and a final individual project that students will work on over the course of the class.
There are no required textbooks. All required reading will be available online as articles or free PDFs. Optional additional reading, which may include books or textbooks, will be suggested for students who wish to read more on a subject.
- Required Reading Sources:
  - "An Introduction to Data Science" by Jeffrey Stanton.
  - "Statistical Thinking for Programmers" by Allen B. Downey.
- Additional Resources (optional):
  - "Computational Statistics Using R and R Studio: An Introduction for Scientists" by Randall Pruim.
  - "Online Statistics Education: A Multimedia Course of Study" by Rice University. I recommend section 1 (all), section 2 (all), and section 5 (parts A, B) as a brush-up on statistics.
  - "Team Leada R Tutorial".
  - "Data Camp R Tutorial".
  - "Code School - Try R".
- Further Fun Reading (optional):
  - "The Signal and the Noise" by Nate Silver. Penguin Press HC, 2012.
  - "Dataclysm" by Christian Rudder. Crown Publishing Group, 2014.
Students are expected to bring a personal machine to class that can run the following:
- R [http://cran.r-project.org/] and the RStudio IDE [http://www.rstudio.com/]
- Other free tools, which we will spend one day exploring, such as:
  - Python 2.x
  - Gephi (Note: Gephi is particular about the Java version. Last I checked, it required Java 1.6 or later.)
(Topics and Dates are tentative and subject to change)
Week | Lecture | Topic | Reading |
---|---|---|---|
Week 1 | 1st Lecture | Introduction; Data Exploration; R overview | Intro DS Ch 3, 9; StatThink Ch 2 |
Week 2 | 2nd Lecture | Probability Distributions; Conditional Prob; Missing Data; Getting/Storing Data | Intro DS Ch 7, 10; StatThink Ch 4 |
Week 3 | 3rd Lecture | Outliers and Missing Data; Intro to Hypothesis Testing | Intro DS Ch 6 |
Week 4 | 4th Lecture | Hypothesis Testing; The Central Limit Theorem; Intro to Regression | StatThink Ch 6, 7 |
Week 5 | 5th Lecture | More on Regression; Extra Topic #1 | StatThink Pg 93-97 |
Week 6 | 6th Lecture | Regression and Feature Selection | Intro DS Ch 16 |
Week 7 | 7th Lecture | Time Series; Spatial Statistics | None |
Week 8 | 8th Lecture | Bayesian and Computational Statistics | StatThink Pg 97-101 |
Week 9 | 9th Lecture | Guest Lecture; Extra Topic #3 | None |
Week 10 | 10th Lecture | Review; Possible Extra Topics | None |
Students MUST attend at least 8 of the 10 classes. Your grade will be based on eight homework assignments and one individual project. Details on these will be distributed on the first day. For the complete homework rubric, please see the class syllabus on the Canvas page.
For each homework, students should submit a report that includes:
- Working code which implements the procedures specified by the assignment. Code should be easy to read and commented well.
- Appropriate text and figures/graphs describing the results. All graphs and figures should be labeled.
Data science career resources:
- "Is Data Scientist the Right Career Choice? Candid Advice"
- Overview of an Analytics Career
- Trey Causey on Data Science Interviews
- Trey Causey on Hiring Data Scientists
- "Crushed it: Landing a Data Science Job" by Erin Shellman
- "Stuff I’ve Messed Up While Interviewing" by Ellen Chisa
- "Doing Data Science at Twitter" by Robert Chang
- "Advice for Data Scientists on Where to Work", Multiple Authors
- "50 Years of Data Science" by David Donoho. This is a ~40-page document, but it is full of great insights into the questions people tend to have about data science in general.
What you gain from this course depends heavily on your attendance and completion of the exercises. I fully expect students to participate actively (asking questions, doing the homework, helping others).
Students are expected to behave professionally and abide by all student policies outlined in the University of Washington Student Conduct Code [http://www.washington.edu/cssc/].