Final Project Details

The final project represents a culmination of the methodologies and techniques students have learned throughout the course. While students should continuously work with staff and discuss ideas with other students, the work should be entirely their own.

The project should address a data-related problem in your professional field, a field you are interested in, or a Kaggle competition. You will be asked to acquire a real-world data set, form a hypothesis about it, clean and parse the data, and apply modeling techniques and data analysis principles to ultimately create a predictive model.

Deliverables include a technical paper as well as a final presentation on the last day of class.

The technical paper (5-10 pages) should be directed toward the instructors and include:

  • Description of problem and hypothesis
  • Description of your data
  • In-depth detail of model(s) used
  • Comparison of models according to in-sample and out-of-sample results
  • Charts, graphs, and visualizations to support your thesis
  • Observations about using the model in the real world
  • Future improvements

The in-class presentation should be directed toward your student peers and should run about 10 minutes in total, including 2-3 minutes for questions. The presentation should be more informal, and there are bonus points for making it fun: 3 hours of p-value discussions is draining!

Milestone 1 - Due Dec 21

Problem Defined and Dataset Gathered

By the first milestone, you should have obtained a sizeable dataset and defined a problem of interest. You may use data from a field you are interested in or a Kaggle competition. DO check with your instructors prior to the milestone to get guidance with defining the problem. NOTE: If you are using a Kaggle competition, you may not use competitions from the "Getting Started" or "Playground" categories.

Deliverable:

  • Link to dataset - Feel free to include a hyperlink, Google Drive link, etc.
  • PDF - In a one-paragraph summary (5-10 sentences), describe the dataset and the problem. Is this a supervised or unsupervised problem, and is it in a continuous or categorical setting (or a little bit of everything)? Do you foresee any potential issues with the data or the problem?

Milestone 2 - Due Jan 25

Initial Data Exploration

By milestone 2, you should have explored and visualized your data. You should also be able to answer qualitative questions about your dataset:

  • Are there a lot of NA values? If so, can they be imputed, or can surrogate variables be used instead? If neither is possible, how will you handle them?
  • Are there a lot of outlier values--how will you handle these?
  • What are your chosen features--are they raw values or transformed? Why did you choose these features?
  • What charts or displays can you use to aggregate or describe the information?
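
As a starting point for the NA and outlier questions above, here is a minimal sketch. It uses only the Python standard library so it runs anywhere; in a notebook, pandas would be the more usual tool. The `ages` column, the helper names, the choice of median imputation, and the 1.5 × IQR outlier rule are all assumptions made for illustration, not requirements of the project.

```python
import math
import statistics

def is_missing(v):
    """Treat None and float NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def missing_fraction(values):
    """Fraction of entries that are missing."""
    return sum(1 for v in values if is_missing(v)) / len(values)

def median_impute(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    med = statistics.median(observed)
    return [med if is_missing(v) else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical column, made up for this example.
ages = [23, 25, None, 31, 29, float("nan"), 27, 120]

na_share = missing_fraction(ages)   # 2 of 8 entries are missing
imputed = median_impute(ages)       # missing entries become the median
outliers = iqr_outliers(imputed)    # values flagged by the IQR rule

print(f"missing: {na_share:.0%}")
print(f"imputed: {imputed}")
print(f"outliers: {outliers}")
```

Whether a flagged value such as an age of 120 is an error or a genuine observation is a judgment call, and that reasoning is what belongs in your write-up.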

In addition, build at least one model, even if its out-of-sample predictive ability is low. If it performs poorly, what are the potential reasons? (For example, a linear model may not work without additional transformed features, or the model may be overfitting the data.)
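
One way to tell whether a first model is overfitting is to compare its error on the data it was fitted to (in sample) with its error on held-out data (out of sample). The sketch below does this for a single-feature least-squares line, again with only the standard library; scikit-learn would be the usual choice in practice. The synthetic data, the 75/25 split, and the function names are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data, made up for this example: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Hold out the last 25% as a test set; the points are already in random order.
split = int(0.75 * len(xs))
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

a, b = fit_line(x_train, y_train)          # fit on training data only
train_mse = mse(a, b, x_train, y_train)    # in-sample error
test_mse = mse(a, b, x_test, y_test)       # out-of-sample error

print(f"fit: y = {a:.2f} + {b:.2f} * x")
print(f"in-sample MSE: {train_mse:.3f}")
print(f"out-of-sample MSE: {test_mse:.3f}")
```

A test error far above the training error suggests overfitting; if both are high, the model is more likely missing useful features or transformations.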

Deliverable:

  • PDF - In addition to the refined summary, include answers to the above questions as well as any other observations and thoughts about the data. Details about your initial model should go here as well as supporting charts and graphs.

Milestone 3 - Due Feb 17

Significant Data Exploration

By this milestone, you should have made significant progress on the project. You should have explored multiple techniques or variants and prepared charts and graphs that explain your data and approach. At this point, over 50% of your project should be completed. While this is a more informal milestone, you should have attempted a few models.

Deliverable:

  • Data, charts, graphs, and documentation available for discussion in class.

Milestone 4 - Due Mar 7

Final Paper and Presentation Completed

Deliverable:

  • Paper and presentation submitted to instructor PRIOR to start of class.
  • Link to the dataset used.
  • IPython notebook containing ALL of the code from loading the dataset to running various models and graphing.
  • In-class presentation of 10 minutes in length. Aim for 7 minutes of speaking, and allow for 2-3 minutes of questions.

Additional Resources

Project Ideas from Kaggle

Additional ideas

Example General Assembly projects

KDNuggets Datasets

UCI Datasets

US Health Data