GitHub - epreble/data-scientist-exercise02: The second Data Scientist Exercise

RTI CDS Analytics Exercise 02

Welcome to Exercise 02. This exercise provides data from the National Transportation Safety Board's database of aviation accidents. We'll ask you to perform some routine high-level analytic tasks with the data.

Some guidance

Use open source tools, such as Python, R, or Java. Do not use proprietary tools, such as SAS, SPSS, JMP, Tableau, or Stata.
Fork this repository to your personal GitHub account and clone the fork to your computer.
Save and commit your answers to your fork of the repository, and push them back to your personal GitHub account. You can then provide a link to that fork of the repository if you need to show a code example.
Use the Internet as a resource to help you complete your work. We do it all the time.
Comment your code so that when you look back at it in a year, you'll remember what you were doing.
There are many ways to approach and solve the problems presented in this exercise.
Have fun!

The Data

There are 145 files in this repository:

AviationData.xml: This is a straight export of the database provided by the NTSB. The data has not been altered. It was retrieved by clicking "Download All (XML)" from this page on the NTSB site.

There are 144 files in the following format:

NarrativeData_xxx.json: These files were created by taking the EventIds from AviationData.xml and collecting two additional pieces of data for each event:
- narrative: This is the written narrative of an incident.
- probable_cause: If the full narrative includes a "probable cause" statement, it is in this field.

The Task

Explore the data and prepare a 1/2 to 1 page written overview of your findings.

Additional Context:

Assume the audience for your write-up is a non-technical stakeholder.
Assume the audience for your code is a colleague who may need to read or modify it in the future.

Possible approaches

This data is ripe for exploration. Working with the files provided, here are a few structured tasks that you can attempt. Note: these are suggested data exploration tasks. Feel free to complete none, some, or all of these steps as you explore the data.

Determine what fields you have available for each incident record.
Prepare descriptive statistics that convey an overview of the data.
If/Where you feel it is appropriate, visualize the data to help build a narrative around the descriptive statistics.
Perform initial exploratory analysis of the narrative text. What, if anything, can you learn from analyzing the text that you cannot learn from the structured data fields?
After removing common stopwords, what are the most frequent terms in the narrative text? Do the word frequencies change over time?
Use topic modeling to cluster the incidents based on the narrative text and/or probable cause descriptions. How do the clusters differ from those created with the structured data fields?
Imagine you have a client with a fear of flying. That client wants to know what types of flights present the most risk of an incident. Use your findings from the earlier tasks to answer the client's question in a report no longer than 1 page.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Misc		Misc
Model Output		Model Output
data		data
NTSB_Aviation_Accident_Data_Analysis.pdf		NTSB_Aviation_Accident_Data_Analysis.pdf
NTSB_analysis.py		NTSB_analysis.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RTI CDS Analytics Exercise 02

Some guidance

The Data

The Task

Possible approaches

About

Releases

Packages

Contributors 2

Languages

epreble/data-scientist-exercise02

Folders and files

Latest commit

History

Repository files navigation

RTI CDS Analytics Exercise 02

Some guidance

The Data

The Task

Possible approaches

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages