Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hackathon - Data Methods #5

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
66 changes: 39 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,91 +2,103 @@

# Team Members

* [name-of-a-team-member](URL to this member's github account)
* [name-of-a-team-member](URL to this member's github account)
* [name-of-a-team-member](URL to this member's github account)
* [Daniel Nolan](https://github.com/dano8957)
* [Chris Wittenberg](https://github.com/cwitty1919)
* [Ryan Roden](https://github.com/rodenr)
* [name-of-a-team-member](URL to this member's github account)
* [name-of-a-team-member](URL to this member's github account)

# Part 1: Data Science Fundamentals


## Q1: There are generally 2 situations you'll start from when approaching a question of data: a) You designed and collected the data yourself OR b) You have to work with a data set you've been given access to. What do you think makes these 2 starting points different? How might it change what analysis you'll do?
[provide response here]

What makes these two starting points different is that in option a, you have less control over what kind of data you have and how you can work with it. You can ask yourself, "What kind of data would be relevant here?" However, you may need to do more cleaning. In option b, you have to work with what you have and ask yourself, "How is this data relevant?".

(For the following set of questions we'll assume we're in situation A - you are going to design your own data collection)

## Q2: What factors go into deciding what data format to use? Under what circumstances may you use different data types? (i.e., JSON, CSV, Key-Value Store, txt documents)
[provide response here]
Some factors that go into deciding a chosen data format is how large the data file actually is. A JSON file would be a more likely choice if given a very large file compared to a CSV file. Unless there is a given storage system, one should probably not use a txt document. Key-value stores are a type of NOSQL database that is dynamic and actiely changing. Other files like text and JSON files are part of the SQL database where the dataset must be completely changed before being run again.

## Q3: Once you've chosen a format, you'll need to determine fields to capture and store. A common approach for this involves determining what QUESTIONS you want to ask of your dataset. For the following examples, please respond with which field(s) your may need to answer the questions needed:
1. You're working for a company that tracks data on public transportation, you know you'll want to be able to ask "What percentage of a time is a bus/train late?"
2. You're working for a school district, and you need to be able to help the principal answer the question "Which teachers are most successful at getting students interested in extra-curricular educational activities (e.g., Math Team, Quiz Bowl, Science Olypiad, Robot Building, etc)?
3. You're starting a social networking website that helps friends choose what to do on a Friday night, and you need to be able to answer the question, "Who made the suggestion that led to the final decision?"

### Answers:
1. [Field(s) here]
2. [Field(s) here]
3. [Field(s) here]
1. Bus_id, Train_id,Time of Scheduled Arival, Time of Actual Arrival
2. Teacher_id, Activity_id, Num(Students)
3. Creator_ID (of event), Event chosen? (boolean), Num(Participant_ID's)

## Q4: Now you need to decide how you'll query your data. What are the costs and benefits of the following options:
1. Store the data raw and load it into a Python or JavaScript Shell for analysis.
2. Periodically dump the data into a database (like Mongo) and query it.
3. Build a webserver and write an API that dumps and queries that data in your database.

### Answers:
1. [Costs/Benefits]
2. [Costs/Benefits]
3. [Costs/Benefits]
1. cost: only as useful as how often you analyse the data, benefit: less data cleaning
2. cost: data is not updated live, benefit: automation from periodic dumps
3. benefit: a public api allows for a diverse possibilities of analysis, such as different languages can work for whoever wants to analyze


## Q5: You've now set up your database and have a website with 10,000 users, but have realized that you forgot a much needed field (say, an ID number for each user). What do you do and how might different database designs have helped this situation?
[Respond here]

You can create code to itterate through your records and add ids. This is easy to do with a MongoDB, but you would have to update the schema and re-migrate records for an SQL database.

---------------

(For this section, you may need to do some online research to answer the questions.)

## Q6: What is a Baysian Classifier? What is it used for?
[Response]

A Baysian classifier minimizes the probability of missclassification. It is used to model distributions more effectively.

## Q7: What is a simple graph you could generate to check for outliers in a dataset?
[Response]

## Q8: What is a Null Hypothesis?
[Response]
A simple graph you can you can generate to check for outliers is a histogram.

----------
## Q8: What is a Null Hypothesis?
A general statement or default position where there is no relationship between two measured phenomena (or variables/datasets).

---------
Answer the following questions using this scenario: You just got a HUGE dataset from Spotify where each entry contains these fields -> [username, song, # of times played, user rating, genre]

## Q9: How would you figure out the most popular song?
[Reponse]
measuring a frequency of highest user rating and number of times played.

## Q10: How do you determine what genre a certain user likes the most?
[Response]
take data from 9 and look at the most common genre in the top X of the list
You could plot the dataset into Tableau and create a graph relating to Songs vs. # of times played. Then, you could incorporate the user rating to see which song is rated the highest.


## Q11: How do we match 2 users that we think may want to share playlists?
[Response]

We look at the most commonly listened to genres between people

## Q12: What assumptions would you have before digging into Spotify data? How would you test them?
[Response]

Depending on what is the most popular song currently, it would likely effect whether it is on someones personal most popular songs.


----------

Answer these last questions generally.

## Q12: What is a correlation and how do you find them in a data set?
[Response]

A correlation is a relation between two variables within your data. This could be something like a relationship between car accidents and not wearing your seat belt. There is a correlation between these variables since data has been shown that more deaths occur when people don't wear seat belts.

## Q13: How can correlations help us tell a story with our data?
[Response]

If a correlation is statistically significant, then this is strong eveidence for a relationship between to data points. This calls for further inquiry and exploration.

## Q14: Let's think about data science as a way to tell a story about some data. Why would I want to bring a second data set into my story?
[Response]

We can use an older data set as a control.

## Q15: This one's just for fun. How percent of the time do you expect to actually get the result you wanted?
[Response]

60%


# Part 2: Analyzing Your Data
Expand Down Expand Up @@ -120,8 +132,8 @@ While there are many tools to do this analysis, we will use the JS library Gauss
# Part 3: Project Design Exercise

## Link to the device or devices you're interested in using
* [device1](URL to this device)
* [device2](URL to this device)
* [thermal sensor](URL to this device)
* [sound sensor](http://www.parallax.com/product/29132)

## What it would measure and how?
[Response]
Expand Down
160 changes: 160 additions & 0 deletions README.md~
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
# Hackathon 10/6 - Data Methods and Design

# Team Members

* [Daniel Nolan](https://github.com/dano8957)
* [Chris Wittenberg](https://github.com/cwitty1919)
* [Ryan Roden](https://github.com/rodenr)
* [name-of-a-team-member](URL to this member's github account)
* [name-of-a-team-member](URL to this member's github account)

# Part 1: Data Science Fundamentals


## Q1: There are generally 2 situations you'll start from when approaching a question of data: a) You designed and collected the data yourself OR b) You have to work with a data set you've been given access to. What do you think makes these 2 starting points different? How might it change what analysis you'll do?

What makes these two starting points different is that in option a, you have less control over what kind of data you have and how you can work with it. You can ask yourself, "What kind of data would be relevant here?" However, you may need to do more cleaning. In option b, you have to work with what you have and ask yourself, "How is this data relevant?".

(For the following set of questions we'll assume we're in situation A - you are going to design your own data collection)

## Q2: What factors go into deciding what data format to use? Under what circumstances may you use different data types? (i.e., JSON, CSV, Key-Value Store, txt documents)
Some factors that go into deciding a chosen data format is how large the data file actually is. A JSON file would be a more likely choice if given a very large file compared to a CSV file. Unless there is a given storage system, one should probably not use a txt document. Key-value stores are a type of NOSQL database that is dynamic and actiely changing. Other files like text and JSON files are part of the SQL database where the dataset must be completely changed before being run again.

## Q3: Once you've chosen a format, you'll need to determine fields to capture and store. A common approach for this involves determining what QUESTIONS you want to ask of your dataset. For the following examples, please respond with which field(s) your may need to answer the questions needed:
1. You're working for a company that tracks data on public transportation, you know you'll want to be able to ask "What percentage of a time is a bus/train late?"
2. You're working for a school district, and you need to be able to help the principal answer the question "Which teachers are most successful at getting students interested in extra-curricular educational activities (e.g., Math Team, Quiz Bowl, Science Olypiad, Robot Building, etc)?
3. You're starting a social networking website that helps friends choose what to do on a Friday night, and you need to be able to answer the question, "Who made the suggestion that led to the final decision?"

### Answers:
1. Bus_id, Train_id,Time of Scheduled Arival, Time of Actual Arrival
2. Teacher_id, Activity_id, Num(Students)
3. Creator_ID (of event), Event chosen? (boolean), Num(Participant_ID's)

## Q4: Now you need to decide how you'll query your data. What are the costs and benefits of the following options:
1. Store the data raw and load it into a Python or JavaScript Shell for analysis.
2. Periodically dump the data into a database (like Mongo) and query it.
3. Build a webserver and write an API that dumps and queries that data in your database.

### Answers:
1. cost: only as useful as how often you analyse the data, benefit: less data cleaning
2. cost: data is not updated live, benefit: automation from periodic dumps
3. benefit: a public api allows for a diverse possibilities of analysis, such as different languages can work for whoever wants to analyze


## Q5: You've now set up your database and have a website with 10,000 users, but have realized that you forgot a much needed field (say, an ID number for each user). What do you do and how might different database designs have helped this situation?

You can create code to itterate through your records and add ids. This is easy to do with a MongoDB, but you would have to update the schema and re-migrate records for an SQL database.

---------------

(For this section, you may need to do some online research to answer the questions.)

## Q6: What is a Baysian Classifier? What is it used for?

A Baysian classifier minimizes the probability of missclassification. It is used to model distributions more effectively.

## Q7: What is a simple graph you could generate to check for outliers in a dataset?

A simple graph you can you can generate to check for outliers is a histogram.

## Q8: What is a Null Hypothesis?
A general statement or default position where there is no relationship between two measured phenomena (or variables/datasets).

---------
Answer the following questions using this scenario: You just got a HUGE dataset from Spotify where each entry contains these fields -> [username, song, # of times played, user rating, genre]

## Q9: How would you figure out the most popular song?
<<<<<<< HEAD
measuring a frequency of highest user rating and number of times played.

## Q10: How do you determine what genre a certain user likes the most?
take data from 9 and look at the most common genre in the top X of the list
=======
You could plot the dataset into Tableau and create a graph relating to Songs vs. # of times played. Then, you could incorporate the user rating to see which song is rated the highest.

## Q10: How do you determine what genre a certain user likes the most?
Genre vs. username. You could also the graph above to find what songs belong to each genre, but I think the initial comparison would give you what is necessary.
>>>>>>> ecc861a95e8f544174d27c7a50166b9d97c52da0

## Q11: How do we match 2 users that we think may want to share playlists?

We look at the most commonly listened to genres between people

## Q12: What assumptions would you have before digging into Spotify data? How would you test them?

Depending on what is the most popular song currently, it would likely effect whether it is on someones personal most popular songs.


----------

Answer these last questions generally.

## Q12: What is a correlation and how do you find them in a data set?

A correlation is a relation between two variables within your data. This could be something like a relationship between car accidents and not wearing your seat belt. There is a correlation between these variables since data has been shown that more deaths occur when people don't wear seat belts.

## Q13: How can correlations help us tell a story with our data?

If a correlation is statistically significant, then this is strong eveidence for a relationship between to data points. This calls for further inquiry and exploration.

## Q14: Let's think about data science as a way to tell a story about some data. Why would I want to bring a second data set into my story?

We can use an older data set as a control.

## Q15: This one's just for fun. How percent of the time do you expect to actually get the result you wanted?

60%


# Part 2: Analyzing Your Data

While there are many tools to do this analysis, we will use the JS library Gauss to accomplish this task since everyone should have used it by now.

## Screenshot of Data in Gauss
![screenshot of data in gauss](image.png?raw=true)

## Most Frequent Value
[cut and paste command used and the output it produced]

## Range of data
[cut and paste command used and the output it produced]

## Biggest Change
[cut and paste command used and the related snippet from the output]

## Shape of data
[Describe]

## Threshold
[Value]

## Percentage above/below
[Command and output]

## Yes/No + Justification
[Answer and any images/snippets to justify]

# Part 3: Project Design Exercise

## Link to the device or devices you're interested in using
* [device1](URL to this device)
* [device2](URL to this device)

## What it would measure and how?
[Response]

## Where you'd put it in the lobby?
[Location]

## What problems could threaten the validity of your data?
[Response]

## How often to sample and when to make a data dump?
[Response]

## Resulting viz.
[Describe or link to example]

## Timing trigger
[Necessary? Why? and How?]