Merge pull request #65 from rhodyprog4ds/assignment6
Assignment6
brownsarahm authored Oct 18, 2020
2 parents ae71b25 + 70af4ec commit bfa5de3
Showing 4 changed files with 193 additions and 0 deletions.
1 change: 1 addition & 0 deletions _toc.yml
@@ -37,6 +37,7 @@
- file: assignments/03-exploratory
- file: assignments/04-prepare
- file: assignments/05-construct
- file: assignments/06-naive-bayes

- file: portfolio/index
sections:
77 changes: 77 additions & 0 deletions assignments/06-naive-bayes.md
@@ -0,0 +1,77 @@
# Assignment 6


__Due: 2020-10-20__


Remember, even if you aren't sure how to do every part of the assignment, submitting pseudocode, a list of your questions indicating where you got stuck, or a partial solution will get you personalized feedback. You might even earn level 1 achievements for the partial effort.

## For Classification Level 2:

[Accept the Assignment](https://classroom.github.com/a/fojgz8Cc)



Create one notebook where you load and examine each of the provided datasets for suitability for use with Gaussian Naive Bayes. Label the sections `Dataset #`. Use exploratory data analysis (visualisation and statistics) in pandas and Gaussian Naive Bayes from scikit-learn to answer the following for each dataset:


1. Do you expect Gaussian Naive Bayes to work well on this dataset? Why or why not? (Think about the assumptions of Naive Bayes and of classification in general.) _Explanation is essential here, because you can actually use the classifier to check._
1. How well does a Gaussian Naive Bayes classifier work on this dataset? Do you think a different classifier might work better or do you think this data cannot be predicted any better than this?
1. How does the actual performance compare to your prediction? If it performs much better or much worse than you expected, what might you use to figure out why? (_you do not have to figure out why your predictions were not correct, just list tools you've learned in class that might help you figure that out_)


This assignment will be easiest if you take advantage of the template via git:

### Git Workflow

1. Click the green code button and copy the url.
1. clone the repository as below, where the URL is the one you copied from GitHub. Do this in your terminal on Linux or Mac, or in Git Bash on Windows ([install instructions in the tools section of the syllabus](prorgrammin-env) )

```
cd path/where/you/want/a/new/folder/for/this/assignment
git clone http://github.com/rhodyprog4ds/06-classification-username
```

1. then launch your notebook from the newly created folder.

on Linux or Mac, in the same terminal (remember you can use tab complete)
```
cd 06-classification-username
jupyter notebook
```
or on Windows, in your Anaconda prompt
```
cd path/where/you/want/a/new/folder/for/this/assignment/06-classification-username
jupyter notebook
```

1. work on your assignment, by opening the notebooks there and editing them.
1. commit your changes when you want to save a point in your progress: either you have part working and want to save that, or you want feedback, or you're done. Use whichever terminal you used for the `git clone` command above

```
git add .
git commit -m 'description of your changes since last commit'
```

1. push your changes when you want to share them on GitHub (e.g. you need help, want feedback, or are done)

```
git push
```

1. if you will keep working after pushing, first pull down the .md conversion that was added by GitHub Actions. If you don't, you'll have to merge; that should be fine, but ask if you're not sure.

```
git pull
```



## For construct level 2

Build your own dataset for classification by merging data from separate sources that you're interested in.


## Summarize and Viz level 2

Show some extra summary stats and plots
1 change: 1 addition & 0 deletions assignments/index.md
@@ -8,3 +8,4 @@ Assignment TOC:
- [Assignment 3](03-exploratory) Due September 29
- [Assignment 4](04-prepare) Due October 4
- [Assignment 5](05-construct) Due October 11
- [Assignment 6](06-naive-bayes) Due October 20
114 changes: 114 additions & 0 deletions notes/2020-10-16.md
@@ -0,0 +1,114 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.12
jupytext_version: 1.6.0
kernelspec:
display_name: Python 3
language: python
name: python3
---

# Class 17: Evaluating Classification and Midsemester Feedback

1. share your favorite rainy day activity (or just say hi) in the Zoom chat for attendance
1. log onto Prismia

+++

<!-- annotate: Naive Bayes Review -->
## Naive Bayes Review


Main assumptions:
- classification assumes that the features will separate the groups
- Naive Bayes (NB): the features are conditionally independent given the class
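
To make the independence assumption concrete: given the class, the likelihood of a sample factors into one one-dimensional term per feature, so the log likelihood is a sum over features. The block below is a minimal from-scratch sketch of that idea using only the standard library. The function names, the toy data, and the small variance floor are all assumptions made for illustration; this is not how scikit-learn's `GaussianNB` is implemented internally.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance, with a tiny floor (an assumption of this
        # sketch) so we never divide by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        # Conditional independence: the joint likelihood is a product over
        # features, so the log likelihood is a sum of per-feature terms.
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(model, key=log_posterior)


# Hypothetical toy data: two features, two well-separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```

Because each class is described only by per-feature means and variances, any correlation between features within a class is ignored; that is the part of the assumption worth checking in exploratory analysis.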

```{code-cell} ipython3
# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
```

```{code-cell} ipython3
iris = sns.load_dataset("iris")
iris.head()
```

```{code-cell} ipython3
X_train, X_test, y_train, y_test = train_test_split(iris.values[:,:4],
iris.species.values,
test_size=0.5, random_state=0)
```

```{code-cell} ipython3
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
```

```{code-cell} ipython3
y_pred
```

```{code-cell} ipython3
y_test
```

```{code-cell} ipython3
sum(y_pred == y_test)
```

```{code-cell} ipython3
len(y_pred)
```

```{code-cell} ipython3
gnb.score(X_test, y_test)
```

```{code-cell} ipython3
71/75
```
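
The cells above arrive at accuracy two ways: counting how many predictions match the true labels and dividing by the number of test samples, and calling `score`, which for a scikit-learn classifier returns mean accuracy. The same arithmetic can be sketched with plain lists; the labels below are hypothetical stand-ins for `y_test` and `y_pred`, not the values from this notebook.

```python
# Hypothetical labels, standing in for y_test and y_pred.
y_true = ["setosa", "versicolor", "virginica", "virginica",
          "setosa", "versicolor", "virginica", "setosa"]
y_hat = ["setosa", "versicolor", "versicolor", "virginica",
         "setosa", "virginica", "virginica", "setosa"]

# Count the positions where the prediction equals the true label.
correct = sum(t == p for t, p in zip(y_true, y_hat))

# Accuracy is the fraction of correct predictions.
accuracy = correct / len(y_true)
print(correct, len(y_true), accuracy)
```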

```{code-cell} ipython3
from sklearn.metrics import confusion_matrix, classification_report
```

```{code-cell} ipython3
confusion_matrix(y_test, y_pred)
```
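
In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, with labels in sorted order. As a rough sketch of how that table relates to per-class precision and recall, the block below rebuilds it by hand with the standard library. The labels are hypothetical and the variable names are chosen for illustration only.

```python
from collections import Counter

# Hypothetical labels, standing in for y_test and y_pred.
y_true = ["setosa", "setosa", "versicolor", "versicolor",
          "versicolor", "virginica", "virginica", "virginica"]
y_hat = ["setosa", "setosa", "versicolor", "virginica",
         "versicolor", "virginica", "versicolor", "virginica"]

labels = sorted(set(y_true) | set(y_hat))
counts = Counter(zip(y_true, y_hat))

# Rows are true classes, columns are predicted classes.
matrix = [[counts[(t, p)] for p in labels] for t in labels]

per_class = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]                          # diagonal: correct predictions
    predicted = sum(row[i] for row in matrix)  # column sum: times predicted
    actual = sum(matrix[i])                    # row sum: times it truly occurred
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    per_class[label] = (precision, recall)

for row in matrix:
    print(row)
print(per_class)
```

Off-diagonal entries show which pairs of classes get confused with each other, which a single accuracy number hides.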

```{code-cell} ipython3
sns.pairplot(data =iris, hue='species')
```

```{code-cell} ipython3
print(classification_report(y_test,y_pred))
```

```{code-cell} ipython3
gnb.__dict__
```

```{code-cell} ipython3
import numpy as np
```

```{code-cell} ipython3
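# %load http://drsmb.co/310
# Draw 20 synthetic samples per class from the Gaussians the model learned.
# gnb.theta_ holds the per-class feature means and gnb.sigma_ the per-class
# feature variances (sigma_ is named var_ in scikit-learn 1.0 and later).
# sig*np.eye(4) is a diagonal covariance, i.e. independent features.
df = pd.DataFrame(np.concatenate([np.random.multivariate_normal(mu, sig*np.eye(4),20)
for mu, sig in zip(gnb.theta_,gnb.sigma_)]))
# Repeat each class label 20 times, in the same order as the samples.
df['species'] = [ci for cl in [[c]*20 for c in gnb.classes_] for ci in cl]
sns.pairplot(data =df, hue='species')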
```
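
The cell above samples from what the fitted model believes the data looks like: one Gaussian per class, with a diagonal covariance, so every feature is drawn independently of the others. A standard-library sketch of the same idea follows. The parameter values are hypothetical stand-ins for the fitted means and variances, and only two features are used to keep it short.

```python
import random

# Hypothetical per-class parameters: (means, variances), one entry per
# feature. These stand in for the fitted per-class means and variances.
params = {
    "a": ([1.0, 2.0], [0.04, 0.01]),
    "b": ([3.0, 4.0], [0.02, 0.01]),
}

rng = random.Random(0)  # seeded so the sketch is repeatable
n_per_class = 20

samples = []
for label, (means, variances) in params.items():
    for _ in range(n_per_class):
        # Independent features: draw each coordinate from its own 1-D
        # normal; random.gauss takes a standard deviation, hence the sqrt.
        row = [rng.gauss(m, v ** 0.5) for m, v in zip(means, variances)]
        samples.append((label, row))

print(len(samples))
```

Comparing a pair plot of samples like these with the pair plot of the real data is one way to see where the independence assumption fits and where it does not.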

<!-- annotate: Reminder to Stop Early for [feedback survey](https://forms.gle/yqWEPGJjFXDczuDv7) -->
## Reminder to Stop Early for [feedback survey](https://forms.gle/yqWEPGJjFXDczuDv7)

```{code-cell} ipython3
```
