diff --git a/_toc.yml b/_toc.yml
index 964c3d22..88e74953 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -37,6 +37,7 @@
   - file: assignments/03-exploratory
   - file: assignments/04-prepare
   - file: assignments/05-construct
+  - file: assignments/06-naive-bayes
 - file: portfolio/index
   sections:
diff --git a/assignments/06-naive-bayes.md b/assignments/06-naive-bayes.md
new file mode 100644
index 00000000..cb4b4417
--- /dev/null
+++ b/assignments/06-naive-bayes.md
@@ -0,0 +1,77 @@
+# Assignment 6
+
+
+__Due: 2020-10-20__
+
+
+Remember, even if you aren't sure how to do every part of the assignment, submitting pseudocode, a list of your questions indicating where you got stuck, or a partial solution will get you personalized feedback. You might even earn level 1 achievements for the partial effort.
+
+## For Classification Level 2:
+
+[Accept the Assignment](https://classroom.github.com/a/fojgz8Cc)
+
+
+
+Create one notebook where you load and examine each of the provided datasets for suitability to use with Gaussian Naive Bayes. Label the sections `Dataset #`. Use exploratory data analysis (visualisation and statistics) in pandas and Gaussian Naive Bayes from scikit-learn to answer the following for each dataset:
+
+
+1. Do you expect Gaussian Naive Bayes to work well on this dataset? Why or why not? (Think about the assumptions of Naive Bayes and of classification in general.) _Explanation is essential here, because you can actually use the classifier to check._
+1. How well does a Gaussian Naive Bayes classifier work on this dataset? Do you think a different classifier might work better, or do you think this data cannot be predicted any better than this?
+1. How does the actual performance compare to your prediction? If it performs much better or much worse than you expected, what might you use to figure out why? (_You do not have to figure out why your predictions were not correct; just list tools you've learned in class that might help you figure that out._)
+
+
+This assignment will be easiest if you take advantage of the template via git:
+
+### Git Workflow
+
+1. Click the green code button and copy the url.
+1. Clone the repository as below, where the url is the one you copied from GitHub. Do this in your terminal on Linux or Mac, or in Git Bash on Windows ([install instructions on tools section of syllabus](prorgrammin-env)).
+
+    ```
+    cd path/where/you/want/a/new/folder/for/this/assignment
+    git clone http://github.com/rhodyprog4ds/06-classification-username
+    ```
+
+1. Then launch your notebook from the newly created folder.
+
+    On Linux or Mac, in the same terminal (remember you can use tab complete):
+    ```
+    cd 06-classification-username
+    jupyter notebook
+    ```
+    or on Windows, in your Anaconda prompt:
+    ```
+    cd path/where/you/want/a/new/folder/for/this/assignment/06-classification-username
+    jupyter notebook
+    ```
+
+1. Work on your assignment by opening the notebooks there and editing them.
+1. Commit your changes when you want to save a point in your progress: you have part working and want to save that, you want feedback, or you're done. Use whichever terminal you used for the `git clone` command above.
+
+    ```
+    git add .
+    git commit -m 'description of your changes since last commit'
+    ```
+
+1. Push your changes when you want to share them on GitHub (e.g., you need help, want feedback, or are done).
+
+    ```
+    git push
+    ```
+
+1. If you will keep working after pushing, first pull down the .md conversion that was added by GitHub Actions. If you don't, you'll have to merge; that should be fine, but ask if you're not sure.
+
+    ```
+    git pull
+    ```
+
+
+
+## For construct level 2
+
+Build your own dataset for classification by merging data from separate sources that you're interested in.
+
+
+## Summarize and Viz level 2
+
+Show some extra summary stats and plots.
diff --git a/assignments/index.md b/assignments/index.md
index 14b87150..45076531 100644
--- a/assignments/index.md
+++ b/assignments/index.md
@@ -8,3 +8,4 @@ Assignment TOC:
 - [Assignment 3](03-exploratory) Due September 29
 - [Assignment 4](04-prepare) Due October 4
 - [Assignment 5](05-construct) Due October 11
+- [Assignment 6](06-naive-bayes) Due October 20
diff --git a/notes/2020-10-16.md b/notes/2020-10-16.md
new file mode 100644
index 00000000..fdaba60c
--- /dev/null
+++ b/notes/2020-10-16.md
@@ -0,0 +1,114 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.6.0
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+# Class 17: Evaluating Classification and Midsemester Feedback
+
+1. Share your favorite rainy day activity (or just say hi) in the Zoom chat for attendance
+1. Log onto Prismia
+
++++
+
+
+## Naive Bayes Review
+
+
+Main assumptions:
+- classification assumes that the features will separate the groups
+- NB: features are conditionally independent given the class
+
+```{code-cell} ipython3
+# %load http://drsmb.co/310
+import pandas as pd
+import seaborn as sns
+from sklearn.model_selection import train_test_split
+from sklearn.naive_bayes import GaussianNB
+```
+
+```{code-cell} ipython3
+iris = sns.load_dataset("iris")
+iris.head()
+```
+
+```{code-cell} ipython3
+X_train, X_test, y_train, y_test = train_test_split(iris.values[:,:4],
+                                                    iris.species.values,
+                                                    test_size=0.5, random_state=0)
+```
+
+```{code-cell} ipython3
+gnb = GaussianNB()
+y_pred = gnb.fit(X_train, y_train).predict(X_test)
+```
+
+```{code-cell} ipython3
+y_pred
+```
+
+```{code-cell} ipython3
+y_test
+```
+
+```{code-cell} ipython3
+sum(y_pred == y_test)  # number of correct predictions
+```
+
+```{code-cell} ipython3
+len(y_pred)  # size of the test set
+```
+
+```{code-cell} ipython3
+gnb.score(X_test, y_test)  # mean accuracy on the test set
+```
+
+```{code-cell} ipython3
+71/75
+```
+
+```{code-cell} ipython3
+from sklearn.metrics import confusion_matrix, classification_report
+```
+
+```{code-cell} ipython3
+confusion_matrix(y_test, y_pred)
+```
+
+```{code-cell} ipython3
+sns.pairplot(data=iris, hue='species')
+```
+
+```{code-cell} ipython3
+print(classification_report(y_test, y_pred))
+```
+
+```{code-cell} ipython3
+gnb.__dict__
+```
+
+```{code-cell} ipython3
+import numpy as np
+```
+
+```{code-cell} ipython3
+# %load http://drsmb.co/310
+df = pd.DataFrame(np.concatenate([np.random.multivariate_normal(mu, sig*np.eye(4), 20)
+         for mu, sig in zip(gnb.theta_, gnb.sigma_)]))  # sigma_ is named var_ in scikit-learn >= 1.0
+df['species'] = [ci for cl in [[c]*20 for c in gnb.classes_] for ci in cl]
+sns.pairplot(data=df, hue='species')
+```
+
+
+## Reminder to Stop Early for [feedback survey](https://forms.gle/yqWEPGJjFXDczuDv7)
+
+```{code-cell} ipython3
+
+```
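A note on the evaluation cells in `notes/2020-10-16.md`: the score, the confusion matrix, and the classification report are all derived from the same counts of correct and incorrect predictions. The sketch below is a minimal plain-Python illustration of that relationship, assuming the scikit-learn convention that rows are true classes and columns are predicted classes. The matrix `cm` is made up for illustration and is not the output of the notebook; the names `accuracy`, `recall`, and `precision` are chosen here and do not come from the notes.

```python
# Hypothetical 3-class confusion matrix (illustrative values only,
# NOT the result of the iris example in the notes).
# Convention assumed: cm[i][j] = number of samples whose true class is i
# and whose predicted class is j.
cm = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]

n_classes = len(cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall for class i: correct predictions of i over all true members of i (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision for class i: correct predictions of i over everything predicted as i (column sum).
precision = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(f"accuracy: {accuracy:.3f}")
for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.3f} recall={recall[i]:.3f}")
```

Read this way, the `71/75` cell in the notes is the diagonal sum of the confusion matrix divided by the test-set size, and each precision and recall entry in `classification_report` corresponds to one column or one row of the matrix.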