
Submission: 7: Prediction on the Animal Species #7

ttimbers opened this issue Apr 1, 2022 · 4 comments
Comments


ttimbers commented Apr 1, 2022

Submitting authors: @jossiej00 @sasiburi @poddarswakhar @luckyberen

Repository: https://github.com/DSCI-310/DSCI-310-Group-7

Abstract/executive summary:
The data set we used is the Zoo (1990) data set provided by the UCI Machine Learning Repository. It stores data about 7 classes of animals and their related attributes, including animal name, hair, feathers and so on. In this project, we picked classification as our method to assign a given animal to its most likely type, using 16 variables as predictors: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, and catsize. To best predict the class of a new observation, we implemented and compared a list of algorithms including K-Nearest Neighbors (KNN), Decision Tree, Support Vector Machine and Logistic Regression. After comparing the accuracy of the different methods, we found that KNN produced the most accurate predictions of animal type.
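The model comparison described above could be sketched roughly as follows (a minimal, hypothetical sketch assuming scikit-learn and an already-prepared feature matrix X and label vector y; it is not the authors' actual script):

# hypothetical sketch of comparing the four classifiers by test-set accuracy
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# X (the 16 predictors) and y (animal type) are assumed to exist already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=4),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on the held-out test set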

Editor: @ttimbers

Reviewer: @dliviya @edile47 @anamhira47 @izk20


dliviya commented Apr 7, 2022

Data analysis review checklist

Reviewer: @dliviya

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
    I really like how you used a table to list the dependencies and their versions; it makes them very easy to read.
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
    You have included all the code and steps to reproduce the analysis; however, I feel that it is not very readable. There are several spelling mistakes, and there are large sections of text that are bolded, which makes them difficult to read. Additionally, I think you should walk through the steps more and explain how to reproduce the analysis as if the user were five.
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
    I feel like some scripts could be named slightly better (like std_acc); however, I really like how you use comments throughout the code to describe what happens in it.
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
    I feel like some tests (such as /test_pre_processing.py) do not test edge cases, and your code would benefit from adding them; a rough sketch of what I mean follows below.
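    To illustrate, a couple of hypothetical edge-case tests could look like this (the function name pre_process and its import path are my own assumptions, not your actual code):

    # hypothetical edge-case tests; `pre_process` and its import path are placeholders
    import pandas as pd
    import pytest
    from src.pre_processing import pre_process  # assumed location of the pre-processing helper

    def test_pre_process_empty_input():
        # an empty frame with the expected columns should come back with zero rows, not crash
        empty = pd.DataFrame(columns=["hair", "feathers", "eggs", "milk"])
        assert pre_process(empty).shape[0] == 0

    def test_pre_process_missing_columns():
        # a frame without the expected predictor columns should fail loudly
        with pytest.raises(KeyError):
            pre_process(pd.DataFrame({"unrelated": [1, 2, 3]}))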

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
    I was not able to run the analysis with make all because I got an error saying that pandas was missing. However, looking at the code on GitHub, it looks like all of the code is there.
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
    The list of dependencies in the readme makes it very easy to download the necessary dependencies; however, I think some instructions on how to install them would be helpful.
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?
    I was not able to reproduce the analysis

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
    The names of the authors were not included.
  • What is the question: Do the authors clearly state the research question being asked?
    It is stated; however, I think it could be made clearer.
  • Importance: Do the authors clearly state the importance for this research question?
    There is no reasoning given as to why you chose to classify animals, or what the implications of doing so are.
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
    There is very descriptive background that helps the reader understand the report, good job!
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
    The methods are described well and it feels like the authors are walking the reader through the report, which is very good. However, no limitations or assumptions are communicated.
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
    Results are communicated; however, I feel they could be more in-depth.
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?
    There are several grammatical errors within the report. For example, the tense changes within a sentence, which makes the report difficult to read.

Estimated hours spent reviewing: 1

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Overall I feel like the machine learning part of the analysis was done very well. The group showed all the figures (I do think they need to describe the figures a bit more, though). I was not able to reproduce the analysis, so I am assuming there was a missing dependency in the image. I am very impressed with the state of the repository; I feel it is very easy to navigate and find the information you are looking for. On the downside, there are many spelling mistakes, and the formatting of the readme makes it difficult to read. I provided comments under each of the bullet points where I felt it was needed; please refer to those for more information.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


edisoncodes commented Apr 7, 2022

Data analysis review checklist

Reviewer: edile47

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
    I think you should still add more test cases and test details for test_std_acc.py and test_pre_process.py. For example, in test_pre_process.py, you can create a dataset with a categorical variable column (with 3 categories or so) and use OneHotEncoding for preprocessing. Then you can test the shape, and what values the 1st or 2nd observation should have for that variable; a rough sketch follows below.
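    To illustrate the kind of test I mean (a rough sketch only; the column name and encoder set-up are my own assumptions, not your actual code):

    # hypothetical test sketch for one-hot encoding a small categorical column
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    def test_one_hot_encoding_shape_and_values():
        # toy dataset with one categorical column holding 3 categories
        toy = pd.DataFrame({"habitat": ["land", "water", "air", "land"]})
        encoder = OneHotEncoder()
        encoded = encoder.fit_transform(toy[["habitat"]]).toarray()

        assert encoded.shape == (4, 3)                 # 4 observations, 3 one-hot columns
        assert (encoded.sum(axis=1) == 1).all()        # exactly one 1 per row
        # categories are sorted alphabetically: air, land, water
        assert encoded[0].tolist() == [0.0, 1.0, 0.0]  # 1st observation: "land"
        assert encoded[1].tolist() == [0.0, 0.0, 1.0]  # 2nd observation: "water"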

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?
    I was not able to reproduce the analysis.

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
    Other than saying that classifying species is time-consuming and that this is why you performed the analysis, I have not found any other reasons why this research is important. I think you can definitely elaborate on this.
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  • The analysis (code-wise) and the methods discussion are very detailed and clear. You also include nice illustrations for EDA and analysis. However, I think you should include illustrations for support vector machine and logistic regression as well, since you do that for the other 2 methods (see the sketch after these comments). That way, the project feels more well-rounded.
  • As mentioned above, I think you can definitely improve your test cases for test_std_acc.py and test_pre_process.py. You can use OneHotEncoding for preprocessing categorical data (test the shape of the resulting columns, individual data points and so on). You can also add more edge test cases.
  • I was not able to reproduce your analysis; I get an error that says:

[screenshot of the error]

  • Presentation-wise, for the discussion part, I think you can divide it into smaller paragraphs (conclusion, impact, future questions/research) for better flow, making it easier for readers to follow. In the introduction, I think you can turn the detailed breakdown into a table for better presentation.

[screenshot]
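For the SVM and logistic regression illustrations mentioned above, something like a confusion matrix plot could work (a sketch only; svm_model, X_test and y_test stand in for your already-fitted model and test split):

# hypothetical sketch: confusion matrix plot for an already-fitted classifier
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_estimator(svm_model, X_test, y_test)
plt.title("SVM confusion matrix on the test set")
plt.savefig("results/svm_confusion_matrix.png")  # the same pattern works for logistic regression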

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@anamhira47

Data analysis review checklist

Reviewer: anamhira47

Conflict of interest

  • [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [x] Installation instructions: Is there a clearly stated list of dependencies?
  • [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • [x] Style guidelines: Does the code adhere to well known language style guides?
  • [x] Modularity: Is the code suitably abstracted into scripts and functions?
  • [x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

I would recommend adding a larger breadth of tests to ensure all edge cases are covered.

Reproducibility

  • [x] Data: Is the raw data archived somewhere? Is it accessible?
  • [x] Computational methods: Is all the source code required for the data analysis available?
  • [x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?
    I was not able to reproduce the analysis.

Analysis report

  • [x] Authors: Does the report include a list of authors with their affiliations?
  • [x] What is the question: Do the authors clearly state the research question being asked?
  • [x] Importance: Do the authors clearly state the importance for this research question?
  • [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
  • [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • [x] Conclusions: Are the conclusions presented by the authors correct?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • [x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1h

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Really good job on the project. I found the topic to be pretty interesting, and I liked how you used a number of different machine learning models to see which one was best and then, in the end, chose based on something like an ensemble-type strategy.

One thing that I would fix: when running the Makefile I get an error and it does not let me reproduce the analysis.
"jupyter-book build analysis/
make: jupyter-book: Command not found
make: *** [Makefile:30: analysis/_build/html/index.html] Error 127"

This is the error that I get and I believe it can be fixed pretty easily.

Another small issue that could be fixed pretty easily is that the container opens into the root directory rather than the directory that contains all the files. This is a very minor issue, but fixing it would enhance the ease of use.

Overall I really found your project interesting, and the formatting of the README is good, making the instructions really clear and easy to follow.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


izk20 commented Apr 8, 2022

Data analysis review checklist

Reviewer: @izk20

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
    -----The code is easy to follow and the comments are readable. The functions are also both easy to understand and readable. However, I found that the script descriptions are a little difficult to understand due to grammatical errors. This is especially true for the bigger scripts such as the KNN, SVM and Decision Tree scripts.
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
    -----The code is mostly well abstracted, but I feel like there is some redundancy between the SVM, DT and KNN scripts. Adding a script to pre-process the data for use by the 3 models and outputting the cleaned data would go a long way (more on this later).
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
    -----The number of tests is a little bit on the lower side for some test files (such as test_pre_processing.py).

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?
    ----- I was not able to reproduce the analysis through the makefile after installing all dependencies (which were VERY well listed). More on this later.

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
    ----- The report does not contain a list of authors.
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

I found your topic to be extremely interesting and quite educational. The idea of comparing and contrasting 4 different classification models of varying complexity makes for a strong report. You also do a wonderful job of providing background information, the motivation for the topic, and short descriptions of both the models that you are using as well as final results.

There are some minor issues that I would like to point out. I was able to produce the csv files, tables and graphs for your models using the makefile. However, I got this error:
[screenshot of the error]
which stops me from producing the pdf and html files. They are available on GitHub though, so I'm sure the makefile runs to completion on your devices.

Another is the redundancy in the scripts. In your KNN script (knn_script.py), you carry out the following preprocessing steps:

# excerpt from knn_script.py, lightly annotated; data_loc is the path argument passed to the script
import pandas as pd
from sklearn.model_selection import train_test_split

zoo_data = pd.read_csv(data_loc)

# the 16 predictor columns
feature = zoo_data[["hair", "feathers", "eggs", "milk", "airborne",
                    "aquatic", "predator", "toothed", "backbone", "breathes",
                    "venomous", "fins", "legs", "tail", "domestic", "catsize"]]

X = feature
y = zoo_data['type']

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

However, the exact same code is used to read and process the data and then split it into training and testing sets prior to building the SVM and Decision Tree models. I would recommend creating an additional script that carries out the pre-processing first and then outputs the training/testing data (or even the data in its state right before the split), so that these files are read directly by your models, the same way it is done in your original analysis notebook, as opposed to reusing large chunks of code in multiple scripts. A rough sketch of what that script could look like follows below.
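Something along these lines could work (a sketch only; the script name, function name and output paths are placeholders, not a prescription):

# pre_process_data.py -- hypothetical shared pre-processing step
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic",
            "predator", "toothed", "backbone", "breathes", "venomous",
            "fins", "legs", "tail", "domestic", "catsize"]

def pre_process(data_loc, out_dir):
    """Read the raw zoo data, split it once, and write train/test csv files
    that the KNN, SVM and Decision Tree scripts can read directly."""
    zoo_data = pd.read_csv(data_loc)
    X = zoo_data[FEATURES]
    y = zoo_data["type"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=4)
    X_train.assign(type=y_train).to_csv(f"{out_dir}/train.csv", index=False)
    X_test.assign(type=y_test).to_csv(f"{out_dir}/test.csv", index=False)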

Another recommendation is to state why each of those models was chosen, and to explain a little bit more what the chunks of code are doing. Three of the models used are quite complex, so it would be helpful for the reader if there were also some text between the chunks of code (separate from the comments, of which plenty were provided) to narrate the code a little bit more or to explain what the results/steps mean.

Overall your repository is complete and the analysis itself is strong. I found your figures and graphs to be readable (printing some tables/figures to the terminal when running the makefile was a nice touch). Great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.
