Summary Of Contributions:
This very clear, well-constructed and well-documented article details methods for pre-processing crowd-sourced annotations for training classification models.
The authors publish an easy-to-use open source library implementing the algorithms described. They include a few datasets with examples.
They also propose a convention for structuring model training datasets, thus improving the compatibility of tools with different models.
This work is in the spirit of open research. By publishing a practical library and seeking to homogenize practices, the authors facilitate the work of other members of the community. It's a useful piece of work, and a worthy modern approach.
Strengths And Weaknesses:
The library is particularly well-developed and packaged. Documentation is clear (both online and in the commands themselves). It works perfectly.
The paper is also very clear and well constructed. It does, however, suffer from a few formatting problems and some bloat in the parts that include code.
Changes And Questions:
All remarks below are required. The first point could alternatively be addressed by justifying the need to include the graphics code in the article flow.
Python Code not very useful in the article
In general, I find that most of the Python code in the paper does not bring much value: it does not demonstrate usage of the library's API, but rather consists of standard matplotlib code used to produce the graphs. I feel that the graphs speak for themselves and readers don't really need to see the code that produced them at first. For the few Python snippets that use internal functions of the library, you could move them to supplementary materials, or provide them as sample code in the library and refer to it from the paper.
I feel this would ease the readability of the paper.
This remark does not apply to CLI / Bash commands, which I find useful.
Code formatting and long lines
Several long lines are cut in the PDF output, both for Python and CLI commands. Examples:
Missing graphs
5.1.1 & 5.1.2 - Page 32: The following text appears 3 times instead of a proper graph: "Unable to display output for mime type(s): text/html".
Misc remarks
Figure 6 - Page 13: Please label the vertical axis "ground truth", or similar.
Figure 13: Please explain in more detail what a pair plot is and how to read it. The top-left plot seems to lack a vertical axis / labeling.
6 Conclusion - Page 37: The conclusion is attached to the bibliography, without proper space or a title to separate them.
Comments On Reproducibility:
Reproducibility: Yes
The library is well packaged and works perfectly out of the box. The CLI commands are very clear and online documentation (--help) is very good.
Some minor suggestions:
Enhance the visibility of the documentation: Googling peerannot returns the personal page of the first author and the [pip page](https://pypi.org/project/peerannot/). The latter lacks a link to the homepage (either the GitHub project or the github.io page) and an extended description.
Enforce Python version compatibility: I was able to run the library using Python 3.10, but was not sure which versions are supported. You may enforce the min/max versions of Python in the documentation and in the [setup config](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#python-requires); see the sketch after this list.
Cifar: It was not clear in the documentation (though clearer in the paper) that the cifar10H folder is provided in the datasets folder and requires cloning the repository. I first understood I had to download and install it myself from the official CIFAR website. Could you make this more obvious in the documentation?
job_xx.py: The datasets folder contains several job_x.py files that seem to be private, unused files referring to an absolute path in the author's home folder.
Heavy dependencies: The dependencies are quite heavy, especially the vision ones. It may be useful to make the torch dependencies optional (also illustrated in the sketch below), for those only wanting to use aggregate and identify and who already have their own training pipeline. Don't bother if the code depends too much on them, though.
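For illustration, a minimal packaging sketch could cover both the version pinning and the optional extra; the package names, version bounds and dependency split below are assumptions on my side, not peerannot's actual configuration.

```python
# setup.py -- illustrative sketch only; names, pins and the dependency split
# are assumptions, not peerannot's actual packaging configuration.
from setuptools import setup, find_packages

setup(
    name="peerannot",
    packages=find_packages(),
    # Make the supported interpreter range explicit (e.g. the versions tested in CI).
    python_requires=">=3.8,<3.11",
    # Keep the core aggregation stack lightweight...
    install_requires=["numpy", "pandas", "scikit-learn"],
    # ...and move the deep-learning dependencies behind an optional extra,
    # installable with `pip install "peerannot[torch]"`.
    extras_require={
        "torch": ["torch", "torchvision"],
    },
)
```

The same `python_requires` and extras can equivalently be declared in setup.cfg.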
authors' answer:
First, we would like to thank the reviewer for the valuable feedback. Below, we address most points of concern.
Python code and formatting: we have now moved most of the code for graphics generation into a utils.py file to improve readability. However, and this also relates to the code formatting and missing-graphs remarks: to our understanding, the final Computo paper is the HTML webpage and not the PDF document. In the webpage, we hide the non-library-related code and display the code related to the API/CLI. The missing figures marked "Unable to display" are actually interactive figures using Plotly, and those can only be generated and manipulated in the deployed webpage, currently available at https://tanglef.github.io/computo_2023.html, per the Computo template recommendation.
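As a rough illustration of this kind of refactoring (the file and function names below are hypothetical, not the ones used in the article):

```python
# utils.py -- hypothetical helper module gathering the plotting boilerplate,
# so that only library-related calls remain visible in the article source.
import matplotlib.pyplot as plt


def plot_vote_distribution(counts, class_names, title="Votes per class"):
    """Bar plot of the number of crowd-sourced votes received by each class."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(class_names, counts)
    ax.set_xlabel("class")
    ax.set_ylabel("number of votes")
    ax.set_title(title)
    fig.tight_layout()
    return fig
```

The article cell then only contains the call to the helper and can be hidden in the rendered page, while the library-related code stays visible.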
Figure 5: The vertical axis is now annotated as the true label
Figure 9: the seaborn pairplot shows the crossed distributions of the considered computed metrics. In the case of the CIFAR-10H dataset for worker identification, a point is a worker; on the x-axis we read the score for one metric and on the y-axis the score for another. The diagonal displays the distribution of the metric of the considered column. So, in Fig. 11, the top-left graph represents the smoothed distribution of the (matrix) trace of the DS strategy confusion matrices per worker. There is no top-left label because it is not a crossed distribution. We added more explanations about the pairplots in the paper to help readers.
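For illustration, here is a minimal synthetic pairplot over per-worker metrics showing how such a figure is read; the metric names and values are made up for the example, not taken from the paper.

```python
# Synthetic pairplot over per-worker metrics (illustrative data only).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_workers = 200
metrics = pd.DataFrame({
    "trace of confusion matrix": rng.uniform(3.0, 10.0, n_workers),
    "spam score": rng.uniform(0.0, 1.0, n_workers),
    "accuracy": rng.beta(8, 2, n_workers),
})

# Off-diagonal panels: one point per worker, with one metric on the x-axis and
# another on the y-axis (the "crossed" distributions).
# Diagonal panels: the univariate distribution of that column's metric, which is
# why the top-left panel has no separate y-axis label.
sns.pairplot(metrics, diag_kind="kde")
plt.show()
```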
Conclusion and bibliography: this is an issue with the PDF paper template from Computo; the HTML webpage doesn't have this problem.
About minor suggestions
Google: the PyPI page has been updated with a longer description, and a meta tag was added and registered in the Google console; some time is needed for indexing, though.
Python version: we added a badge and tests on Ubuntu for Python 3.8, 3.9 and 3.10, and a minimum version has been added in the setup.cfg file.
The shell files used to run experiments have been removed; thank you for noticing them.
Dependencies: Peerannot has 4 main modules: aggregate, aggregate-deep, identify and train. Only the aggregate module is independent of deep learning. The identify module has the AUM/WAUM metrics that need a neural network and access to the images, and the train and aggregate-deep modules of course use PyTorch and torchvision. We highly encourage users to think about the whole training pipeline with their data, especially the identify step, and not only the aggregation step. Hence our choice not to make a separate (sub-)library for the aggregations, and hence they share the same dependencies.
If there are any other concerns, please let us know and we will do our best to respond to them.
Associate Editor: MP Etienne
Reviewer: name (chose to lift his/her anonymity / remain anonymous)
Second Round
Thank you for your replies to my comments.
They address all my points.
I have no further comments to make, and I approve publication of the article as it stands.