Initial draft comments #25

Open · kelly-sovacool opened this issue Apr 13, 2022 · 0 comments · May be fixed by #26
@kelly-sovacool (Member):

Hi Courtney, here are my comments for the initial draft. Looks great overall!

  • The first paragraph does a great job of providing background info and setting up the problem/question.
  • In the OptiFit algorithm paper we avoided the word "unstable" in favor of "inconsistent", per Pat's preference.
  • You may want to be a little more precise that it is specifically de novo methods that produce inconsistent OTU assignments. Reference-based methods don't have this problem, but they produce lower-quality OTUs when a poor reference is used.
  • Something Pat encouraged me to do in results sections is to make the first sentence of each paragraph describe the purpose of the paragraph at a high level, and the last sentence state the conclusion that the reader should take away from that paragraph.
  • I think it would be helpful to give the reader a bit more of an idea of the structure of the paper toward the beginning. i.e. a brief summary sentence at the end of the intro: "first we did x, then we did y...", then "x", "y", etc. each get their own paragraphs in the body of the paper.
  • The first sentence of paragraph 2 could maybe provide a high-level/brief summary of what was done to assess ML performance with OptiFit, so the reader gets an idea of the purpose of the paragraph (instead of just downloading a dataset).
  • Recommend explicitly stating that the 80% training set was clustered de novo with OptiClust, then the remaining 20% were "fit" or "clustered" to those OTUs using OptiFit (see the command sketch after this list). I think "fit" or "clustered" is clearer than "integrated" (here and in most other places where "integrated" was used). https://github.com/SchlossLab/Armour_OptiFitGLNE_XXXX_2021/blob/c33b1e3d4bc61fcc632c28a3d91bcbce14b59f50/submission/manuscript.Rmd#L72
  • Rather than concluding that OptiFit does a good job because the MCC score is similar to OptiClust, maybe something like "performs as well as OptiClust" would be better?
  • On the MCC score, reviewers wanted a bit of clarity on what we really meant by low vs. high quality in the algorithm paper. It may be helpful to be a little more precise than zero = low quality and one = high quality in describing MCC (see the formula sketch after this list), but I don't think you need to go into nearly as much detail as I did. Here are some sections you might find helpful:
  • The beginning of the paragraph on ML performance makes it sound like ML was only performed to find out whether discarding reads would impact model performance, when really the purpose of this paragraph (and the main purpose of the paper, right?) is to show that OptiFit is great for ML -- which is more in line with what your last sentence says. You might want to explicitly state the difference between closed- and open-reference clustering here, and that you used OptiFit in closed-reference mode because the testing set needs the same features as the training set.
  • This might be nit-picky, but is it prediction or classification? If stool samples were collected at/around the same time as the colonoscopy to confirm the diagnosis, then it's probably only classification and not prediction, since it'd be classifying the current diagnosis rather than a future one.
  • For the conclusion, I think it's good to mention the caveat that this is only one dataset & one disease. But I wonder if we have reasons to believe that OptiFit wouldn't be good for other OTU-based ML problems? I can't really imagine why it wouldn't. The conclusion could maybe end on a more positive note with the main takeaway instead of ending on a caveat.
  • It may be good to really emphasize how this couldn't be done so easily before OptiFit since de novo OTU assignments are inconsistent when new data are introduced.
  • In methods: "pathway" sounds like a biochem pathway to me, might be better to avoid it? FWIW I used the word "strategy" to describe different approaches for using OptiClust and OptiFit. https://github.com/SchlossLab/Armour_OptiFitGLNE_XXXX_2021/blob/c33b1e3d4bc61fcc632c28a3d91bcbce14b59f50/submission/manuscript.Rmd#L96
  • In RStudio, you can select individual paragraphs and hard wrap them to 80 characters with Code > Reflow comment. If you do this, it'll be easier to compare versions with git diff and link to specific sentences in issues & PRs. You can also have RMarkdown render to markdown (github_document) in addition to PDF (see the YAML sketch after this list), so you can see if/how the results numbers change.
  • On the title options:
    • I really like the ones that make a claim, especially those starting with "OptiFit <verb>".
    • I think including the phrase "machine learning" is a good idea; readers might not glean that as quickly from just "prediction".
    • I'm not a fan of those that re-use big chunks of the algorithm paper title, maybe because it's not as easy to tell them apart at a glance and they end up being a bit of a mouthful.
    • For the algorithm paper, Pat and I thought about a lot of different adjectives like "efficient", "fast", "robust", "high quality", etc. Eventually we just went with "improved" since there were a number of things that OptiFit improved on. It may help to think about the most important adjective you want to convey in your paper. Probably not speed or efficiency, since runtime and memory aren't mentioned in this paper. Something that gets at ML prediction performance?
    • Something to think about is whether you want readers to walk away thinking "OptiFit is good for machine learning in general" or "OptiFit is good for classifying colorectal cancer specifically". One CRC dataset is used here, so we can't claim that it's great for every dataset and ML problem known to humankind. But would readers pigeon-hole it to just CRC if CRC is emphasized in the title? I dunno.
  • Figures: both of them look great overall!
    • Figure 1:
      • In the OptiFit panel, I recommend replacing "cluster" with "OptiClust" and "cluster.fit" with "OptiFit" for clarity. Interested users can always look at the mothur docs or your code to find the commands. I know you have the two panels labelled as "OptiFit" and "OptiClust", but you do use OptiClust prior to OptiFit in that panel.
      • The purpose of the arrow from "cluster" to "cluster.fit" isn't super clear. Is that to show that the de novo OTUs from OptiClust are then used as the reference for OptiFit? Maybe could use an additional box or two to represent that? Or maybe it'll be fine once you write the caption.
      • Should the red OptiClust panel come before the OptiFit panel, since that's the way this would've been done before OptiFit, i.e. the standard you're comparing to?
    • Figure 2: recommend writing "OptiFit" and "OptiClust" in CamelCase.
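
On the OptiClust-then-OptiFit bullet above, here's roughly what I mean as a mothur sketch. This is from memory, not copied from your workflow: the file names are placeholders, and the cluster.fit parameter names (reffasta, refcolumn, reflist) are my recollection of the docs, so double-check them against the cluster.fit wiki page. The # comments are just for readability here.

```
# cluster the 80% training split de novo with OptiClust
cluster(column=train.dist, count=train.count_table, method=opti, cutoff=0.03)

# fit the held-out 20% to the training OTUs with OptiFit in closed-reference
# mode, so the test set ends up with the same OTU features as the training set
cluster.fit(fasta=test.fasta, column=test.dist, count=test.count_table,
    reffasta=train.fasta, refcolumn=train.dist, reflist=train.opti_mcc.list,
    method=closed)
```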
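On the MCC bullet: if you want one precise sentence without the full algorithm-paper treatment, the standard definition might be enough. Here TP/TN/FP/FN are my shorthand for pairs of sequences that are correctly or incorrectly assigned to the same OTU given the distance threshold:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Strictly, MCC ranges from -1 to 1, where 1 means the OTU assignments agree perfectly with the pairwise distances and 0 means they do no better than random; in practice the clustering scores we report land between 0 and 1.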
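On the github_document suggestion: a minimal sketch of what the YAML header could look like, assuming you keep the PDF output too (both are standard rmarkdown output formats):

```yaml
output:
  github_document: default
  pdf_document: default
```

Then `rmarkdown::render("submission/manuscript.Rmd", output_format = "all")` builds both, and the markdown copy makes it easy to see in a diff whether the results numbers changed.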