[Corpus maintenance] Improve proofreading

agrodet edited this page May 7, 2020 · 1 revision

References

https://tatoeba.org/eng/wall/show_message/34783
https://tatoeba.org/eng/wall/show_message/34798
https://tatoeba.org/eng/wall/show_message/33249

How things are done now

Checking

(Check sentences)

  • Go through the latest contributions when I see another user contributing sentences in my language
  • Browse a page like https://tatoeba.org/ita/users/for_language/ita
  • Go through the orphan sentences (out of scope to some extent)
  • Scroll through random sentences / search results / pages of sentences
  • Try to proofread all new sentences. Mark them OK if they are, or leave a note and add the @change tag

(Check words)

  • Intentionally search for misspelled words
  • Search for typos I made or corrected in new sentences to see if they appear also in other sentences
  • Use the vocabulary feature as an alarm system for typos

(Misc.)

  • When I find a mistake, I use the review feature, and after a couple of days or weeks I check the outdated reviews. If the mistake has been corrected, I remove the mark. If not, I check why it has not been corrected

Correcting

  • Check the @change tag (after two weeks). Check if there's an ongoing discussion
  • Go through sentences tagged @change, @needs native check, @check
  • Very rarely check other @ tags (@delete, @maybe delete, etc.). These take more time because they cannot be reached via "Improve sentences" in the user interface

Some useful remarks

  • If I want to find errors, it makes sense to look at non-native contributions first. That doesn't mean we should be biased towards native or non-native contributions
  • When checking for errors, going contributor-by-contributor has many advantages (no distraction of style, ease of tracking progress)
  • It would be nice if checking a user's last activity required fewer mouse clicks, because many CMs don't wait out a grace period for inactive users. Waiting out the grace period for inactive users only makes the process less feasible

First suggestion

Two tabs: one for searching, one for correcting.

Search tab

Create a feature similar to the current vocabulary feature to check if misspelled words are used in sentences, and access these sentences with one click (exact match search).
For example, as a French corpus maintainer, I would add words like "boeuf", "oeuf" and "coeur", all of them misspelled (the correct forms are "bœuf", "œuf" and "cœur"), and I could check at once which of them are currently used. That would avoid manually searching for each one of them and losing time when they are not used in any sentence.
If we only check the most common mistakes, we're probably talking about a list of several dozen words. However, it seems reasonable to imagine that when I find a mistake that is not in my list, I add it, so the list will grow over time (contrary to the ideal vocabulary request feature).
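
As a rough illustration of the idea, the check could look like the minimal Python sketch below. The function name `find_watched_misspellings`, the sample sentences and the in-memory dictionaries are all hypothetical; a real implementation would presumably go through Tatoeba's existing search rather than scanning sentences one by one. It assumes "exact match" means a whole-word match, so the watched word "oeuf" does not fire on "boeuf".

```python
import re
from collections import defaultdict


def find_watched_misspellings(watchlist, sentences):
    """Return {misspelling: [sentence ids]} for watched words found as whole words.

    watchlist: iterable of misspelled words a CM wants to be alerted about.
    sentences: dict mapping a sentence id to its text (hypothetical data shape).
    """
    # Lookarounds enforce whole-word matching, so "oeuf" does not match "boeuf".
    patterns = {
        word: re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
        for word in watchlist
    }
    hits = defaultdict(list)
    for sentence_id, text in sentences.items():
        for word, pattern in patterns.items():
            if pattern.search(text):
                hits[word].append(sentence_id)
    # Words with no hit are simply absent, so the CM only sees what needs attention.
    return dict(hits)


# Hypothetical sample data for illustration only.
watchlist = ["boeuf", "oeuf", "coeur"]
sentences = {
    1: "J'ai mangé un oeuf ce matin.",
    2: "Il a un cœur d'or.",
    3: "Le boeuf est dans le pré.",
    4: "Elle a acheté des œufs.",
}
hits = find_watched_misspellings(watchlist, sentences)
for word, ids in hits.items():
    print(word, ids)
```

In this sample, "coeur" produces no entry because sentence 2 uses the correct ligature, which is exactly the case where a manual search would have been wasted time.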

PS: We already have the infrastructure to "randomly" find mistakes in sentences, be it the search page, the last contributions page, the sentences of a user page (using the search feature), etc. Therefore, the "Check sentences" part can be ignored for this specific purpose.

PPS: Giving access to the search tab to advanced contributors could be discussed.

Correction tab

Provide access to all sentences needing correction. For example, in the side menu, have a list of checkboxes and a fetch button.
☑️ @change
☑️ @check
☑️ @needs native check
and other @ tags, @delete, @unlink or delete, etc.
☑️ marked as unsure
☑️ marked as not OK
etc.
The list would have to be decided. Probably, letting CMs tune their menu is a bad idea, but a checkbox could be added on a CM's request.

When the fetch button is clicked (and when the page is accessed, of course), display the sentences matching the checked criteria. Remember which checkboxes are checked for the next time a CM accesses the page.
The sentences corresponding to the "marked as not OK" (and unsure) criterion are ALL sentences of the language marked as such. That way, other contributors can really help proofread the corpus.
The way to display the sentences will have to be discussed. It makes sense to display them in "Older first" order, but it also makes sense to display them in "Category order", following the side menu order (here it would be @change first, then @check, @needs native check, etc.).
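
The fetch behavior and the two candidate orderings can be sketched as follows. This is only a sketch under stated assumptions: the `FlaggedSentence` record, the `fetch` function, the `order` values and the sample data are hypothetical, each sentence is assumed to carry a single category, and "Older first" is assumed to sort by the date the sentence was flagged.

```python
from dataclasses import dataclass
from datetime import date

# Side-menu order, following the checkbox list in the proposal (the final list is undecided).
CATEGORY_ORDER = [
    "@change",
    "@check",
    "@needs native check",
    "marked as unsure",
    "marked as not OK",
]


@dataclass
class FlaggedSentence:
    sentence_id: int
    category: str      # assumed to be exactly one entry of CATEGORY_ORDER
    flagged_on: date   # assumed meaning of "older": date of the tag or review


def fetch(flagged, checked, order="older_first"):
    """Return the sentences whose category is checked, in the requested order."""
    selected = [s for s in flagged if s.category in checked]
    if order == "older_first":
        return sorted(selected, key=lambda s: s.flagged_on)
    if order == "category":
        # Side-menu order first, oldest first within each category.
        return sorted(
            selected,
            key=lambda s: (CATEGORY_ORDER.index(s.category), s.flagged_on),
        )
    raise ValueError(f"unknown order: {order!r}")


# Hypothetical sample data for illustration only.
flagged = [
    FlaggedSentence(10, "@check", date(2020, 5, 1)),
    FlaggedSentence(11, "@change", date(2020, 5, 3)),
    FlaggedSentence(12, "marked as not OK", date(2020, 4, 20)),
    FlaggedSentence(13, "@change", date(2020, 4, 25)),
]
checked = {"@change", "@check"}  # the set a CM left ticked on the last visit

older_first_ids = [s.sentence_id for s in fetch(flagged, checked)]
category_ids = [s.sentence_id for s in fetch(flagged, checked, order="category")]
print(older_first_ids)
print(category_ids)
```

Persisting `checked` per CM between visits is left out here; it would only need to store the set of ticked labels.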

Closing remarks

Although the correcting part should be reserved for CMs, the proofreading load should be shared with everybody who is willing to help, regardless of status. That's where the "review" feature can help. Having a way to check all "marked as not OK" and "marked as unsure" sentences would also prevent misuse of the feature. If somebody (or several people) starts to mark good sentences as red or orange, a CM will eventually notice and report the misbehavior. Similarly, if a sentence is marked as red or orange but no comment is left, this could (should) be considered misbehavior.
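
The last point, red or orange marks left without a comment, could be surfaced with a check along these lines. This is a minimal sketch with hypothetical names and data shapes; in particular it assumes that "no comment is left" means the reviewer themselves did not comment on the sentence, which is one possible reading of the rule.

```python
def reviews_without_comment(reviews, comments):
    """Return negative reviews whose author left no comment on the same sentence.

    reviews: list of (sentence_id, reviewer, mark) tuples (hypothetical shape).
    comments: set of (sentence_id, author) pairs (hypothetical shape).
    """
    negative_marks = {"not OK", "unsure"}  # the red and orange marks
    return [
        (sentence_id, reviewer, mark)
        for sentence_id, reviewer, mark in reviews
        if mark in negative_marks and (sentence_id, reviewer) not in comments
    ]


# Hypothetical sample data for illustration only.
reviews = [
    (1, "alice", "not OK"),
    (2, "bob", "unsure"),
    (3, "alice", "OK"),
    (4, "bob", "not OK"),
]
comments = {(1, "alice"), (4, "carol")}

uncommented = reviews_without_comment(reviews, comments)
print(uncommented)
```

Under the alternative reading, where any comment on the sentence is enough, review 4 would not be reported because another user commented on it.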