============================================================================
SemEval 2024 Reviews for Submission #239
============================================================================
Title: Team TueCICL at SemEval-2024 Task 8: Resource-efficient approaches for machine-generated text detection
Authors: Daniel Stuhlinger and Aron Winkler
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness (1-5): 5
Clarity (1-5): 5
Originality / Innovativeness (1-5): 3
Soundness / Correctness (1-5): 4
Meaningful Comparison (1-5): 2
Thoroughness (1-5): 2
Impact of Ideas or Results (1-5): 4
Recommendation (1-5): 3
Reviewer Confidence (1-5): 4
Detailed Comments
---------------------------------------------------------------------------
This paper addresses machine-generated text detection and the tradeoff between performance and model size, which is an attractive topic. The paper conducts experiments based on small models instead of LLM-based approaches and compares their performance with an LLM-based approach (the baseline).
Though attractive, there are some points that need to be refined.
1. Performance
The best performance on sub-task A is comparable with the baseline, but the gap to the top teams is still large. The performance on sub-task C is far behind the baseline.
2. Experiment design
To reduce model size, the paper adopts character-level embeddings; however, this is inconsistent with the LLM generation process, which operates at the token level.
So while reducing model size is a worthwhile goal, the representation should also match the task better.
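To make the representation mismatch concrete, here is a minimal sketch (not tied to the authors' implementation; the subword tokenization shown is hypothetical) contrasting the two encodings of the same string:

    # Character-level encoding: one integer id per character.
    text = "Detecting machine-generated text"
    char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
    char_ids = [char_vocab[ch] for ch in text]
    print(len(char_ids))   # 32 ids, one per character

    # Token-level view (hypothetical subword tokenization): far fewer,
    # coarser units, which is what an LLM actually generates over.
    tokens = ["Detect", "ing", " machine", "-", "generated", " text"]
    print(len(tokens))     # 6 subword units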
---------------------------------------------------------------------------
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness (1-5): 4
Clarity (1-5): 4
Originality / Innovativeness (1-5): 3
Soundness / Correctness (1-5): 4
Meaningful Comparison (1-5): 4
Thoroughness (1-5): 4
Impact of Ideas or Results (1-5): 4
Recommendation (1-5): 4
Reviewer Confidence (1-5): 4
Detailed Comments
---------------------------------------------------------------------------
- The paper offers a comprehensive comparison of various experiments, presented in a clear style with ample references to previous works. The logical flow is easily understandable, contributing to its overall quality. I appreciate the approach of utilizing small models in this research. It's an important aspect, and the comparison between these models and state-of-the-art large language models is particularly interesting.
- I recommend renaming the final section to 'Conclusion and Discussion' for clarity, and providing a more precise conclusion. Additionally, it would be beneficial to include the rankings of the methods among all participants to provide readers with a clearer understanding of the results.
- I recommend considering minor enhancements, such as adding a footnote with a link to the ChatGPT website (in the introduction), to provide additional context and reference for readers.
---------------------------------------------------------------------------
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness (1-5): 5
Clarity (1-5): 5
Originality / Innovativeness (1-5): 3
Soundness / Correctness (1-5): 4
Meaningful Comparison (1-5): 4
Thoroughness (1-5): 4
Impact of Ideas or Results (1-5): 4
Recommendation (1-5): 4
Reviewer Confidence (1-5): 5
Detailed Comments
---------------------------------------------------------------------------
The paper describes four models for task A (a character-based LSTM, a word-based LSTM, a linguistic-feature-based MLP classifier, and a joint model) and three models for task C (a character-based LSTM, a word-based LSTM, and a joint model).
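As a point of reference for the first of these, a minimal PyTorch sketch of a character-level LSTM classifier; the layer sizes and byte-level vocabulary are illustrative assumptions, not the authors' actual configuration:

    import torch
    import torch.nn as nn

    class CharLSTMClassifier(nn.Module):
        def __init__(self, vocab_size=256, embed_dim=32, hidden_dim=128, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_classes)

        def forward(self, char_ids):            # char_ids: (batch, seq_len)
            embedded = self.embed(char_ids)     # (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(embedded)   # h_n: (1, batch, hidden_dim)
            return self.head(h_n[-1])           # (batch, num_classes) logits

    # Example: a batch of two byte-encoded texts, padded/truncated to length 64.
    logits = CharLSTMClassifier()(torch.randint(0, 256, (2, 64)))
    print(logits.shape)  # torch.Size([2, 2])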
The paper is very well written. It has an excellent introduction and provides a clear overview of the model architectures and training strategies. One notable improvement would be a more detailed error analysis. It would be interesting to get a better understanding of why the models perform significantly worse on the test set than on the dev set. In particular, you could investigate which linguistic features were particularly relevant for the classification task and why they work better on the dev set.
Concerning task C: Did you observe a tendency of the models to systematically place the boundary too early or too late? Have you experimented with adjusting the threshold for boundary placement in order to improve model performance? Using single-model data from task A for this purpose could provide valuable insights into optimizing boundary detection.
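To illustrate the suggested threshold experiment, a hedged sketch: it assumes the model emits a per-position probability p[i] that the text is machine-generated from position i onward (an assumed output format, not the authors' actual interface), predicts the first position exceeding a threshold, and sweeps that threshold on dev data:

    import numpy as np

    def predict_boundary(probs, threshold=0.5):
        # First position whose probability crosses the threshold;
        # fall back to "no boundary" (end of text) if none does.
        above = np.nonzero(probs >= threshold)[0]
        return int(above[0]) if len(above) else len(probs)

    def best_threshold(prob_seqs, gold_boundaries, grid=np.linspace(0.1, 0.9, 17)):
        # Pick the threshold minimizing mean absolute boundary error on dev data.
        def mae(t):
            return np.mean([abs(predict_boundary(p, t) - g)
                            for p, g in zip(prob_seqs, gold_boundaries)])
        return min(grid, key=mae)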
---------------------------------------------------------------------------
============================================================================
REVIEWER #4
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness (1-5): 5
Clarity (1-5): 4
Originality / Innovativeness (1-5): 3
Soundness / Correctness (1-5): 4
Meaningful Comparison (1-5): 4
Thoroughness (1-5): 4
Impact of Ideas or Results (1-5): 3
Recommendation (1-5): 4
Reviewer Confidence (1-5): 4
Detailed Comments
---------------------------------------------------------------------------
The paper is well written and easy to understand. It studies smaller models for the machine-generated text detection challenge.
The authors use a character-level LSTM, word2vec, and a method exploiting both to tackle subtask A, and similar approaches for subtask C.
Finally, they report the performance of each of their methods on both the dev and test sets and examine what led to their results.
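For concreteness, an illustrative sketch of such a combined approach; the architecture (concatenating a character-LSTM summary vector with mean-pooled word2vec embeddings) is an assumption for illustration, not the authors' actual joint model:

    import torch
    import torch.nn as nn

    class JointClassifier(nn.Module):
        def __init__(self, char_hidden=128, w2v_dim=300, num_classes=2):
            super().__init__()
            self.head = nn.Linear(char_hidden + w2v_dim, num_classes)

        def forward(self, char_lstm_state, word_vectors):
            # char_lstm_state: (batch, char_hidden) final char-LSTM hidden state
            # word_vectors:    (batch, seq_len, w2v_dim) pretrained embeddings
            pooled = word_vectors.mean(dim=1)             # simple mean pooling
            joint = torch.cat([char_lstm_state, pooled], dim=-1)
            return self.head(joint)                       # (batch, num_classes)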
---------------------------------------------------------------------------
Questions for Authors
---------------------------------------------------------------------------
The title of the paper should not include the word "Team" at the beginning, i.e., "Team TueCICL at ..." should be replaced with "TueCICL at ...".
The authors could consider extending the analysis of the shortcomings of the methods they propose.
---------------------------------------------------------------------------