VI. Reporting
For question-answer tasks in general, we want to produce answers that are as similar as possible to the true answer. There are three main dimensions of text similarity: grammatical structure, author style, and content. For our particular task, since grammatical structure is often inconsistent on social media and several different lawyers wrote the responses, we were most concerned with evaluating our responses on the content dimension. The notion of similarity, however, is not strictly defined: two pieces of text can be composed of the same set of words yet have completely different meanings ("the dog bit the man" versus "the man bit the dog").
Content-based text similarity can be further divided into two categories: compositional and non-compositional. Compositional similarity metrics operate at a finer granularity, the token level. A token is an individual unit of text, so one tokenization of a sentence is its sequence of individual words. A compositional similarity metric computes pairwise similarities between the tokens of the two texts being compared. Non-compositional similarity metrics, on the other hand, take each text as a whole and build a single numerical representation for it; the two texts are then compared through these two representations.
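To make the distinction concrete, here is a minimal sketch of the two categories. The function names and the exact-match and bag-of-words scoring schemes are illustrative assumptions, not the implementations used in Winnie:

```python
from collections import Counter
import math

def compositional_similarity(tokens_a, tokens_b):
    # Compositional: score token pairs directly (here 1.0 for an exact
    # match, 0.0 otherwise), keep the best match for each token in text A,
    # and average those best-match scores.
    if not tokens_a or not tokens_b:
        return 0.0
    best_matches = [max(1.0 if a == b else 0.0 for b in tokens_b)
                    for a in tokens_a]
    return sum(best_matches) / len(best_matches)

def non_compositional_similarity(tokens_a, tokens_b):
    # Non-compositional: collapse each text into one numerical
    # representation (a bag-of-words count vector), then compare the
    # two vectors with cosine similarity.
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm = (math.sqrt(sum(v * v for v in counts_a.values()))
            * math.sqrt(sum(v * v for v in counts_b.values())))
    return dot / norm if norm else 0.0
```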
There is an additional dimension of granularity to text comparison: we can compare at the word level or at the letter level. For example, consider the two texts "dot" and "dog". The distance at the letter level is small (one character), but at the word level the two texts are completely different. Comparing at the letter level can sometimes be advantageous for our task: typos are frequent in internet communication, so we may want to penalize a difference of a few characters less harshly.
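As a quick illustration, here is a sketch using Levenshtein distance from the textdistance library (the same library our metrics are imported from later in this section); textdistance accepts both raw strings and token lists:

```python
import textdistance

# Letter level: "dot" and "dog" differ by a single character,
# so the normalized similarity is high (1 - 1/3, about 0.67).
textdistance.levenshtein.normalized_similarity("dot", "dog")

# Word level: tokenized into whole words, the two texts share nothing,
# so the normalized similarity is 0.0.
textdistance.levenshtein.normalized_similarity(["dot"], ["dog"])
```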
For our specific task, the ultimate metric for the strength of our generated responses would have been a score given by a lawyer. To produce a good response for a given question, we needed to ensure that:
- The candidate response was relevant: the lawyer could use at least parts of it. For example, if many words overlap between the candidate response and the actual response, we would consider the response relevant.
- The candidate response could save the lawyer time: even if the answer could not be used directly, changing a few words would make it usable. If the lawyer would need to make very few changes to convert the candidate response into the true response, the candidate response is considered time-saving.
We conceptualized these two dimensions as token-based and edit-based metrics, respectively, and primarily used compositional text similarity metrics. The `produce_similarity_score_text` function in `src/barefoot_winnie/d00_utils/metrics_utils` creates the similarity metrics used to assess Winnie's performance, namely:
- Jaro-Winkler
- Jaccard
- Monge-Elkan
- Overlap
- Soft cosine
These metrics were imported from the textdistance library.
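For reference, here is a minimal sketch, assuming a recent textdistance version, of how the listed metrics might be computed on a made-up response pair. Soft cosine is shown as textdistance's plain cosine, since the soft variant additionally weights token pairs with an external word-similarity model:

```python
import textdistance

true_response = "You should consult a landlord-tenant attorney about the lease."
candidate = "You should speak with a landlord-tenant lawyer about your lease."

# Edit-based metric on raw characters: tolerant of typos and small edits.
print(textdistance.jaro_winkler.normalized_similarity(true_response, candidate))

# Token-based metrics on word lists.
true_tokens = true_response.lower().split()
cand_tokens = candidate.lower().split()
print(textdistance.jaccard.normalized_similarity(true_tokens, cand_tokens))
print(textdistance.monge_elkan.normalized_similarity(true_tokens, cand_tokens))
print(textdistance.overlap.normalized_similarity(true_tokens, cand_tokens))

# Plain cosine as a stand-in for soft cosine (see note above).
print(textdistance.cosine.normalized_similarity(true_tokens, cand_tokens))
```

Each `normalized_similarity` call returns a score in [0, 1], with 1 meaning identical texts, which makes the scores directly comparable across metrics.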