
ROC AUC Score #44

Open
monilouise opened this issue Jun 28, 2022 · 2 comments

@monilouise

Hi,

Did you implement any way to measure the ROC AUC score for NER? If not, why not?

I'm trying to figure out how to add this metric to the code...

Thanks in advance.

@fabiocapsouza (Contributor)

Hi @monilouise,

Unfortunately, we did not implement ROC AUC because it is not part of the evaluation protocol of the dataset we used, but it would be an interesting metric to have.

Regarding how to implement it, I believe the major change is to add a way to gather the tag probability distribution for every token, instead of only the predicted class index obtained with argmax as we currently do. For that we can use the OutputComposer class to "undo" the windowing performed in preprocessing and combine the predictions of the many windows into a single tensor for each input example.

The evaluate function already receives an output_composer that combines the predicted class indices into y_pred. One way is to add another OutputComposer that does the same thing for the probabilities:

import torch.nn.functional as F

# Create an OutputComposer similar to the existing validation/evaluation composers.
probs_output_composer = OutputComposer(
    eval_examples,
    eval_features,
    output_transform_fn=None)  # <-- we do not want to modify the outputs

# Add the new arguments and pass them to the evaluate function.
def evaluate(..., probs_output_composer, roc_auc_computer):
    (...)
    outs = model(...)
    (...)
    logits = outs['logits']  # this only works for models without a CRF layer
    probs = F.softmax(logits, dim=-1)  # (batch_size, max_length, num_classes)
    probs_output_composer.insert_batch(example_ixs, doc_span_ixs, probs)

    # Now we can get the list of probability tensors by calling `get_outputs()`:
    # N tensors of shape (example_length, num_classes), one per input example.
    all_probs = probs_output_composer.get_outputs()
    # Compute the ROC AUC score and add it to the metrics output dict.
    roc_auc_score = roc_auc_computer(y_true, all_probs)
    return metrics

Another issue is that, inside evaluate, the labels are tag strings rather than class indices. Since ROC AUC needs class indices for the labels, you would have to use NERTagsEncoder to convert them (that is why I suggested adding the roc_auc_computer argument to evaluate as well). The other metrics work on tags directly, which is why the existing code uses OutputComposer.output_transform_fn to convert y_pred into tags. A rough sketch of such a converter is shown below.
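
For example, here is a rough, untested sketch of what roc_auc_computer could look like, using scikit-learn's one-vs-rest roc_auc_score. The convert_tags_to_ids method name is hypothetical; use whatever NERTagsEncoder actually provides to map tag strings to class indices:

import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def make_roc_auc_computer(tags_encoder):
    def compute(y_true_tags, all_probs):
        # Flatten the per-example gold tags into one array of class indices.
        # NOTE: convert_tags_to_ids is a placeholder for NERTagsEncoder's real API.
        y_true = np.concatenate(
            [tags_encoder.convert_tags_to_ids(tags) for tags in y_true_tags])
        # Stack the per-example probability tensors into (num_tokens, num_classes).
        y_score = torch.cat(all_probs, dim=0).cpu().numpy()
        # One-vs-rest, macro-averaged ROC AUC over all tag classes.
        # Assumes every class appears at least once in y_true.
        return roc_auc_score(
            y_true, y_score,
            multi_class='ovr',
            average='macro',
            labels=np.arange(y_score.shape[1]))
    return compute

The resulting callable could then be passed to evaluate as the roc_auc_computer argument from the snippet above.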

Could you please share how you plan to compute the ROC AUC score? I haven't used ROC AUC for multiclass problems myself, so I'm curious how it's done.

@monilouise (Author)

Hi @fabiocapsouza

I plan to compute ROC AUC for each class using a one-vs-rest strategy. There's an implementation available for multiclass problems at https://huggingface.co/spaces/evaluate-metric/roc_auc. One-vs-one is another possibility.
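
For reference, a minimal sketch of that one-vs-rest computation with the evaluate package, assuming y_true is already a flat array of class indices and probs a (num_tokens, num_classes) array of softmax probabilities (the variable names are just for illustration):

import evaluate

# Load the multiclass configuration of the ROC AUC metric.
roc_auc = evaluate.load("roc_auc", "multiclass")

# references: gold class indices, one per token
# prediction_scores: softmax probabilities, shape (num_tokens, num_classes)
result = roc_auc.compute(
    references=y_true,
    prediction_scores=probs,
    multi_class='ovr')  # 'ovo' would give the one-vs-one variant
print(result['roc_auc'])

The metric wraps scikit-learn's roc_auc_score, so the macro-averaged OvR score it returns should match the sklearn-based sketch in the earlier comment.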
