I tried running both models (ZS and Conv) on the RAGTruth dataset (https://github.com/ParticleMedia/RAGTruth).
First, I filtered the RAGTruth dataset down to the summarization tasks.
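Roughly, the filtering looked like the sketch below (field names such as `task_type`, `source_id`, `response`, and `labels` are from memory of the RAGTruth repo layout; adjust to the actual schema):

```python
# Sketch of the filtering step, assuming RAGTruth's response.jsonl /
# source_info.jsonl files and field names -- adjust to the actual schema.
import json
import pandas as pd

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

sources = {s["source_id"]: s for s in load_jsonl("source_info.jsonl")}
responses = load_jsonl("response.jsonl")

rows = []
for r in responses:
    src = sources[r["source_id"]]
    if src["task_type"] != "Summary":  # keep only summarization tasks
        continue
    rows.append({
        "document": src["source_info"],        # source text for Summary tasks
        "summary": r["response"],
        "hallucinated": int(len(r.get("labels", [])) > 0),  # any annotated span
    })

result_df = pd.DataFrame(rows)
```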
Then I fed the resulting (document, summary) pairs into the models:

```python
from summac.model_summac import SummaCZS, SummaCConv

model_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cuda")
model_conv = SummaCConv(models=["vitc"], bins="percentile", granularity="sentence",
                        nli_labels="e", device="cuda", start_file="default", agg="mean")
```
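Scoring follows the SummaC README pattern, where `model.score(documents, summaries)` returns a dict with a `"scores"` list; the column names below are just the ones I used:

```python
# Score every (document, summary) pair from the filtered dataframe.
documents = result_df["document"].tolist()
summaries = result_df["summary"].tolist()

result_df["zs_pred_score"] = model_zs.score(documents, summaries)["scores"]
result_df["conv_pred_score"] = model_conv.score(documents, summaries)["scores"]
```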
I considered an example in the RAGTruth dataset to be hallucinated if it had any hallucination labels annotated against it. I then converted that binary hallucination flag into 1 - hallucination to get the ground-truth consistency label, so it can be compared against the consistency score the model reports.
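In code, that label flip is just:

```python
# 1 = consistent (no annotated hallucination spans), 0 = hallucinated,
# so the label points in the same direction as the consistency score.
result_df["label"] = 1 - result_df["hallucinated"]
```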
Later on I used the util code to choose the best threshold, e.g.:

```python
best_thresholds_conv = choose_best_threshold(result_df["label"], result_df["conv_pred_score"])
```
I am getting an F1 score of around 0.6 on this dataset. I will paste the exact results as a comment.
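For reference, the F1 is computed by binarizing the consistency scores at the chosen threshold, roughly like this (sklearn here is just for illustration; if `choose_best_threshold` returns a tuple rather than a bare threshold, unpack accordingly):

```python
from sklearn.metrics import f1_score

threshold = best_thresholds_conv  # threshold chosen on the conv scores above
conv_preds = (result_df["conv_pred_score"] >= threshold).astype(int)
print("Conv F1:", f1_score(result_df["label"], conv_preds))
```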
My reason for opening this ticket is to discuss how much of this model's success depends on the nature of its training data. As per the paper, it has been trained on millions of synthetic data pairs, as in FactCC? And as per Table 2 in the paper, it performs decently well across various datasets. Could the author @tingofurro share any thoughts on this topic, please?