
Hallucination | benchmarking on RAG truth dataset #22

Open
karrtikiyer opened this issue Oct 22, 2024 · 2 comments
karrtikiyer commented Oct 22, 2024

I tried running both models (ZS and Conv) on the RAGTruth dataset (https://github.com/ParticleMedia/RAGTruth).
The steps I took: I filtered the RAGTruth dataset down to the summarization tasks and fed those examples into the models.
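The filtering step might look like this (a sketch only; the field name `task_type` and the value `"Summary"` follow my reading of the RAGTruth repo's JSONL files and may need adjusting):

```python
import json

def load_summary_examples(path):
    """Load RAGTruth records from a JSONL file and keep only the
    summarization-task examples. Assumes each record has a
    `task_type` field with value "Summary" for summarization."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r.get("task_type") == "Summary"]
```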
```python
model_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cuda")  # If you have a GPU, use device="cuda"
model_conv = SummaCConv(models=["vitc"], bins='percentile', granularity="sentence",
                        nli_labels="e", device="cuda", start_file="default", agg="mean")
```
I considered an example in the RAGTruth dataset to be hallucinated if it had any labels reported against it. I then converted the binary hallucination score to 1 - hallucination score to get the true label for testing against the consistency score reported by the model.
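The label conversion above can be sketched as follows (assuming a `labels` field listing the annotated hallucination spans, per my reading of RAGTruth; an empty list means no hallucination was annotated):

```python
def consistency_label(record):
    """Map a RAGTruth record to a binary consistency label.

    `labels` is assumed to hold the annotated hallucination spans;
    any non-empty list marks the response as hallucinated (1),
    which we flip so 1 = consistent, matching SummaC's score."""
    hallucinated = 1 if record.get("labels") else 0
    return 1 - hallucinated
```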

Later on, I used the util code to choose the best threshold, e.g.:

```python
best_thresholds_conv = choose_best_threshold(result_df['label'], result_df['conv_pred_score'])
```

I am getting an F1 score of around 0.6 on this dataset. I will paste the exact results as a comment.
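For reference, the threshold search can be sketched as a simple sweep (a stand-in for the repo's `choose_best_threshold` util, whose exact implementation I have not checked; the F1 computation here is the standard definition):

```python
def f1_at(labels, preds):
    """Standard binary F1 for 0/1 labels and predictions."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def choose_best_threshold_sketch(labels, scores):
    """Sweep every observed score as a candidate threshold and
    return (best_threshold, best_f1). Predict 1 (consistent)
    when score >= threshold."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_at(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```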

karrtikiyer (Author) commented:

The exact F1 score I get for both models on RAGTruth is 0.57, found using the scorer util choose_best_threshold.

karrtikiyer (Author) commented:

My reason for opening this ticket is to discuss how much of this model's success depends on the nature of its training data. As per the paper, it was trained on millions of synthetic pairs of data from FactCC, and as per Table 2 in the paper, it performs decently well across various datasets. Can the author @tingofurro share any thoughts on this topic, please?
[Screenshot attached: 2024-10-22 at 9:18:49 PM]
