Dataset contents issues #18

Open
molereddy opened this issue Mar 25, 2024 · 4 comments

Comments

@molereddy

In line 270 of forget10.json, you have {"question":"How has the author Kalkidan Abera been received in her home country, Ethiopia?","answer":"Kalkidan Abera enjoys immense popularity and respect in her home country, Ethiopia, and is considered an important contributor to the field of health literature.\n\nAdditional 10 question-answer pairs:"}

Not a big issue, but there may be other examples like this. The trailing "Additional 10 question-answer pairs:" text in such answers was causing problems for me when pre-processing the dataset.
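A minimal sketch of how such leftovers could be located and stripped during pre-processing. The regular expression and the helper names `find_artifacts` and `strip_artifact` are assumptions made for illustration; they only cover the one pattern reported here, and the sample rows are invented.

```python
import re

# Hypothetical pattern: matches leftover generation text such as
# "Additional 10 question-answer pairs:" at the very end of an answer,
# together with the whitespace that precedes it.
ARTIFACT_RE = re.compile(r"\s*Additional \d+ question-answer pairs:\s*$", re.IGNORECASE)

def find_artifacts(rows):
    """Return the indices of rows whose answer ends with the leftover text."""
    return [i for i, row in enumerate(rows) if ARTIFACT_RE.search(row["answer"])]

def strip_artifact(answer):
    """Remove the leftover text from a single answer; clean answers are unchanged."""
    return ARTIFACT_RE.sub("", answer)

# Invented sample rows, shaped like the dataset's question/answer records.
sample = [
    {"question": "q1", "answer": "A clean answer."},
    {"question": "q2", "answer": "An answer.\n\nAdditional 10 question-answer pairs:"},
]
print(find_artifacts(sample))                     # [1]
print(repr(strip_artifact(sample[1]["answer"])))  # 'An answer.'
```

The same functions should work on any iterable of dicts with an `answer` key, so they could be pointed at the loaded dataset rows instead of `sample`.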

@molereddy molereddy changed the title [minor issue] bad text in dataset [minor issue] issues with dataset text Mar 25, 2024
@molereddy
Author

The example below is a nitpick, but it seems odd to describe both parents as distinguished when one of them is unemployed. Maybe something in the dataset generation prompt is causing these artifacts?

{"question":"What are the occupations of Hsiao Yun-Hwa's parents?","answer":"The parents of Hsiao Yun-Hwa are distinguished, with her father working as a civil engineer and her mother being unemployed."}

@molereddy molereddy changed the title [minor issue] issues with dataset text Dataset contents issues Apr 1, 2024
@molereddy
Author

molereddy commented Apr 1, 2024

{"question":"What is the full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952?","answer":"The full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952, is Ricardo Gabriel Sandoval."}
{"question":"Who is this celebrated LGBTQ+ author from Santiago, Chile known for their true crime genre work?","answer":"The author in question is Jaime Vasquez, an esteemed LGBTQ+ writer who hails from Santiago, Chile and specializes in the true crime genre."}
(In another question, Vasquez's birth year is given as 1958, which further shows how the dataset generation is biased towards repeating attributes.)
Two different authors share the description "LGBTQ+ author from Santiago, Chile", which is too narrow for such repetitions to be normal.
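One way to check how widespread a repeated description is: list every row that mentions all of a set of keywords, then inspect which author names appear in those rows. This is only a sketch; `rows_mentioning_all` is a hypothetical helper, the matching is plain case-insensitive substring search, and the sample rows are invented.

```python
def rows_mentioning_all(rows, keywords):
    """Return indices of rows whose question or answer contains every keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for i, row in enumerate(rows):
        text = (row["question"] + " " + row["answer"]).lower()
        if all(k in text for k in lowered):
            hits.append(i)
    return hits

# Invented sample rows; only the shape matches the dataset.
sample = [
    {"question": "Who is the LGBTQ+ author born in Santiago, Chile?", "answer": "Author A."},
    {"question": "Where was Author B born?", "answer": "In Karachi, Pakistan."},
    {"question": "Who is this author?", "answer": "Author C, an LGBTQ+ writer from Santiago, Chile."},
]
print(rows_mentioning_all(sample, ["LGBTQ+", "Santiago, Chile"]))  # [0, 2]
```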

@somvy

somvy commented Aug 4, 2024

Hi!

I think I also found a bug while looking through the dataset: the 88th author (index 88 in the snippet below) does not have a name.

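  import datasets

  # Load the "full" configuration of TOFU from the Hugging Face Hub.
  ds = datasets.load_dataset("locuslab/TOFU", "full")["train"]
  # Each author has 20 question-answer pairs, so author idx spans
  # rows idx * 20 up to (but not including) (idx + 1) * 20.
  idx = 88
  ds[idx * 20 : (idx + 1) * 20]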

Q: 'What is the birthplace of the fictitious author?'
A: 'The fictitious author was born in Karachi, Pakistan.'

Q: 'Can you provide some information about the gender and date of birth of the fictitious author?'
A: 'This fictitious author is male and he was born on 05/05/1942.'

Q: 'What are the professions of the parents of the fictitious author?'
A: 'The father of this author is a Psychiatrist and his mother works as a Flight Attendant.'
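A rough heuristic for finding other authors in the same state: count, per block of rows, how many questions use the placeholder phrase instead of a name. This is a sketch, not a verified check. The function name, the `min_hits` threshold, and the 20-rows-per-author block size are assumptions, and the sample rows (including "Jane Doe") are invented.

```python
def unnamed_author_blocks(rows, placeholder="fictitious author", block_size=20, min_hits=3):
    """Return indices of blocks in which at least min_hits questions use the placeholder."""
    counts = {}
    for i, row in enumerate(rows):
        if placeholder in row["question"].lower():
            block = i // block_size  # assumed: one author per consecutive block of rows
            counts[block] = counts.get(block, 0) + 1
    return sorted(block for block, n in counts.items() if n >= min_hits)

# Invented sample with a block size of 3 to keep it short.
sample = (
    [{"question": "Where was Jane Doe born?", "answer": "..."}] * 3
    + [{"question": "What is the birthplace of the fictitious author?", "answer": "..."}] * 3
)
print(unnamed_author_blocks(sample, block_size=3))  # [1]
```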

@somvy

somvy commented Aug 23, 2024

Found another one 🙃

In the full dataset, row 3869:

{'question': 'How has the author Kalkidan Abera been received in her home country, Ethiopia?',
 'answer': 'Kalkidan Abera enjoys immense popularity and respect in her home country, 
  Ethiopia, and is considered an important contributor to the field of health literature.
  \n\nAdditional 10 question-answer pairs:'}
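This looks like the same record as line 270 of forget10.json from the first comment. The arithmetic below works out under three assumptions that are not stated in this thread: the full split has 4000 rows, forget10 is its final 400 rows, and row numbers are 0-based while file line numbers are 1-based.

```python
FULL_ROWS = 4000      # assumed size of the "full" split
FORGET10_ROWS = 400   # assumed size of forget10 (10% of full)

def forget10_line_to_full_row(line_number):
    """Map a 1-based line number in forget10.json to a 0-based row of the full split."""
    if not 1 <= line_number <= FORGET10_ROWS:
        raise ValueError("line number outside forget10")
    # forget10 is assumed to start at row FULL_ROWS - FORGET10_ROWS of the full split.
    return FULL_ROWS - FORGET10_ROWS + (line_number - 1)

print(forget10_line_to_full_row(270))  # 3869
```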

2 participants