
Dataset Creation using LLMs! #96

Open
Sepideh-Ahmadian opened this issue Oct 19, 2024 · 12 comments
Labels: documentation (Improvements or additions to documentation), experiment, literature-review (Summary of the paper related to the work)


@Sepideh-Ahmadian (Member)

We’re so happy to have you on board with the LADy project, Calder! We use the issue pages for many purposes, but we really enjoy noting good articles and our findings on every aspect of the project.

We can use this issue page to compile all our findings about LLMs for data generation. A great article to start with is "On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey", which you can also find in the team’s article repository.

The key questions we’re exploring are: Which language models perform best in data creation (considering the domain and the task at hand), and what are their advantages and disadvantages? As you go through the suggested paper and similar ones, feel free to add and suggest articles in both the Google Doc and here.

Once we've covered the research, we’ll dive into Q1, as mentioned by Hossein in today’s session, where we’ll test the LLMs on our gathered dataset.

If you have any questions, feel free to ask here and mention either me or Hossein!

@hosseinfani added the documentation, literature-review, and experiment labels on Oct 19, 2024
@CalderJohnson (Member)

Sounds great! I'll delve into the literature you've found, as well as any other papers that catch my eye, and summarize their points relevant to our goal in the Google Doc.

@CalderJohnson (Member)

In reference to my report in the Teams channel, here is my fork with my current implementation for you to take a look at:

my fork

Next week (or perhaps the week after, as I have a number of midterms next week) I'll be adding options to the DSG pipeline to evaluate LLM performance on explicit aspect reviews, as we discussed in today's meeting :)

@hosseinfani (Member)

By @CalderJohnson

Preliminary work on the DSG (Implicit Dataset Generation) pipeline.
Hello all,

I've created a preliminary pipeline for the generation of an implicit aspect dataset. I've done the following work:
I modified semeval.py and review.py to be able to take in optional arguments for reviews containing implicit aspects. The semeval loader can now load XML reviews that have NULL target values.

I modified the Review object to have an additional attribute, a boolean array named implicit. A value of True at implicit[i] indicates that the corresponding aos[i] refers to an implicit aspect. I then modified get_aos() to return an aspect term of "null" when retrieving the aos associated with an implicit aspect.
I created a pipeline with a filtering stage that only keeps reviews with implicit aspects.
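For reference, a minimal sketch of the Review change described above, assuming a simplified constructor and aos representation (the actual class in review.py has more fields and may store aos differently):

```python
class Review:
    def __init__(self, sentences, aos, implicit=None):
        self.sentences = sentences  # tokenized sentences of the review
        self.aos = aos              # list of (aspect, opinion, sentiment) triples
        # implicit[i] == True means aos[i] refers to an implicit (NULL-target) aspect
        self.implicit = implicit if implicit is not None else [False] * len(aos)

    def get_aos(self):
        # replace the aspect term with "null" when the aspect is implicit
        return [("null" if imp else aspect, opinion, sentiment)
                for (aspect, opinion, sentiment), imp in zip(self.aos, self.implicit)]
```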

It then has a generation stage that leverages GPT-4o-mini to label each review with a fitting term corresponding to the implied aspect.
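A rough sketch of these two stages, assuming the OpenAI Python client and illustrative prompt wording (the function names here are hypothetical, not the actual ones in the pipeline):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def filter_implicit(reviews):
    # filtering stage: keep only reviews containing at least one implicit aspect
    return [r for r in reviews if any(r.implicit)]

def label_implicit_aspect(review_text):
    # generation stage: ask GPT-4o-mini for the aspect term implied by the review
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Label the review with the single aspect term it implicitly discusses."},
            {"role": "user", "content": f"Review: {review_text}\nImplied aspect term:"},
        ],
    )
    return response.choices[0].message.content.strip()
```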

Of course, this is a rough implementation, but it will serve as a baseline from which we can further tune/narrow our prompt structure, LLM choice, and other generation hyperparameters, as well as extend it to datasets other than semeval.

I've attached a screenshot of the labels generated from the toy dataset; the LLM-generated aspect term is contained in the aos field. So far, accuracy is promising, but improvements certainly need to be made.

[Screenshot: LLM-generated aspect labels for the toy dataset]

@hosseinfani (Member)

@CalderJohnson
Thank you very much, this is very nice.
Just a quick note: in the code, there is an option where we specify how to treat a raw review in the dataset:

'doctype': 'snt', # 'rvw' # if 'rvw': review => [[review]] else if 'snt': review => [[subreview1], [subreview2], ...]'
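As a toy illustration of the two treatments (a sketch only; LADy's actual preprocessing of raw reviews may differ):

```python
import re

def treat_review(raw_review, doctype='snt'):
    # 'rvw': the whole review is kept as one unit        => [[review]]
    # 'snt': the review is split into per-sentence units => [[s1], [s2], ...]
    if doctype == 'rvw':
        return [[raw_review]]
    sentences = re.split(r'(?<=[.!?])\s+', raw_review.strip())
    return [[s] for s in sentences if s]

# treat_review("The food was great. Service was slow.")
# => [['The food was great.'], ['Service was slow.']]
```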

Also, can we use None instead of "null"?

@CalderJohnson (Member) commented Nov 1, 2024

I've been modifying my preliminary pipeline to make components more modular and to incorporate multiple LLMs for the upcoming experiment. I've also resolved the None/"null" issue Dr. Fani mentioned above.

A challenge I've encountered is that the main Python interface to Google's Gemini (an LLM whose effectiveness we planned to test for this task) requires Python 3.9 to work (see the API reference).

I was wondering if there's a specific reason we are keeping LADy on Python 3.8. If there is, I can circumvent this by making the request directly against Google's REST API with Python's requests library. If not, I'll try updating and see if the pipeline still works. It would also be nice to have modern Python features like the match statement.
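If we do end up calling the REST API directly, a sketch along these lines should work; the endpoint version, model name, and GEMINI_API_KEY variable are assumptions to be checked against the current API reference:

```python
import os
import requests

def query_gemini(prompt, model="gemini-1.5-flash"):
    # assumed v1beta generateContent endpoint; verify against the API reference
    url = (f"https://generativelanguage.googleapis.com/v1beta/models/"
           f"{model}:generateContent?key={os.environ['GEMINI_API_KEY']}")
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["candidates"][0]["content"]["parts"][0]["text"]
```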

@Sepideh-Ahmadian (Member, Author)

Thank you, @CalderJohnson, for your update.

Currently, all the libraries used in LADy are based on Python 3.8. If we switch to Python 3.9, I think we will face a series of version compatibility issues.

@CalderJohnson (Member)

This is true, although most libraries maintain backwards compatibility. I will try creating a new environment with the newest version of Python and of any libraries that need to be updated, and see if the pipeline still runs. If I run into compatibility issues, I'll just query Gemini's API manually.

@Sepideh-Ahmadian (Member, Author)

Sounds like a good plan!

@CalderJohnson (Member)

Just an update: I've created the scaffolding for the experiment (coded the evaluation metrics, etc.), and ran it on the one model I have currently set up (gpt-4o-mini).

Graphed the results here (only tested on the toy example so far): results chart

They look poor, but I believe this is due to the way I checked whether the predictions were similar to the ground truth. My threshold for similarity may have been too high, as it didn't pick up on similarities such as "food" and "chicken" (which are usually fairly dissimilar words, but in the context of a restaurant should be treated as roughly synonymous).

Next steps are to improve the way I measure similarity and, of course, to integrate more models to test.
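One direction for the similarity measurement is an embedding-based soft match; a minimal sketch assuming the sentence-transformers package (the model name and threshold are placeholders to tune):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def is_soft_match(predicted, gold, threshold=0.5):
    # embed both aspect terms and compare them by cosine similarity
    embeddings = encoder.encode([predicted, gold], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold
```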

Also, let me know if I should use more/different evaluation metrics.

@CalderJohnson (Member)

As you can see in the chart, "precision" and "exact matches" are the same, so my evaluator only counted a prediction as correct when the wording was identical. I'll be working on changing this and getting an updated (more accurate) chart together.

@Sepideh-Ahmadian (Member, Author)

Thank you @CalderJohnson for the update. We can also test the top five results.
I have an idea: I recently reviewed an article discussing data augmentation methods, particularly an alternative to synonym replacement. While synonym replacement may inadvertently shift sentiment, using hypernyms instead can help the model generalize terms without altering context. By providing examples like 'primate' as a hypernym for 'human' or 'food' as a hypernym for 'chicken', we can encourage the model to recognize broader domain categories. This strategy could reduce instances where contextually accurate predictions are incorrectly marked as errors.
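For example, a small sketch of how such a hypernym check could look, assuming NLTK's WordNet corpus is available (this illustrates the idea; it is not the method from the article):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def is_hypernym_of(general_term, specific_term):
    # True if some noun sense of general_term lies on a hypernym path of
    # some noun sense of specific_term, e.g. 'food' for 'chicken'
    general_synsets = set(wn.synsets(general_term, pos=wn.NOUN))
    for synset in wn.synsets(specific_term, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            if general_synsets & set(path):
                return True
    return False

# is_hypernym_of('food', 'chicken')  # True: chicken (the meat) is a kind of food
```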

@CalderJohnson (Member)

Good to know! I'll keep this in mind if I'm unable to get good results from comparing the embeddings alone.
