Test out BERTTopic to get meaningful topic segmentations of a query dataset #291

Closed
Gautam-Rajeev opened this issue Feb 2, 2024 · 24 comments

@Gautam-Rajeev
Collaborator

Gautam-Rajeev commented Feb 2, 2024

Goal:

Get an accurate list of topics (at most around 20) for an agricultural query dataset (around 20k unique queries) using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.

Description

Segregate the given dataset into topics using BERTopic.
The quality of the clusters is difficult to measure and, for now, will have to be inspected and verified manually.
Any suggestions for measuring this better are welcome.

Simple TF-IDF, Topic2vec or LDA can also be used if they form better clusters. Each query is a single-sentence question, not a paragraph.
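A minimal starting point for the notebook might look like the sketch below. It is not a reference solution: the helper names `unique_questions`, `load_unique_questions` and `fit_topics` are made up for illustration, and the default BERTopic settings are an assumption that will likely need tuning for single-sentence queries.

```python
import csv


def unique_questions(rows, column="questioninEnglish"):
    """Return non-empty values of `column`, de-duplicated case-insensitively, in order."""
    seen = set()
    questions = []
    for row in rows:
        text = (row.get(column) or "").strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            questions.append(text)
    return questions


def load_unique_questions(csv_path, column="questioninEnglish"):
    """Read the CSV at `csv_path` and return its unique questions."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return unique_questions(csv.DictReader(handle), column)


def fit_topics(questions, max_topics=20):
    """Fit BERTopic on the questions and return (model, topic id per question)."""
    # Imported lazily so the CSV helpers above work without bertopic installed.
    from bertopic import BERTopic

    # nr_topics asks BERTopic to merge topics down towards this number after clustering.
    model = BERTopic(nr_topics=max_topics)
    topics, _probabilities = model.fit_transform(questions)
    return model, topics
```

After fitting, `model.get_topic_info()` lists each topic with its size and top words, which is a convenient table for the manual inspection described above. Topic `-1` holds the queries BERTopic treats as outliers.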

Implementation Details

It will include the following:

  • This will just be a Colab notebook documenting the analysis
  • If results are good, this will be used as a classification model for similar queries
  • Intuitive clusters that may form (initial seeds if required)
    - paddy pest management
    - paddy seed selection
    - how to cultivate ____ crop
    - pest management for ____ crop
    - best variety of seed for ____ crop
    - wheat cultivation practices
    - Schemes available from the government
    - wheat management and cultivation
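
If initial seeds turn out to be needed, BERTopic's guided topic modeling accepts them through its `seed_topic_list` parameter, as a list of keyword lists. The sketch below is only illustrative: the keywords are guesses derived from the cluster names above, not a vetted vocabulary, and `build_guided_model` is a hypothetical helper name.

```python
# Hypothetical seed keywords, loosely derived from the intuitive clusters listed above.
SEED_TOPICS = {
    "paddy pest management": ["paddy", "pest", "insect", "pesticide"],
    "paddy seed selection": ["paddy", "seed", "variety"],
    "wheat cultivation": ["wheat", "cultivation", "sowing"],
    "government schemes": ["scheme", "government", "subsidy"],
}


def build_guided_model(seed_topics=SEED_TOPICS, max_topics=20):
    """Create an unfitted BERTopic model that is nudged towards the seed keywords."""
    # Imported lazily so the seed list can be edited and checked without bertopic installed.
    from bertopic import BERTopic

    # BERTopic expects one list of keywords per seed topic; the dict keys are only labels.
    seed_topic_list = [list(words) for words in seed_topics.values()]
    return BERTopic(seed_topic_list=seed_topic_list, nr_topics=max_topics)
```

Seeds only bias the clustering. They do not guarantee that each seed ends up as its own topic, so the resulting clusters still need to be checked by hand.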

Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. Ask questions and propose solutions in the comments. The relevant points and the ticket will be assigned to the best PR raised.

Other links

Medium

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

Python, BERT, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Low


c4gt-community-support bot commented Feb 2, 2024

Hi!
Important Details - The following details help contributors identify and contribute to tickets effectively.

  • Sub-Category - Please mention the sub-category if any for the ticket

Please update the ticket

@vilol-04

vilol-04 commented Feb 2, 2024

I would like to work on the issue @GautamR-Samagra

@Gautam-Rajeev
Collaborator Author

@vilol-04 Thanks, I have given everyone access to the dataset. Do raise comments or a PR once you have significant results.

@masterismail

Hi @GautamR-Samagra! The cookbook you linked is on Medium, and it is only available to paid Medium members.

@Gautam-Rajeev
Collaborator Author

> Hi @GautamR-Samagra! The cookbook you linked is on Medium, and it is only available to paid Medium members.

Oh, sorry. Their official documentation is also quite instructive.

@aish7iitkgp

I would like to work on this issue @GautamR-Samagra

@masterismail

masterismail commented Feb 6, 2024

> > Hi @GautamR-Samagra! The cookbook you linked is on Medium, and it is only available to paid Medium members.
>
> Oh, sorry. Their official documentation is also quite instructive.

Yeah! While keeping that handy, I'm currently running an analysis here and am facing challenges with the names of government schemes and the presence of Hinglish. Any recommendations for preprocessing?

@TakshPanchal

Hey @GautamR-Samagra, I was doing EDA on the data here. We could try different models, but I think the embedding model has to be fine-tuned first. So I wondered: is there a bigger corpus of this type of text, where abbreviations are used in an Indian context?

@Gautam-Rajeev
Collaborator Author

Gautam-Rajeev commented Feb 7, 2024

@masterismail @TakshPanchal I have tried to clean up the queries a bit, at least by removing the Odia questions.

I have reshared the dataset here:
predicted_values3.csv

For the short forms and the names of schemes, fertilizers and pesticides, I will need the program team's help to get the word lists. I will update here once I have them.
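
For anyone repeating the Odia clean-up on a fresh export, one possible approach is to drop queries that contain characters from the Unicode Odia block (U+0B00 to U+0B7F). This is only a sketch of that idea, not necessarily how the reshared file was produced, and the function names are made up for illustration.

```python
# Code points of the Unicode Odia (Oriya) block: U+0B00 through U+0B7F.
ODIA_BLOCK = range(0x0B00, 0x0B80)


def contains_odia(text):
    """Return True if any character of `text` is in the Odia script block."""
    return any(ord(char) in ODIA_BLOCK for char in text)


def drop_odia_queries(queries):
    """Keep only the queries that contain no Odia-script characters."""
    return [query for query in queries if not contains_odia(query)]


if __name__ == "__main__":
    sample = ["How do I cultivate paddy?", "\u0b27\u0b3e\u0b28 \u0b1a\u0b3e\u0b37"]
    print(drop_odia_queries(sample))
```

A script check like this cannot catch Odia or Hindi written in Latin letters, so romanized queries would still pass through and need separate handling.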

@emharsha1812

Hello @GautamR-Samagra! If this issue is still open, I would like to work on it.

@Gautam-Rajeev
Collaborator Author

Gautam-Rajeev commented Feb 13, 2024

We have some scheme names and pesticide names:
Schemes 1: Link
Schemes 2: Link

Crop-pesticide mapping:
Expert Committee Recommendations _2021-22.pdf

The names are not laid out in a clean column as we would like, but such is work :)
I'll try parsing them and will share a table version in 2 days.

I tried clustering on my end here. While the smaller clusters come out fairly well formed, the bigger clusters mix scheme (PM-Kisan) queries with paddy pesticide queries, which is not good for us.

Update on my own clustering attempt:
On the overall data, decent small clusters are formed, but the 'anomaly cluster' and the first 1-2 big clusters are bad. They mix scheme and paddy pest data, which is not good.

Also, it looks like all the 'Hinglish' and 'Odinglish' queries somehow ended up in a single cluster for me.
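
To put numbers on the 'anomaly cluster' and the oversized clusters across runs, a small report like the sketch below may help. It assumes topic ids follow the BERTopic/HDBSCAN convention where `-1` marks outliers; `cluster_size_report` is a hypothetical helper. It only measures size imbalance, so whether a cluster mixes scheme and pest queries still has to be judged by reading it.

```python
from collections import Counter


def cluster_size_report(topic_ids, outlier_id=-1, top_n=2):
    """Summarise the outlier share and the share of the `top_n` largest clusters."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("topic_ids is empty")
    # Separate the outlier bucket from the real clusters.
    outliers = counts.pop(outlier_id, 0)
    largest = counts.most_common(top_n)
    return {
        "outlier_share": outliers / total,
        "largest_clusters": [(topic, size / total) for topic, size in largest],
        "n_clusters": len(counts),
    }


if __name__ == "__main__":
    print(cluster_size_report([-1, -1, 0, 0, 0, 0, 1, 1, 2, 3]))
```

Comparing these shares before and after a preprocessing change gives a quick signal of whether the big clusters are actually shrinking.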

In the initial attempt, most clusters formed around crop names: for a given crop (say wheat), all of its questions, whether about cultivation or pests, were clustered together.
The aim, however, is to find the types of questions being asked, for example:

  • I can see y symptoms on the crop x, what should I do
  • How do I cultivate the crop x
  • what is the correct dosage for fertilizer x
  • crop x
  • how should I test my soil

I want to find a finite list of question types like the ones above that covers 95% of queries. We may need a different approach to get there. Any thoughts? @TakshPanchal @masterismail
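
Once each query carries a question-type label, from clustering or from manual tagging, the 95% target can be checked directly. The sketch below is one way to do that; the label names in the example are invented, and `types_needed_for_coverage` is a hypothetical helper.

```python
from collections import Counter


def types_needed_for_coverage(labels, target=0.95):
    """Return the most frequent labels needed to reach `target` coverage, and the coverage reached."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    chosen = []
    covered = 0
    # Add question types from most to least frequent until the target share is reached.
    for label, size in counts.most_common():
        if covered / total >= target:
            break
        chosen.append(label)
        covered += size
    return chosen, covered / total


if __name__ == "__main__":
    example = ["symptom"] * 6 + ["cultivation"] * 3 + ["dosage"]
    print(types_needed_for_coverage(example, target=0.9))
```

If the outlier label is left in `labels`, it is counted like any other type, so it is probably worth filtering it out or reporting it separately.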

In my notebook, I also tried removing all crop names (using a hard-coded list of common crop names, each replaced with 'crop') and re-clustering to surface these question types. This gave somewhat better types, but fertilizer and pest names are still present, and the big, ugly clusters still form.
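
The masking step described above can be sketched as follows. The crop list here is a short illustrative sample rather than the list used in the notebook, and `build_mask_pattern` and `mask_terms` are made-up names.

```python
import re

# Illustrative sample only; the real list of crop names would be longer.
CROP_NAMES = ["paddy", "rice", "wheat", "maize", "cotton"]


def build_mask_pattern(terms):
    """Compile a case-insensitive, whole-word pattern matching any of `terms`."""
    # Longest terms first so multi-word names win over shorter names they contain.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def mask_terms(text, pattern, placeholder="crop"):
    """Replace every match of `pattern` in `text` with `placeholder`."""
    return pattern.sub(placeholder, text)


if __name__ == "__main__":
    crop_pattern = build_mask_pattern(CROP_NAMES)
    print(mask_terms("How do I cultivate Wheat?", crop_pattern))
```

The same two helpers could be reused with a pesticide or fertilizer list and a different placeholder, which targets the names that still leak into the clusters.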

@kartikbhtt7

Hello @GautamR-Samagra,
could you assign this issue to me? I would like to give it a try.
Also, where can I DM you?

@Gautam-Rajeev
Collaborator Author

> Hello @GautamR-Samagra, could you assign this issue to me? I would like to give it a try. Also, where can I DM you?

Discordid: gautam28
gmail: [email protected]

@Gautam-Rajeev
Collaborator Author

Here is a list of common pests and pesticides to remove before clustering.
Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…

@Naveenpoliasetty

Hey @GautamR-Samagra, can I have a try?

@1DevVrat1

Hello @GautamR-Samagra Sir.
I have worked on the above problem and created a Google Colab notebook containing a model that clusters the given queries into topics and writes the result to a new CSV file.
Please let me know the next step I need to take to get the ticket.

@kartikbhtt7

> Here is a list of common pests and pesticides to remove before clustering. Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…

This link redirects back to this issue instead of to a table of pesticides.
I tried clustering with several vectorizer algorithms and also with 'recobo/agri-sentence-transformer', which seems to work better. I will try once more after replacing the pesticide names with a placeholder keyword.
Could you share any ideas on how to collect the common pesticide names? I checked the PDF that was provided, but it contains a lot of pesticide names (119 pages), and extracting them seemed hard.
I have also DM'ed you on Discord; my ID is smokey (smokey_d_scraper).

@Gautam-Rajeev
Collaborator Author

Re-uploading the Excel file; the last one seems to be a broken link. Thanks @kartikbhtt7

[Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx

@Sameer-Pal

@GautamR-Samagra, is this issue still accepting PRs?
Can I work on it?

@aditisingh2912

aditisingh2912 commented Apr 6, 2024

Hi, I want to take up this task on Topic Modelling @GautamR-Samagra

@HasanZaigam

Hello @GautamR-Samagra, could you please assign me this issue? I'll work on it with the best approach and try to fix it as quickly as possible. Thank you.

@Jatayu-u

Jatayu-u commented May 6, 2024

Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 agricultural topics, and then form clusters accordingly.

My approach is to use a large language model like gpt-3.5-turbo for few-shot multi-class classification. I will try this if you let me know.

Thank you

@Gautam-Rajeev
Collaborator Author

Gautam-Rajeev commented May 6, 2024

> Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 agricultural topics, and then form clusters accordingly.
>
> My approach is to use a large language model like gpt-3.5-turbo for few-shot multi-class classification. I will try this if you let me know.
>
> Thank you

This is closed here for now.

@Gautam-Rajeev
Collaborator Author

This issue has been closed by PR #316
