Test out BERTopic to get meaningful topic segmentations of a query dataset #291

Gautam-Rajeev · 2024-02-02T12:17:27Z

Goal:

Get an accurate list of topics (about 20 at most) for an agricultural query dataset (around 20k unique queries) using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.

Description

Segregate the given dataset into topics using BERTopic.
The quality of the clusters is difficult to measure; for now it will have to be inspected and verified manually.
Any suggestions for measuring this better are welcome.
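
One low-cost way to put a number on cluster quality, offered here only as a suggestion and not something used in this ticket, is to hand-label a small random sample of queries and compute cluster purity: the share of labelled queries whose manual label matches the majority label of their cluster. The sample data and label names below are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, manual_labels):
    """Share of labelled queries that carry the majority label of their cluster."""
    if len(cluster_ids) != len(manual_labels):
        raise ValueError("cluster_ids and manual_labels must have the same length")
    if not cluster_ids:
        return 0.0
    # Count how often each manual label occurs inside each cluster.
    labels_per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, manual_labels):
        labels_per_cluster[cluster_id][label] += 1
    # For every cluster, keep only the count of its most frequent label.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in labels_per_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical hand-labelled sample: cluster id and manual label per query.
sample_clusters = [0, 0, 0, 1, 1, 2]
sample_labels = ["pest", "pest", "scheme", "scheme", "scheme", "soil"]
purity = cluster_purity(sample_clusters, sample_labels)
print(f"purity = {purity:.2f}")
```

Purity rises trivially as the number of clusters grows, so it is only meaningful when comparing runs with a similar topic count.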

Simple TF-IDF, Topic2vec or LDA can also be used if they form better clusters. Each query is a single-sentence question, not a paragraph.

Implementation Details

It will include the following:

- This will just be a Colab notebook documenting the analysis.
- If the results are good, it will be used as a classification model for similar queries.

Intuitive clusters that may form (usable as initial seeds if required):
- paddy pest management
- paddy seed selection
- how to cultivate ____ crop
- pest management for ____ crop
- best variety of seed for ____ crop
- wheat cultivation practices
- Scheme available from the govt
- wheat management and cultivation
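
As a rough sketch of how the seed clusters above could be written down, the snippet below stores seed keywords per topic and checks what share of queries mention at least one seed word. The keyword lists and sample queries are assumptions made for illustration. BERTopic's guided mode accepts a list of keyword lists through its `seed_topic_list` parameter, which is the shape built on the last line; the model itself is not called here.

```python
import re

# Hypothetical seed keywords per intuitive cluster.
SEED_TOPICS = {
    "pest management": ["pest", "insect", "pesticide"],
    "seed selection": ["seed", "variety"],
    "cultivation": ["cultivate", "cultivation", "sowing"],
    "government schemes": ["scheme", "subsidy", "govt"],
}


def matching_seed_topics(query, seed_topics=SEED_TOPICS):
    """Return the names of seed topics whose keywords occur in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return [name for name, words in seed_topics.items() if tokens & set(words)]


def seed_coverage(queries, seed_topics=SEED_TOPICS):
    """Share of queries that mention at least one seed keyword."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if matching_seed_topics(q, seed_topics))
    return hits / len(queries)


# Hypothetical sample queries.
sample_queries = [
    "Which pesticide works for paddy stem borer?",
    "Best variety of seed for wheat",
    "How should I test my soil",
]
coverage = seed_coverage(sample_queries)

# List-of-lists shape that could be passed as seed_topic_list.
seed_topic_list = list(SEED_TOPICS.values())
```

A low coverage value would suggest the seed list misses whole question types, such as soil testing in this sample.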

Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. Questions and proposed solutions can be posted as comments. Relevant points and the ticket will be assigned to the best PR raised.

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

Python, BERT, ML

Mentor(s)

@GautamR-Samagra

Complexity

Low

c4gt-community-support · 2024-02-02T12:17:43Z

Hi!
Important Details - These following details are helpful for contributors to effectively identify and contribute to tickets.

Sub-Category - Please mention the sub-category if any for the ticket

Please update the ticket

vilol-04 · 2024-02-02T12:29:35Z

I would like to work on the issue @GautamR-Samagra

Gautam-Rajeev · 2024-02-02T12:31:07Z

@vilol-04 Thanks, I have given everyone access to the dataset. Do raise comments or a PR once you get significant results.

masterismail · 2024-02-04T03:35:12Z

Hi @GautamR-Samagra! The cookbook you mentioned is on Medium; it's only available to paid Medium members.

Gautam-Rajeev · 2024-02-05T05:37:30Z

> Hi @GautamR-Samagra! The cookbook you mentioned is on Medium; it's only available to paid Medium members.

Oh sorry, their home documentation is also pretty instructive

aish7iitkgp · 2024-02-05T14:58:25Z

I would like to work on this issue @GautamR-Samagra

masterismail · 2024-02-06T16:55:11Z

> > Hi @GautamR-Samagra! The cookbook you mentioned is on Medium; it's only available to paid Medium members.
>
> Oh sorry, their home documentation is also pretty instructive

Yeah! While keeping that handy, I'm currently conducting an analysis here and facing challenges with the names of government schemes and the presence of Hinglish. Any recommendations for preprocessing?

TakshPanchal · 2024-02-07T06:26:24Z

Hey @GautamR-Samagra, I was doing EDA on the data here. We could try different models, but I think the embedding model has to be fine-tuned first. So I wondered: is there a bigger corpus of this type of text, where abbreviations are used in an Indian context?

Gautam-Rajeev · 2024-02-07T06:47:35Z

@masterismail @TakshPanchal I have tried to clean up the queries a bit, removing the Odia questions at least.

I have reshared the dataset here:
predicted_values3.csv

For the short forms and the names of schemes, fertilizers and pesticides, I will need the program team's help to get those word lists. I will update here once I have them.
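
A minimal sketch of one way to drop queries written in Odia script, assuming a plain Unicode-range check is acceptable (this is not necessarily how the shared dataset was cleaned). Odia characters occupy the Unicode block U+0B00 to U+0B7F. Romanised Hinglish or "Odinglish" text uses Latin letters, so this check does not catch it. The sample queries are made up.

```python
def contains_odia_script(text):
    """True if any character lies in the Unicode Odia block (U+0B00 to U+0B7F)."""
    return any("\u0b00" <= ch <= "\u0b7f" for ch in text)


def drop_odia_queries(queries):
    """Keep only queries that contain no Odia-script characters."""
    return [q for q in queries if not contains_odia_script(q)]


# Hypothetical sample: English, Odia script, and romanised Hinglish.
sample_queries = [
    "How do I control stem borer in paddy?",
    "\u0b27\u0b3e\u0b28 \u0b1a\u0b3e\u0b37",  # Odia-script text (illustrative)
    "PM-Kisan ka paisa kab aayega",  # Latin script, so it is kept
]
kept = drop_odia_queries(sample_queries)
print(kept)
```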

emharsha1812 · 2024-02-11T07:01:10Z

Hello @GautamR-Samagra! If this issue is still open, I would like to work on it.

Gautam-Rajeev · 2024-02-13T05:02:37Z

We have some scheme names and pesticide names :
Schemes 1: Link
Schemes 2: Link

Crop-pesticide mapping :
Expert Committee Recommendations _2021-22.pdf

These are not neatly structured as a column of names, as we would like, but such is work :)
I'll try parsing them and share a table version in 2 days.

I tried clustering on my end here. The smaller clusters come out fairly well formed, but the bigger clusters mix scheme (PM-Kisan) and paddy pesticide queries, which is not good for us.

Update on my own clustering attempt:
On the overall data, decent small clusters are formed, but the 'anomaly cluster' and the first 1-2 big clusters are bad. They mix scheme and paddy pest data, which is not good.

Also, it looks like all the 'Hinglish' and 'Odinglish' queries somehow ended up in a single cluster for me.

In the initial attempt, most clusters formed around crop names: for a given crop (say wheat), all questions about cultivation, pests and so on were clustered together.
What we actually want, however, is clusters by the type of question being asked.
Example types:

- I can see y symptoms on the crop x, what should I do
- How do I cultivate the crop x
- what is the correct dosage for fertilizer x
- crop x
- how should I test my soil

I want to find a finite list of question types like those above that covers 95% of queries. Maybe we need to do something else to get there. Any thoughts? @TakshPanchal @masterismail
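
The 95% target can be checked mechanically once every query has a topic id. The sketch below is an illustration on made-up assignments, not the notebook's code: it counts how many of the largest topics are needed to reach a given share of all queries. It treats `-1` as the outlier label, which is the id BERTopic gives to unassigned documents, and counts outliers as uncovered.

```python
from collections import Counter


def topics_needed_for_coverage(topic_ids, target=0.95, outlier_id=-1):
    """Return (chosen topics, coverage reached, whether the target was met)."""
    total = len(topic_ids)
    if total == 0:
        return [], 0.0, False
    # Outliers stay in the denominator but can never count as covered.
    counts = Counter(t for t in topic_ids if t != outlier_id)
    chosen, covered = [], 0
    for topic, size in counts.most_common():  # largest topics first
        if covered / total >= target:
            break
        chosen.append(topic)
        covered += size
    coverage = covered / total
    return chosen, coverage, coverage >= target


# Hypothetical assignments for 100 queries, 3 of them outliers.
topic_ids = [0] * 50 + [1] * 30 + [2] * 12 + [3] * 5 + [-1] * 3
chosen, coverage, reached = topics_needed_for_coverage(topic_ids)
print(chosen, round(coverage, 2), reached)
```

If the target is not met even with every topic included, the outlier cluster is too large for the topic list to be considered complete.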

In my notebook, I also tried removing all crop names (using a hard-coded list of common crop names, each replaced with 'crop') and reclustering to surface these question types. That gave somewhat better types, but fertilizer and pest names are still present, and the big, ugly clusters are still being formed.
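
The crop-name replacement described above might look roughly like the following. This is a sketch with a made-up crop list and helper name, not the code from the notebook. Matching is case-insensitive, restricted to whole words, and tries longer names first so that multi-word crops are replaced as a unit.

```python
import re

# Hypothetical hard-coded list; the real list would be much longer.
CROP_NAMES = ["paddy", "wheat", "maize", "groundnut", "green gram"]


def build_masker(terms, placeholder):
    """Return a function that replaces any of `terms` with `placeholder`."""
    if not terms:
        return lambda text: text
    # Longest names first so "green gram" wins over any shorter overlap.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return lambda text: pattern.sub(placeholder, text)


mask_crops = build_masker(CROP_NAMES, "crop")
masked = mask_crops("How to control pests in Paddy and green gram?")
print(masked)
```

The same helper could be given a fertilizer or pesticide word list with a different placeholder, which would address the names that remain after crop masking.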

kartikbhtt7 · 2024-03-16T06:45:46Z

Hello @GautamR-Samagra,
could you assign this issue to me? I would like to give it a try.
Also, where can I DM you?

Gautam-Rajeev · 2024-03-16T13:46:25Z

> Hello @GautamR-Samagra, could you assign this issue to me? I would like to give it a try. Also, where can I DM you?

Discordid: gautam28
gmail: [email protected]

Gautam-Rajeev · 2024-03-16T13:54:03Z

Here is a list of common pests and pesticides to remove before clustering.
Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…

Naveenpoliasetty · 2024-03-17T13:03:01Z

Hey @GautamR-Samagra, can I have a try?

1DevVrat1 · 2024-03-21T09:53:30Z

Hello @GautamR-Samagra Sir,
I have worked on the above problem and created a Google Colab notebook containing a model that clusters the given queries into topics and writes the result to a new CSV file.
Please let me know the next step I need to take in order to get the ticket.

kartikbhtt7 · 2024-03-22T21:06:16Z

> Here is a list of common pest, pesticides to remove before clustering. Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…

This link redirects back to this issue instead of to a table of pesticides.
I tried clustering with several vectorizers and also with the 'recobo/agri-sentence-transformer' model, which seems to work better. I will try once more after replacing the pesticide names with a placeholder keyword.
Could you share any ideas on how to collect the common pesticide names? I checked the PDF that was provided, but it contains a lot of pesticide names (119 pages), and extracting them seemed hard.
I have also DMed you on Discord; my ID is smokey (smokey_d_scraper).

Gautam-Rajeev · 2024-03-24T06:46:23Z

Reuploading the Excel file; the last one seems to have been a broken link. Thanks @kartikbhtt7

[Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx

Sameer-Pal · 2024-03-29T07:08:13Z

@GautamR-Samagra, is this issue still accepting PRs?
Can I work on it?

aditisingh2912 · 2024-04-06T19:58:00Z

Hi, I want to take up this task on Topic Modelling @GautamR-Samagra

HasanZaigam · 2024-04-09T10:27:31Z

Hello @GautamR-Samagra, could you please assign me this issue? I'll work on it with the best approach and try to fix it as quickly as possible. Thank you.

Jatayu-u · 2024-05-06T06:52:30Z

Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 agricultural topics and then form clusters accordingly.

My approach is to use a large language model such as gpt-3.5-turbo for few-shot multi-class classification. I will try this if you give me the go-ahead.

Thank you
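
For illustration, the few-shot classification idea above could start from a prompt builder like the one below. The label set, example queries and function names are hypothetical. No model API is called, so the sketch only shows how the prompt is assembled and how a reply might be mapped back to a known label.

```python
# Hypothetical topic labels and few-shot examples.
TOPIC_LABELS = [
    "pest management",
    "seed selection",
    "cultivation practices",
    "government schemes",
    "soil testing",
]
FEW_SHOT_EXAMPLES = [
    ("How do I control stem borer in paddy?", "pest management"),
    ("Which wheat variety gives the best yield?", "seed selection"),
    ("How do I apply for PM-Kisan?", "government schemes"),
]


def build_few_shot_prompt(query, labels=TOPIC_LABELS, examples=FEW_SHOT_EXAMPLES):
    """Assemble a plain-text few-shot classification prompt."""
    lines = ["Classify the farmer query into exactly one of these topics:"]
    lines += [f"- {label}" for label in labels]
    lines.append("")
    for example_query, example_label in examples:
        lines += [f"Query: {example_query}", f"Topic: {example_label}", ""]
    lines += [f"Query: {query}", "Topic:"]
    return "\n".join(lines)


def parse_label(reply, labels=TOPIC_LABELS):
    """Map a model reply to a known label, or None if it is not in the list."""
    cleaned = reply.strip().strip(".").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


prompt = build_few_shot_prompt("How should I test my soil?")
print(prompt)
```

Returning `None` for unrecognised replies keeps off-list answers visible instead of silently forcing them into a topic.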

Gautam-Rajeev · 2024-05-06T07:15:29Z

> Hello @GautamR-Samagra, Is this issue still open? I want to work on it. My understanding of the problem is that we have to classify the question into 20 different agricultural topics then we can form clusters according to it.
>
> My approach is to use a large language model like gpt-3.5-turbo for the multi-class classification - few shot. I will try achieving this if you will let me know.
>
> Thank you

This is closed here for now.

Gautam-Rajeev · 2024-05-20T10:24:26Z

This issue has been closed by PR #316

Gautam-Rajeev added ai C4GT Community labels Feb 2, 2024

Gautam-Rajeev mentioned this issue Feb 13, 2024: Improving NER #294 (Closed)

Gautam-Rajeev assigned Naveenpoliasetty and kartikbhtt7 Mar 20, 2024

pmukesh31 mentioned this issue Apr 14, 2024: Implemented BERTopic Model for Accurate Topic Segmentation in Agriculture Dataset issue#291 #310 (Open)

Gautam-Rajeev unassigned Naveenpoliasetty Apr 19, 2024

kartikbhtt7 mentioned this issue Apr 25, 2024: Implemented BERTopic Model for topic segmentation #316 (Merged)

Gautam-Rajeev closed this as completed May 6, 2024
