Stop word removal with vectorizer_model #12

zmarkofsky · 2023-02-01T16:20:37Z

Hey, while using the package I noticed there doesn't appear to be any way to remove stop words from the topic model being trained. This causes a problem: if the initial tuning is done with stop words included and they are later removed from the final BERTopic model via the .update_topics() method, the results end up being very different.

I'm unsure where the best place to fit this in would be, but if there were somewhere to pass in a vectorizer_model, similar to how it's done in the BERTopic documentation, I believe it would allow for much more robust topic generation using only the most meaningful words in the docs.

drob-xx · 2023-02-01T16:39:34Z

Maintainer

@zmarkofsky Great question. I'm happy to get into the design considerations I factored into TopicTuner, but my short answer would be to just get the clustering you want, export to BERTopic, and do the stop word removal there. Am I missing something?
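For readers following along, the "do the stop word removal there" step might look roughly like the sketch below. It assumes you already have a fitted BERTopic model and the documents it was fitted on; the helper name `remove_stop_words_after_fit` is hypothetical and not part of TopicTuner or BERTopic. The only library calls relied on are BERTopic's `update_topics(docs, vectorizer_model=...)` and scikit-learn's `CountVectorizer(stop_words="english")`.

```python
def remove_stop_words_after_fit(topic_model, docs, vectorizer_model=None):
    """Recompute topic words on an already fitted BERTopic model.

    Hypothetical helper for illustration. `topic_model` is assumed to be
    a fitted BERTopic instance and `docs` the documents it was fitted on.
    """
    if vectorizer_model is None:
        # Imported lazily so the helper can be defined without scikit-learn.
        from sklearn.feature_extraction.text import CountVectorizer

        # Assumption: English documents, scikit-learn's built-in stop list.
        vectorizer_model = CountVectorizer(stop_words="english")

    # update_topics recalculates the word representation of each topic;
    # it does not re-run the clustering step.
    topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
    return topic_model
```

Typical use would be something like `remove_stop_words_after_fit(my_bertopic_model, docs)` once the clustering parameters have been settled.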

2 replies

zmarkofsky Feb 1, 2023
Author

Yeah, that is basically what I'm currently doing; however, that does lead to the topics shifting dramatically, as expected. I was wondering more whether having stop words factored in during the tuning process would lead to better tuning in the end, since it would be able to cluster the embeddings better. However, I'm no expert on the underlying math of the algorithm, so maybe it wouldn't have much of an effect on finding the ideal parameters.

drob-xx Feb 1, 2023
Maintainer

I'm not sure what you mean by the topics shifting. The clustering should be the same. Could you share the code and point out what you would like to see operate differently?
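To illustrate the distinction being drawn here, below is a toy sketch in plain Python. It is not how BERTopic or TopicTuner work internally: the documents, the cluster labels, the tiny stop word list, and the `top_words` helper are all made up for illustration, and raw word counts stand in for BERTopic's c-TF-IDF weighting. The point is only that the labels are an input to the word-selection step, so filtering stop words changes which words describe each cluster without changing which documents belong to it.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = frozenset({"the", "a", "of", "and", "is", "to", "in"})

docs = [
    "the price of the stock is rising",
    "the stock market and the price of oil",
    "the team is going to win the game",
    "a goal in the game and the team celebrates",
]
# Pretend these labels came from the clustering step (fixed beforehand).
labels = [0, 0, 1, 1]


def top_words(docs, labels, n=3, stop_words=frozenset()):
    """Return the n most frequent words per cluster, skipping stop words."""
    counts = {}
    for doc, label in zip(docs, labels):
        words = [w for w in doc.split() if w not in stop_words]
        counts.setdefault(label, Counter()).update(words)
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}


with_stop_words = top_words(docs, labels)
without_stop_words = top_words(docs, labels, stop_words=STOP_WORDS)

print("labels:", labels)  # never modified by either call
print("with stop words:", with_stop_words)
print("without stop words:", without_stop_words)
```

In this toy setup the topic descriptions differ between the two calls while `labels` stays exactly as it was.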

drob-xx · 2023-06-05T18:36:07Z

Maintainer

Closed due to inactivity.

0 replies
