Welcome to TopicTuner Discussions! #1
Replies: 6 comments 2 replies
-
@drob-xx Hi, I'm sorry, but as a noob user it's hard for me to understand what this library is doing without clear examples (you don't save the code output from the Colab notebook, for example, and don't provide examples in the docs). Just friendly encouragement, keep up the great work!
-
Errrr... The idea was to just run the Colab notebook - it doesn't take long...
-
I'm currently working on topic models in natural language processing, mostly using BERTopic. What is currently unclear to me is how to automatically adjust the number of topics. Can you give me some advice?
-
Dear DanR
Thank you for your valuable suggestions.
Best wishes,
Shenzxc
DanR ***@***.***> wrote on Monday, 6 March 2023 at 04:07:
@shenzxc <https://github.com/shenzxc> So basically there is no free lunch - there is no such thing as "automatic". When you use BERTopic out of the box it defaults the HDBSCAN min_cluster_size parameter to 10. Because min_samples is not explicitly set, HDBSCAN in turn defaults that value to match min_cluster_size, i.e. 10. These two parameters have a huge effect on how many clusters (which are the basis of the topics) will be formed. There is no magic to these numbers - they had to be set somewhere and Martin determined to set them there.
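To make that concrete, here is a minimal sketch of overriding those two defaults by passing BERTopic your own HDBSCAN model. The dataset, embedding defaults, and parameter values are illustrative assumptions, not recommendations:

```python
# Minimal sketch: set min_cluster_size / min_samples explicitly instead of
# relying on BERTopic's defaults. The values here are arbitrary examples.
from sklearn.datasets import fetch_20newsgroups
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

hdbscan_model = HDBSCAN(min_cluster_size=25, min_samples=5,
                        metric="euclidean", prediction_data=True)

topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```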
It really depends on your corpus, but it is likely that these values will result in many topics. What "many" means depends on the size of the corpus and the relationships between the embedding vectors, but seeing 100-200 topics is not uncommon. Whether this amount works for you or not depends on your corpus and your goals. It may be that you have an idea of how many topics you think *should* be present, or perhaps in looking through the topics produced by the "10" values you will determine what you think is a legitimate number of topics. For this case BERTopic includes a topic reduction method, BERTopic.reduce_topics, which takes a number-of-topics parameter and is equivalent to setting the nr_topics BERTopic parameter. However, in my experience this is not a great option because it uses a centroid approach to determine which topics should be merged: it gets the average vector value for a given topic and then determines whether another topic's centroid should be merged or not.
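For reference, the reduction discussed here looks roughly like this (the target of 20 topics is an arbitrary example, and the exact reduce_topics signature varies a bit between BERTopic versions):

```python
# Centroid-based topic reduction, continuing from the snippet above.
# Option 1: reduce an already-fitted model down to a target count.
topic_model.reduce_topics(docs, nr_topics=20)

# Option 2: equivalently, request the reduction up front via the constructor.
topic_model = BERTopic(hdbscan_model=hdbscan_model, nr_topics=20)
topics, probs = topic_model.fit_transform(docs)
```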
The problem with this approach is twofold. First of all, the entire point of using HDBSCAN is to be able to model topic clusters that are of variable shape - sometimes wildly so. Using a centroid to then determine the merge behavior is not going to respect those shapes. Secondly, the reduce_topics method will wind up classifying many documents as -1, as "outliers" not belonging to any cohesive topic. It all depends on your corpus, but in my experience this whole approach is often highly problematic, which is why I wrote TopicTuner.
I suggest you run through the sample workbook I provided and then experiment with your own data. The goal is to find settings for min_cluster_size and min_samples that produce a good number of clusters and the fewest outliers possible. The best way to do this, in my experience, is to run different tests and evaluate the outcomes - a hand-rolled sketch of that kind of sweep is below.
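This is not TopicTuner's API, just an illustration of the sweep idea; it reuses the docs from the first snippet, and the embedding model, UMAP settings, and parameter grid are assumptions:

```python
# Embed and reduce once, then sweep the two HDBSCAN parameters and compare
# cluster counts against outlier counts.
import numpy as np
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
reduced = UMAP(n_components=5, metric="cosine",
               random_state=42).fit_transform(embeddings)

for mcs in (10, 25, 50, 100):
    for ms in (1, 5, mcs):
        labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(reduced)
        n_clusters = labels.max() + 1          # cluster labels are 0..n-1, outliers are -1
        n_outliers = int(np.sum(labels == -1))
        print(f"min_cluster_size={mcs:>3}  min_samples={ms:>3}  "
              f"-> {n_clusters} clusters, {n_outliers} outliers")
```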
I believe the steps are pretty well spelled out in the existing examples and in the documentation. If that is not the case, then I would appreciate feedback so I can make those better. I hope this addresses your question.
-
Dear DanR
Hello, I have some questions I would like to ask you, and I hope you can help me. My doctoral supervisor has asked me to publish a paper on BERTopic and to write a research proposal.
I am not sure whether the direction I am currently discussing with you (optimizing the number of topics in the topic model) can serve as a research direction for the paper.
If not, could you suggest some BERTopic research directions? Thank you.
Best wishes,
Shenzxc
-
Dear DanR
Thank you for your information. I have gained a lot.
Best wishes,
Shenzxc
DanR ***@***.***> wrote on Wednesday, 15 March 2023 at 01:26:
I think it depends on exactly what the purpose of the research is. Is it really about BERTopic (which is essentially a workflow using a number of different technologies to arrive at a topic model), or is it about using clustering with LLM embeddings, or something else? From my perspective there are some interesting and unanswered questions regarding HDBSCAN clustering and the relationship of BERTopic's topic number reductions and outlier reductions - these are two different issues.
The first has more to do with how HDBSCAN clusterings of the same data relate to each other when different min_samples and min_cluster_size values are applied. Basically, are we getting really different clusterings, or are the clusterings essentially a function of resolution? For example, if we have two different settings, one which produces 10 clusters and the other 100 clusters, is the 100-cluster model simply a more fine-grained version of the first clustering? That is, might cluster 1 from the first model more or less encompass clusters 1-10 of the second, cluster 2 encompass clusters 11-20, and so on? Generally this seems to be what happens, but I think I've seen cases where it doesn't - and that has some significant implications.
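One quick way to probe that nesting question (not TopicTuner functionality, just a sketch; it reuses the reduced embeddings from the sweep above, and the two parameter settings are arbitrary):

```python
# Does the fine clustering nest inside the coarse one? For each fine cluster
# (ignoring -1 outliers), measure the fraction of its documents that land in
# its single most common coarse cluster; values near 1.0 suggest the finer
# model is mostly a higher-resolution version of the coarser one.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix
from hdbscan import HDBSCAN

coarse = HDBSCAN(min_cluster_size=100).fit_predict(reduced)  # few, large clusters
fine = HDBSCAN(min_cluster_size=10).fit_predict(reduced)     # many, small clusters

mask = (fine != -1) & (coarse != -1)
cm = contingency_matrix(fine[mask], coarse[mask])  # rows: fine clusters, cols: coarse
purity = cm.max(axis=1) / cm.sum(axis=1)
print(f"mean nesting purity: {purity.mean():.2f}")
```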
The second issue has to do with the differences between how HDBSCAN creates clusters and how BERTopic then utilizes those clusters, in addition to its own (centroid-based) reduction of the number of topics. Essentially I see these two approaches as orthogonal and not all that useful. There are some similar questions with regard to the outlier reduction - although that is now much richer and more varied due to the recent feature additions.
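The outlier-reduction features mentioned here can be exercised like this (continuing from the earlier snippets; the strategy choice is just an example, and the exact signature may differ slightly between BERTopic versions):

```python
# Reassign -1 "outlier" documents to existing topics using one of BERTopic's
# outlier-reduction strategies, then refresh the topic representations.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)
```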
Hope that helps. Feel free to ask more questions.
-
👋 Welcome!
Use the Issues queue for bugs - use Discussions for questions etc.