Welcome to TopicTuner Discussions! #1
Replies: 6 comments 2 replies
-
@drob-xx Hi, I'm sorry, but as a noob user it's hard for me to understand what this library is doing without clear examples (you don't save the code output from the Colab notebook, for example, and don't provide examples in the docs). Just friendly encouragement, keep up the great work!
-
Errrr... The idea was to just run the Colab notebook - it doesn't take long...
-
I'm currently working on topic models in natural language processing, mostly using BERTopic. What is currently unclear to me is how to automatically adjust the number of topics. Can you give me some advice?
-
Dear DanR
Thank you for your valuable suggestions.
Best wishes,
Shenzxc
DanR ***@***.***> wrote on Monday, 6 March 2023 at 04:07:
@shenzxc <https://github.com/shenzxc> So basically there is no free lunch - there is no such thing as "automatic". When you use BERTopic out of the box it defaults the HDBSCAN min_cluster_size parameter to 10. Because min_samples is not explicitly set, HDBSCAN in turn defaults that value to match min_cluster_size, i.e. 10. These two parameters have a huge effect on how many clusters (which are the basis of the topics) will be formed. There is no magic to these numbers - they had to be set somewhere and Martin determined to set them there.
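To make that concrete, here is a minimal sketch of overriding those two defaults by passing BERTopic your own HDBSCAN model. The dataset, embedding defaults, and parameter values are illustrative assumptions, not recommendations:

```python
# Minimal sketch: set min_cluster_size / min_samples explicitly instead of
# relying on BERTopic's defaults. The values here are arbitrary examples.
from sklearn.datasets import fetch_20newsgroups
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

hdbscan_model = HDBSCAN(min_cluster_size=25, min_samples=5,
                        metric="euclidean", prediction_data=True)

topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```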
It really depends on your corpus, but it is likely that these values will result in many topics. What "many" means depends on the size of the corpus and the relationships between the embedding vectors, but seeing 100-200 topics is not uncommon. Whether this amount works for you or not depends on your corpus and your goals. It may be that you have an idea of how many topics you think *should* be present, or perhaps in looking through the topics produced by the "10" values you will determine what you think is a legitimate number of topics. For this case BERTopic includes a topic reduction method, BERTopic.reduce_topics, which takes a number-of-topics parameter and is equivalent to setting the nr_topics BERTopic parameter. However, in my experience this is not a great option because it uses a centroid approach to determine which topics should be merged: it gets the average vector value for a given topic and then determines whether another topic's centroid should be merged or not.
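For reference, the reduction discussed here looks roughly like this (the target of 20 topics is an arbitrary example, and the exact reduce_topics signature varies a bit between BERTopic versions):

```python
# Centroid-based topic reduction, continuing from the snippet above.
# Option 1: reduce an already-fitted model down to a target count.
topic_model.reduce_topics(docs, nr_topics=20)

# Option 2: equivalently, request the reduction up front via the constructor.
topic_model = BERTopic(hdbscan_model=hdbscan_model, nr_topics=20)
topics, probs = topic_model.fit_transform(docs)
```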
The problem with this approach is twofold. First of all, the entire point of using HDBSCAN is to be able to model topic clusters that are of variable shape - sometimes wildly so. Using a centroid to then determine the merge behavior is not going to respect those shapes. Secondly, the reduce_topics method will wind up classifying many documents as -1, as "outliers" not belonging to any cohesive topic. It all depends on your corpus, but in my experience this whole approach is often highly problematic, which is why I wrote TopicTuner.
I suggest you run through the sample workbook I provided and then experiment with your own data. The goal is to find settings for min_cluster_size and min_samples that produce a good number of clusters and the fewest outliers possible. The best way to do this, in my experience, is to run different tests and evaluate the outcomes - a hand-rolled sketch of that kind of sweep is below.
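This is not TopicTuner's API, just an illustration of the sweep idea; it reuses the docs from the first snippet, and the embedding model, UMAP settings, and parameter grid are assumptions:

```python
# Embed and reduce once, then sweep the two HDBSCAN parameters and compare
# cluster counts against outlier counts.
import numpy as np
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
reduced = UMAP(n_components=5, metric="cosine",
               random_state=42).fit_transform(embeddings)

for mcs in (10, 25, 50, 100):
    for ms in (1, 5, mcs):
        labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(reduced)
        n_clusters = labels.max() + 1          # cluster labels are 0..n-1, outliers are -1
        n_outliers = int(np.sum(labels == -1))
        print(f"min_cluster_size={mcs:>3}  min_samples={ms:>3}  "
              f"-> {n_clusters} clusters, {n_outliers} outliers")
```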
I believe the steps are pretty well spelled out in the existing examples and in the documentation. If that is not the case, then I would appreciate feedback so I can make those better. I hope this addresses your question.
-
Dear DanR
Hello, I have some questions I would like to ask you, and I hope you can help me. My doctoral supervisor has asked me to publish a paper on BERTopic and to write a research proposal.
I am not sure whether the direction I am currently discussing with you (optimizing the number of topics in the topic model) can serve as a research direction for the paper.
If not, could you suggest some BERTopic research directions? Thank you.
Best wishes,
Shenzxc
-
Dear DanR
Thank you for your information. I have gained a lot.
Best wishes,
Shenzxc
DanR ***@***.***> wrote on Wednesday, 15 March 2023 at 01:26:
I think it depends on exactly what the purpose of the research is. Is it really about BERTopic (which is essentially a workflow using a number of different technologies to arrive at a topic model), or is it about using clustering with LLM embeddings, or something else? From my perspective there are some interesting and unanswered questions regarding HDBSCAN clustering and the relationship of BERTopic's topic number reductions and outlier reductions - these are two different issues.
The first has more to do with how HDBSCAN clusterings of the same data relate to each other when different min_samples and min_cluster_size values are applied. Basically, are we getting really different clusterings, or are the clusterings essentially a function of resolution? For example, if we have two different settings, one which produces 10 clusters and the other 100 clusters, is the 100-cluster model simply a more fine-grained version of the first clustering? That is, might cluster 1 from the first model more or less encompass clusters 1-10 of the second, cluster 2 encompass clusters 11-20, and so on? Generally this seems to be what happens, but I think I've seen cases where it doesn't - and that has some significant implications.
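One quick way to probe that nesting question (not TopicTuner functionality, just a sketch; it reuses the reduced embeddings from the sweep above, and the two parameter settings are arbitrary):

```python
# Does the fine clustering nest inside the coarse one? For each fine cluster
# (ignoring -1 outliers), measure the fraction of its documents that land in
# its single most common coarse cluster; values near 1.0 suggest the finer
# model is mostly a higher-resolution version of the coarser one.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix
from hdbscan import HDBSCAN

coarse = HDBSCAN(min_cluster_size=100).fit_predict(reduced)  # few, large clusters
fine = HDBSCAN(min_cluster_size=10).fit_predict(reduced)     # many, small clusters

mask = (fine != -1) & (coarse != -1)
cm = contingency_matrix(fine[mask], coarse[mask])  # rows: fine clusters, cols: coarse
purity = cm.max(axis=1) / cm.sum(axis=1)
print(f"mean nesting purity: {purity.mean():.2f}")
```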
The second issue has to do with the differences between how HDBSCAN creates clusters and how BERTopic then utilizes those clusters, in addition to its own (centroid-based) reduction of the number of topics. Essentially I see these two approaches as orthogonal and not all that useful. There are some similar questions with regard to the outlier reduction - although that is now much richer and more varied due to the recent feature additions.
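The outlier-reduction features mentioned here can be exercised like this (continuing from the earlier snippets; the strategy choice is just an example, and the exact signature may differ slightly between BERTopic versions):

```python
# Reassign -1 "outlier" documents to existing topics using one of BERTopic's
# outlier-reduction strategies, then refresh the topic representations.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)
```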
Hope that helps. Feel free to ask more questions.
-
👋 Welcome!
Use the Issues queue for bugs - use Discussions for questions etc.