Hi Maarten,

I was just wondering what the reason is for following a different procedure for replacing \n characters with the UN dataset versus the Trump dataset: https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227.

I guess it has something to do with the greater length of the UN documents, which come from debates as opposed to short-form tweets. But what benefit does indicating new paragraphs with \p have compared to just a space?

Thanks for your efforts on BERTopic.
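The two newline-handling strategies under discussion can be sketched roughly as follows. The function names and exact replacement rules here are my illustration of the general pattern, not the code actually used in data.py:

```python
# Sketch of the two newline-handling strategies (illustrative only).

def preprocess_tweets(text: str) -> str:
    """Short documents: newlines carry little structure, so collapse them to spaces."""
    return " ".join(text.split("\n"))

def preprocess_speeches(text: str) -> str:
    """Long documents: keep paragraph boundaries visible as an explicit \\p token,
    so the boundary survives preprocessing instead of dissolving into whitespace."""
    return " \\p ".join(part.strip() for part in text.split("\n") if part.strip())
```

For example, `preprocess_tweets("a\nb")` yields `"a b"`, while `preprocess_speeches` keeps a paragraph marker between the joined parts.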
It has been a while since I created that specific code, but I remember there were issues with parsing that particular dataset, which required removing the \n characters. It might also indeed be related to the length of the documents, since sentence-transformers was used as the backend here.
I should note though that BERTopic has improved considerably since this was written. Using BERTopic together with MMR, KeyBERTInspired, or PartOfSpeech generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of the above representation models.
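As a rough configuration sketch of the suggestion above (it assumes a recent BERTopic install and is not the original evaluation setup):

```python
# Configuration sketch: chaining representation models in BERTopic.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Passing a list chains the models: KeyBERTInspired reranks topic words by
# semantic similarity, then MMR diversifies them.
representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]
topic_model = BERTopic(representation_model=representation_model)

# topics, probs = topic_model.fit_transform(docs)
```

PartOfSpeech can be swapped in or added to the chain in the same way.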
Using a generative LLM is especially interesting/fun, but it does not allow for evaluation with coherence-like measures.