
replacing new line characters #10

Open
ptear opened this issue Aug 20, 2023 · 1 comment

Comments

@ptear commented Aug 20, 2023

Hi Maarten,

I was just wondering what the reason is for following a different procedure for replacing \n characters in the UN dataset versus the Trump dataset (https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227).

I guess it has something to do with the greater length of the UN documents, which come from debates rather than short-form tweets. But what benefit does marking new paragraphs with \p have over just using a space?
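
For concreteness, here is a rough sketch of the two styles I mean (illustrative only, not the actual code from data.py):

```python
# Illustrative sketch of the two newline-handling styles being contrasted.
# Short-form documents (tweets): newlines can simply become spaces.
tweet = "Make America\nGreat Again"
flat = tweet.replace("\n", " ")           # "Make America Great Again"

# Long-form documents (UN speeches): keep paragraph boundaries recoverable
# by replacing them with an explicit marker instead of discarding them.
speech = "Opening remarks.\n\nSecond paragraph of the speech."
marked = speech.replace("\n\n", r" \p ")  # paragraph breaks -> " \p "
marked = marked.replace("\n", " ")        # remaining single newlines -> spaces
```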

Thanks for your efforts on BERTopic.

@MaartenGr (Owner) commented
It has been a while since I wrote that specific code, but I remember there were issues with parsing that particular dataset that required the \n characters to be removed. It might also be related to the length of the documents, since sentence-transformers was used as the backend here.

I should note, though, that BERTopic has improved considerably since this was written. Using BERTopic together with MMR, KeyBERTInspired, or PartOfSpeech generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of these representation models.
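
A minimal sketch of how those representation models plug in (assuming `docs` is your list of documents; the parameter values are just placeholders):

```python
# Swap in an alternative representation model when building the topic model.
from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    PartOfSpeech,
)

# Pick one of the representation models mentioned above.
representation_model = KeyBERTInspired()
# representation_model = MaximalMarginalRelevance(diversity=0.3)
# representation_model = PartOfSpeech("en_core_web_sm")

topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```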

Using a generative LLM is especially interesting, but it does not allow for evaluation with coherence-like measures.
