
replacing new line characters #10

Open
ptear opened this issue Aug 20, 2023 · 1 comment

Comments

@ptear commented Aug 20, 2023

Hi Maarten,

I was just wondering what the reason is for following a different procedure for replacing \n characters in the UN dataset versus the Trump dataset (https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227).

I guess it has something to do with the greater length of the UN documents, which come from debates rather than short-form tweets. But what benefit does marking new paragraphs with \p have over just using a space?
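
For concreteness, here is a rough sketch of the two styles I mean (illustrative only, not the actual code from data.py):

```python
# Illustrative sketch of the two newline-handling styles being contrasted.
# Short-form documents (tweets): newlines can simply become spaces.
tweet = "Make America\nGreat Again"
flat = tweet.replace("\n", " ")           # "Make America Great Again"

# Long-form documents (UN speeches): keep paragraph boundaries recoverable
# by replacing them with an explicit marker instead of discarding them.
speech = "Opening remarks.\n\nSecond paragraph of the speech."
marked = speech.replace("\n\n", r" \p ")  # paragraph breaks -> " \p "
marked = marked.replace("\n", " ")        # remaining single newlines -> spaces
```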

Thanks for your efforts on BERTopic.

@MaartenGr (Owner) commented
It has been a while since I wrote that specific code, but I remember there were issues with parsing that particular dataset that required the \n characters to be removed. It might also be related to the length of the documents, since sentence-transformers was used as the backend here.

I should note, though, that BERTopic has improved considerably since this was written. Using BERTopic together with MMR, KeyBERTInspired, or PartOfSpeech generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of these representation models.
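
A minimal sketch of how those representation models plug in (assuming `docs` is your list of documents; the parameter values are just placeholders):

```python
# Swap in an alternative representation model when building the topic model.
from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    PartOfSpeech,
)

# Pick one of the representation models mentioned above.
representation_model = KeyBERTInspired()
# representation_model = MaximalMarginalRelevance(diversity=0.3)
# representation_model = PartOfSpeech("en_core_web_sm")

topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```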

Using a generative LLM is especially interesting, but it does not allow for evaluation with coherence-like measures.
