Problems to reproduce the results #2

Open
gabrielsantosrv opened this issue Apr 13, 2020 · 14 comments

@gabrielsantosrv

gabrielsantosrv commented Apr 13, 2020

Hello @rashadulrakib,
First of all, thanks for your work, I'm really interested in it.

The file search_snippets_pred contains labels that haven't been defined in search_snippets_true_text. Could you please generate the file search_snippets_pred correctly and update it?

Moreover, I'm having trouble reproducing your reported results.
I got the following scores:
StackOverflow dataset:
acc (%): 68.15, nmi(%): 66.78

Biomedical dataset:
acc (%): 46.77, nmi(%): 38.82

However, the reported results are:
StackOverflow dataset:
acc (%): 78.73±0.17, nmi(%): 73.44±0.35

Biomedical dataset:
acc (%): 47.78±0.51, nmi(%): 41.27±0.36

PS. I'm executing the code using the data you have provided in the /data directory.

Could you please help me reproduce your results?
Thanks
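
For readers comparing the acc and nmi figures above, here is a minimal sketch of how the two metrics are commonly computed for clustering output. It is not taken from the repository: the function names are hypothetical, the accuracy search is a brute force over label mappings (only practical for a handful of clusters; a Hungarian-algorithm solver is the usual choice for larger problems), and the NMI uses arithmetic-mean normalization, which is an assumption since the thread does not say which normalization the paper uses. Differences in either choice can shift the reported numbers.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """Best accuracy over one-to-one mappings of cluster ids to class labels."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so surplus clusters can be mapped to "no class".
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    overlap = Counter(zip(pred_labels, true_labels))
    best = 0
    # Brute force: try every injective assignment of clusters to classes.
    for perm in permutations(classes, len(clusters)):
        hits = sum(overlap[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Mutual information divided by the arithmetic mean of the entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy labels for illustration only, not data from the repository.
true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 2, 2, 2]
print(clustering_accuracy(true, pred), nmi(true, pred))
```

Because cluster ids are arbitrary, accuracy is only meaningful after this kind of mapping step; comparing raw predicted ids against the true labels would understate the score.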

@rashadulrakib
Owner

I resolved the problem with the search_snippet dataset.
Simply run main.py.

@rashadulrakib
Owner

rashadulrakib commented Apr 13, 2020 via email

@gabrielsantosrv
Author

gabrielsantosrv commented Apr 13, 2020

Thanks for your reply!

The initial clustering provided here is k-means, but your paper (https://arxiv.org/pdf/2001.11631.pdf) also considers agglomerative clustering and the similarity distribution-based approach. How do I reproduce the HAC_SD_IC approach?

@rashadulrakib
Owner

rashadulrakib commented Apr 13, 2020 via email

@gabrielsantosrv
Author

gabrielsantosrv commented Apr 14, 2020

Hello,
Thanks a lot for your reply, it helped me understand how to run HAC.

I have one more question about HAC_SD. Should I run HAC_SD on the similarity matrix to generate an initial clustering, and only then run the iterative classification to improve it?

@gabrielsantosrv
Author

gabrielsantosrv commented Apr 14, 2020

Do the reported results cover the entire dataset, or is it split into train/test sets? That is, did you split the data into train/test sets before the initial clustering and report your results on the test sets?

@rashadulrakib
Owner

rashadulrakib commented Apr 14, 2020 via email

@rashadulrakib
Owner

rashadulrakib commented Apr 14, 2020 via email

@gabrielsantosrv
Author

Hello,

Thanks again for your replies and your willingness to help.

Which criterion did you use to form the clusters from the dendrogram returned by the Ward algorithm?

@rashadulrakib
Owner

rashadulrakib commented Apr 15, 2020 via email

@gabrielsantosrv
Author

gabrielsantosrv commented Apr 20, 2020

Hello, I used method='ward.D2' for the hierarchical clustering, with the fastcluster package in R: https://cran.r-project.org/web/packages/fastcluster/vignettes/fastcluster.pdf. Did that answer your question?

On Wed, Apr 15, 2020 at 2:56 PM gabrielsantosrv wrote: Hello, Again thanks a lot for your replies and willingness to help me. Which criterion have you used to form the clusters from the dendrogram returned by the ward algorithm?

Sure, I got it.
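
For anyone following along in Python rather than R, the Ward clustering and dendrogram cut discussed above can be sketched as follows. This is an illustration under assumptions, not the author's code: it uses SciPy's `linkage` with `method="ward"`, which to my understanding corresponds to R's `ward.D2` on Euclidean distances, and it cuts the tree with `criterion="maxclust"` to obtain a fixed number of clusters. The cut criterion actually used for the paper is not stated in this thread. The helper name `ward_clusters` and the toy points are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_clusters(embeddings, n_clusters):
    """Ward linkage on Euclidean distances, then cut into at most n_clusters."""
    # Build the dendrogram; each row of Z records one merge.
    Z = linkage(embeddings, method="ward")
    # Cut the dendrogram so that no more than n_clusters flat clusters remain.
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy data: two well-separated blobs standing in for text embeddings.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
])
labels = ward_clusters(points, n_clusters=2)
print(labels)
```

Cutting by cluster count is a natural fit when the number of classes is known in advance, as it is for the benchmark datasets here; a distance-threshold cut would generally give a different partition.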

@gabrielsantosrv
Author

gabrielsantosrv commented Apr 20, 2020

Hello,
I'm trying to reproduce the HAC_SD algorithm. While extracting GloVe embeddings for the texts in the StackOverflow dataset, I noticed that some words are misspelled, such as "featureactivated", "oraole", and "navgation". How do you deal with these cases?

I simply ignored these words after removing the stop words, but that can leave empty strings, which I removed because I didn't know what else to do with them.
I then ran HAC_SD on the average of the GloVe vectors of the non-stop words in each text, using the Ward clusterer from the fastcluster package in Python.
I got the following scores:
acc: 0.56935
nmi: 0.49943

Do you have any suggestions for reaching your results?
The results you reported in "Improving Short Text Clustering by Similarity Matrix Sparsification"
(https://dl.acm.org/doi/pdf/10.1145/3209280.3229114?download=true) were:
acc: 0.6480
nmi: 0.5948

Thanks!
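
The averaging step described in this comment can be sketched as below. This is a minimal illustration of the approach as described (skip stop words and out-of-vocabulary tokens, average what remains), not the author's preprocessing, which the thread does not show. The function names, the whitespace tokenizer, and the tiny two-dimensional vectors are all hypothetical stand-ins for real GloVe vectors. The sketch returns the indices of the texts that produced an embedding so that dropped texts can still be aligned with the true labels at evaluation time.

```python
def average_embedding(text, vectors, stop_words=frozenset()):
    """Mean vector of the known, non-stop-word tokens; None if none remain."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    # Misspelled or unseen tokens are simply skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def embed_corpus(texts, vectors, stop_words=frozenset()):
    """Embed each text, keeping the original index of every text that survives."""
    kept_indices, embeddings = [], []
    for i, text in enumerate(texts):
        emb = average_embedding(text, vectors, stop_words)
        if emb is not None:
            kept_indices.append(i)
            embeddings.append(emb)
    return kept_indices, embeddings


# Made-up vectors for illustration; real GloVe vectors have 50+ dimensions.
toy_vectors = {
    "oracle": [1.0, 0.0],
    "navigation": [0.0, 1.0],
    "bar": [1.0, 1.0],
}
texts = ["oracle navigation bar", "the oraole", "navgation bar"]
kept, embs = embed_corpus(texts, toy_vectors, stop_words={"the"})
print(kept, embs)
```

Whether texts with no usable tokens are dropped or assigned to some cluster changes the denominator of the accuracy, so this choice alone can explain part of a gap between two sets of scores.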

@rashadulrakib
Owner

rashadulrakib commented Apr 21, 2020 via email

@gabrielsantosrv
Author

Hello,

It's ok ;)
I'm just beginning to study short-text clustering, so I would like to reproduce your results as part of a state-of-the-art review. They are quite good, which caught my attention.

By the way, I'm from the University of Campinas (Unicamp)
