Problems to reproduce the results #2

gabrielsantosrv · 2020-04-13T14:14:37Z

Hello @rashadulrakib,
First of all, thanks for your work, I'm really interested in it.

The file search_snippets_pred contains labels that haven't been defined in search_snippets_true_text. Could you please generate the file search_snippets_pred correctly and update it?
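
For anyone hitting the same mismatch, a quick sanity check is to compare the two label sets before running the pipeline. This is only a sketch: the on-disk layout of `search_snippets_pred` and `search_snippets_true_text` is not described in this thread, so the hypothetical helper `undefined_labels` works on labels that have already been loaded into lists.

```python
def undefined_labels(true_labels, pred_labels):
    """Return the sorted predicted labels that never occur among the true labels."""
    return sorted(set(pred_labels) - set(true_labels))


# Toy data (made up for illustration, not taken from the repository files).
true_labels = ["1", "2", "3", "1"]
pred_labels = ["1", "9", "3", "2"]

# Any label listed here exists in the predictions but not in the ground truth.
print(undefined_labels(true_labels, pred_labels))  # prints ['9']
```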

Moreover, I'm having trouble reproducing your reported results.
I got these scores:
StackOverflow dataset:
acc (%): 68.15, nmi(%): 66.78

Biomedical dataset:
acc (%): 46.77, nmi(%): 38.82

However, the reported results are:
StackOverflow dataset:
acc (%): 78.73±0.17, nmi(%): 73.44±0.35

Biomedical dataset:
acc (%): 47.78±0.51, nmi(%): 41.27±0.36

PS. I'm executing the code using the data you have provided in the /data directory.

Could you, please, help me to reproduce your results?
Thanks
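
When comparing scores like the ones above, it helps to be sure both sides compute them the same way. Below is a minimal sketch of clustering accuracy (best one-to-one matching of cluster IDs to classes via the Hungarian algorithm) and NMI. The function names are hypothetical, and the arithmetic-mean normalisation of NMI is an assumption: the thread does not say which variant the repository or the paper uses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency(y_true, y_pred):
    """Count matrix: rows are predicted clusters, columns are true classes."""
    _, t = np.unique(y_true, return_inverse=True)
    _, p = np.unique(y_pred, return_inverse=True)
    counts = np.zeros((p.max() + 1, t.max() + 1))
    np.add.at(counts, (p, t), 1)
    return counts


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    counts = contingency(y_true, y_pred)
    # Hungarian algorithm, maximising the number of correctly matched items.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)


def nmi(y_true, y_pred):
    """Normalised mutual information (arithmetic-mean normalisation, assumed)."""
    joint = contingency(y_true, y_pred)
    joint = joint / joint.sum()
    p_pred = joint.sum(axis=1)
    p_true = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_pred, p_true)[nz]))
    h_pred = -np.sum(p_pred[p_pred > 0] * np.log(p_pred[p_pred > 0]))
    h_true = -np.sum(p_true[p_true > 0] * np.log(p_true[p_true > 0]))
    denom = (h_pred + h_true) / 2
    return mi / denom if denom > 0 else 1.0


# Cluster IDs are arbitrary, so a relabelled perfect clustering scores 1.0.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

A different NMI normalisation (for example geometric mean or max entropy) would shift the reported number slightly, which is worth ruling out before looking for larger causes of a gap.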

rashadulrakib · 2020-04-13T16:35:21Z

I resolved the problem for the search_snippet dataset.
Simply run main.py.

rashadulrakib · 2020-04-13T16:36:52Z

Hello, I resolved the problem for the search_snippet dataset. The initial labels are generated by a naive clustering algorithm such as k-means. <037e558> Thanks for your interest.

gabrielsantosrv · 2020-04-13T17:10:13Z

Thanks for your reply!

Since the initial clustering provided comes from k-means, while your paper https://arxiv.org/pdf/2001.11631.pdf also considers Agglomerative Clustering and the similarity distribution-based approach, how do I reproduce the HAC_SD_IC approach?

rashadulrakib · 2020-04-13T17:41:15Z

Hello, it would be nice if I could provide you with the code for HAC_SD_IC or the results of the algorithm. Unfortunately, my code for HAC_SD_IC is spread across three different languages; I am sorry for that. Represent each text by the average of its GloVe (300d) word vectors. Create an n-by-n text similarity matrix, then sparsify it using Algorithm 2. Perform HAC on the sparsified matrix and get the clustering labels. Or simply run HAC on the full n-by-n similarity matrix; it will also give you competitive results. I will try to run HAC_SD_IC if I can. Thanks a lot for your interest.
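
The simpler variant described above (HAC directly on the n-by-n similarity matrix, without sparsification) can be sketched roughly as follows. This is not the author's code. The tiny 3-dimensional `embeddings` dictionary stands in for GloVe 300d vectors, the function names are hypothetical, cosine similarity is an assumed choice of similarity, and treating each row of the similarity matrix as a feature vector for Ward linkage is one possible reading of "run HAC on the similarity matrix". Algorithm 2 from the paper is deliberately left out.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def average_vectors(texts, embeddings, dim):
    """Represent each text by the mean of its known word vectors."""
    rows = []
    for text in texts:
        vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
        # Texts with no known word fall back to a zero vector (an assumption).
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)


def cosine_similarity_matrix(X):
    """n-by-n cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def hac_on_similarity(sim, n_clusters):
    """Ward HAC where each row of the similarity matrix is a feature vector."""
    tree = linkage(sim, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy stand-in for GloVe: made-up 3-d vectors, for illustration only.
embeddings = {
    "java": np.array([1.0, 0.0, 0.0]),
    "python": np.array([0.9, 0.1, 0.0]),
    "gene": np.array([0.0, 1.0, 0.0]),
    "protein": np.array([0.0, 0.9, 0.1]),
}
texts = ["java python", "python java java", "gene protein", "protein gene gene"]

X = average_vectors(texts, embeddings, dim=3)
sim = cosine_similarity_matrix(X)
labels = hac_on_similarity(sim, n_clusters=2)
print(labels)
```

On real data the similarity matrix is dense and quadratic in the number of texts, so memory is the first thing to check for datasets of tens of thousands of texts.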

gabrielsantosrv · 2020-04-14T22:02:45Z

Hello,
Thanks a lot for your reply, it helped me understand how to run HAC.

I just have another question regarding the HAC_SD. Should I run HAC_SD on the similarity matrix generating an initial clustering and only then execute the iterative classification in order to improve it?

gabrielsantosrv · 2020-04-14T22:19:11Z

Do the reported results consider the entire dataset, or is it split into train/test sets? I mean, did you split the data into train/test sets before the initial clustering and report your results on the test sets?

rashadulrakib · 2020-04-14T23:44:18Z

Yes, you are right: run HAC_SD on the similarity matrix to generate an initial clustering, and only then execute the iterative classification to improve it.

rashadulrakib · 2020-04-14T23:44:44Z

On the entire dataset

gabrielsantosrv · 2020-04-15T17:55:47Z

Hello,

Again thanks a lot for your replies and willingness to help me.

Which criterion have you used to form the clusters from the dendrogram returned by the ward algorithm?

rashadulrakib · 2020-04-15T18:02:07Z

Hello, I used method='ward.D2' for the hierarchical clustering, with the fastcluster package in R: https://cran.r-project.org/web/packages/fastcluster/vignettes/fastcluster.pdf. Does that answer your question?
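
For readers working in Python rather than R, the question of how to form flat clusters from the dendrogram can be illustrated with SciPy. This is a sketch on made-up 2-d points, not the author's setup: to my understanding SciPy's `method="ward"` corresponds to R's `ward.D2`, and cutting by a fixed number of clusters (`criterion="maxclust"`, similar in spirit to R's `cutree(tree, k)`) is an assumption, since the thread does not state which cut was used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up points: two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1], [10.0, 0.0]])
tree = linkage(points, method="ward")

# Option 1: ask for a fixed number of flat clusters.
by_count = fcluster(tree, t=3, criterion="maxclust")

# Option 2: cut the dendrogram at a merge-height threshold.
by_height = fcluster(tree, t=1.0, criterion="distance")

print(by_count, by_height)
```

When the true number of classes is known, as in benchmark datasets, cutting by cluster count is the more common choice because it makes results directly comparable.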

gabrielsantosrv · 2020-04-20T18:35:00Z

Sure, I got it.

gabrielsantosrv · 2020-04-20T19:10:40Z

Hello,
I'm trying to reproduce the HAC_SD algorithm, and while extracting the GloVe embeddings of the texts from the StackOverflow dataset I noticed that some words are misspelled, such as "featureactivated", "oraole", "navgation", etc. How do you deal with these cases?

I just ignored these words after removing the stop words, but that can leave empty strings, which I removed because I didn't know what to do with them.
After running HAC_SD on the average of the GloVe vectors of the non-stop words of each text, using the ward clusterer from the fastcluster package in Python, I got the following scores:
acc: 0.56935
nmi: 0.49943

Do you have any suggestions for reaching your results?
Your results reported in Improving Short Text Clustering by Similarity Matrix Sparsification
(https://dl.acm.org/doi/pdf/10.1145/3209280.3229114?download=true) were
acc: 0.6480
nmi: 0.5948

Thanks!

rashadulrakib · 2020-04-21T16:34:26Z

Hello, sorry for the late reply. You can try to improve your result (acc: 0.56935, nmi: 0.49943) through Iterative Classification. Sorry, I cannot answer your question right now, as I developed this a long time ago. Can you please tell me which university you are from?

gabrielsantosrv · 2020-04-22T15:06:31Z

Hello,

It's ok ;)
I'm just beginning to study short-text clustering, so I would like to reproduce your results as part of a state-of-the-art review; they are quite good, which caught my attention.

By the way, I'm from the University of Campinas (Unicamp).
