Problems reproducing the results #2
Hello,
I resolved the problem for the search_snippet dataset. The initial labels are generated by a naive clustering algorithm such as k-means (see commit 037e558).
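For illustration, a minimal sketch of producing such initial labels with k-means on averaged GloVe vectors (the file paths, the 300d dimension, and the cluster count are placeholders, not taken from the repository):

```python
# Minimal sketch: initial labels from k-means on averaged GloVe vectors.
# File paths, the embedding dimension, and the cluster count are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def load_glove(path):
    """Load GloVe vectors into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def average_embed(texts, glove, dim=300):
    """Represent each text by the average of its word vectors."""
    X = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        vecs = [glove[w] for w in text.lower().split() if w in glove]
        if vecs:
            X[i] = np.mean(vecs, axis=0)
    return X

texts = [line.strip() for line in open("data/search_snippets.txt", encoding="utf-8")]
X = average_embed(texts, load_glove("glove.6B.300d.txt"))
initial_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
```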
Thanks for your interest.
Thanks for your reply! Since the initial clustering provided is k-means, while your paper (https://arxiv.org/pdf/2001.11631.pdf) considers Agglomerative Clustering with similarity-distribution-based sparsification, how do I reproduce the HAC_SD_IC approach?
Hello,
It would be nice if I could provide you with the code for HAC_SD_IC or the results of the algorithm, but my code for HAC_SD_IC is written in three different languages. I am sorry for that.
Represent each text by its average GloVe vector (300d), create an n-by-n text similarity matrix, sparsify it using Algorithm 2, then perform HAC on the sparsified matrix to get the clustering labels.
Or, more simply, just run HAC on the n-by-n similarity matrix; that will also give you a competitive result.
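A rough Python sketch of that recipe (the row-wise sparsification rule below is only a stand-in for Algorithm 2 of the paper, and the linkage method and cluster count are placeholders):

```python
# Rough sketch of the recipe above (average vectors -> similarity matrix ->
# sparsify -> HAC). The sparsification rule is a stand-in for Algorithm 2,
# and the linkage method and cluster count are placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(200, 300)      # stand-in for the averaged 300d GloVe vectors
S = cosine_similarity(X)          # n-by-n text similarity matrix

# Stand-in sparsification: per row, keep only similarities above mean + std.
row_cut = S.mean(axis=1, keepdims=True) + S.std(axis=1, keepdims=True)
S_sparse = np.where(S >= row_cut, S, 0.0)
S_sparse = np.maximum(S_sparse, S_sparse.T)   # keep the matrix symmetric

# HAC on the sparsified matrix: turn similarities into distances first.
D = 1.0 - S_sparse
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="average")  # linkage method is a placeholder
labels = fcluster(Z, t=20, criterion="maxclust")            # 20 clusters is a placeholder
```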
I will try to run HAC_SD_IC if I can.
Thanks a lot for your interest.
Hello, I have another question regarding HAC_SD. Should I run HAC_SD on the similarity matrix to generate an initial clustering and only then run the iterative classification to improve it?
Do the reported results consider the entire dataset, or is it split into train/test sets? I mean, did you split train/test sets before the initial clustering and report your results on the test sets?
Yes, you are right: run HAC_SD on the similarity matrix to generate an initial clustering, and only then execute the iterative classification to improve it.
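To make the ordering concrete, a minimal sketch of the second stage on top of the initial HAC_SD labels (the centroid-distance outlier filter, the logistic-regression classifier, and the iteration count are my assumptions, not necessarily the paper's exact choices):

```python
# Minimal sketch: refine initial HAC_SD labels by iterative classification.
# The outlier filter, classifier, and iteration count are assumptions, not
# necessarily the exact choices made in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_classification(X, labels, n_iter=5, keep_fraction=0.9):
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        keep = np.zeros(len(X), dtype=bool)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            centroid = X[idx].mean(axis=0)
            dist = np.linalg.norm(X[idx] - centroid, axis=1)
            # Treat the points closest to their cluster centroid as trusted.
            n_keep = max(1, int(keep_fraction * len(idx)))
            keep[idx[np.argsort(dist)[:n_keep]]] = True
        if len(np.unique(labels[keep])) < 2:
            break                            # nothing left to classify against
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[keep], labels[keep])       # train on the trusted subset
        labels = clf.predict(X)              # re-label every text
    return labels

# Usage: refined_labels = iterative_classification(X, initial_labels)
```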
The reported results are on the entire dataset; there is no train/test split.
Hello, again, thanks a lot for your replies and your willingness to help me. Which criterion did you use to form the clusters from the dendrogram returned by the Ward algorithm?
Hello,
I used method='ward.D2' for the hierarchical clustering, via the fastcluster package in R:
https://cran.r-project.org/web/packages/fastcluster/vignettes/fastcluster.pdf
Did I answer your question?
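For a Python reproduction, to the best of my knowledge the equivalent is Ward linkage on Euclidean distances (R's ward.D2 squares the dissimilarities internally), cutting the dendrogram at the desired number of clusters; the cluster count below is a placeholder:

```python
# Sketch of the ward.D2 setup in Python: fastcluster's "ward" linkage on
# Euclidean distances is, to my knowledge, the equivalent of R's ward.D2.
# The cluster count is a placeholder.
import numpy as np
import fastcluster
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(200, 300)                      # stand-in for averaged GloVe vectors
Z = fastcluster.linkage(pdist(X, metric="euclidean"), method="ward")
labels = fcluster(Z, t=20, criterion="maxclust")  # cut the dendrogram into 20 clusters
```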
Sure, I got it.
Hello,
I'm trying to reproduce the HAC_SD algorithm, and I noticed, while extracting the GloVe embeddings of the texts from the StackOverflow dataset, that some words are misspelled, such as "featureactivated", "oraole", "navgation", etc. How do you deal with these cases? I just ignored these words after removing the stop words, but that may lead to empty strings, which I removed because I didn't know what to do with them.
After running HAC_SD on the average of the GloVe vectors of each non-stop word of the texts, using the ward clusterer from the fastcluster package for Python, I got the following scores:
acc: 0.56935
nmi: 0.49943
Do you have any suggestions for reaching your results? The results reported in "Improving Short Text Clustering by Similarity Matrix Sparsification" (https://dl.acm.org/doi/pdf/10.1145/3209280.3229114?download=true) were:
acc: 0.6480
nmi: 0.5948
Thanks!
Hello,
Sorry for the late reply. You can try to enhance your result (acc: 0.56935, nmi: 0.49943) through iterative classification. Sorry, I cannot answer your question in more detail right now, as I developed this a long time ago.
Can you please tell me which university you are from?
Hello, it's ok ;) By the way, I'm from the University of Campinas (Unicamp).
Hello @rashadulrakib,
First of all, thanks for your work; I'm really interested in it.
The file search_snippets_pred contains labels that are not defined in search_snippets_true_text. Could you please regenerate the file search_snippets_pred and update it?
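To make the mismatch concrete, a quick check along these lines shows labels appearing in one file but not the other (a sketch; I'm assuming the label is the first whitespace-separated token on each line of both files):

```python
# Quick check of the label mismatch (a sketch: assumes the label is the first
# whitespace-separated token on each line of both files).
with open("data/search_snippets_pred", encoding="utf-8") as f:
    pred_labels = {line.split()[0] for line in f if line.strip()}
with open("data/search_snippets_true_text", encoding="utf-8") as f:
    true_labels = {line.split()[0] for line in f if line.strip()}

print("labels only in search_snippets_pred:", sorted(pred_labels - true_labels))
```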
Moreover, I'm having some problems reproducing your reported results.
I've got the scores:
StackOverflow dataset:
acc (%): 68.15, nmi(%): 66.78
Biomedical dataset:
acc (%): 46.77, nmi(%): 38.82
However, the reported results are:
StackOverflow dataset:
acc (%): 78.73±0.17, nmi(%): 73.44±0.35
Biomedical dataset:
acc (%): 47.78±0.51, nmi(%): 41.27±0.36
PS. I'm executing the code using the data you have provided in the /data directory.
Could you please help me to reproduce your results?
Thanks