A Simple but Tough-to-Beat Baseline for Sentence Embeddings #157
Comments
Thanks! I have been subscribed to the offconvex blog for quite some time :-)
Hi, thanks for creating this super fast package. I use it a lot. I am trying to use the GloVe embeddings to create sentence representations. My first attempt is just to average the word embeddings per sentence. I can figure it out using other packages like cleanNLP, since the cleanNLP tokenizer provides a sentence id, but I would prefer to stay within the text2vec-verse. Do you think it is possible to average the embeddings per sentence using the current functions in the package? Thanks for your help.
@good-marketing, that's easy with a little bit of linear algebra :-) (however, I will probably create a model for this). Below I assume you already have a document-term matrix `dtm` and a word-vector matrix `word_vectors`:

```r
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight the dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
```

Let me know if the code above is not clear.
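As a side note on the tf-idf remark above, here is a minimal sketch of that re-weighting variant, assuming the same `dtm` and `word_vectors` objects already exist; `TfIdf` is text2vec's tf-idf model, and the other object names are illustrative:

```r
library(text2vec)

# assumed inputs: a sparse document-term matrix `dtm` (documents x terms) and a
# dense word-vector matrix `word_vectors` with terms as rownames
common_terms = intersect(colnames(dtm), rownames(word_vectors))

# re-weight term counts by tf-idf instead of simple "l1" averaging
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm[, common_terms], tfidf)

# each document vector becomes a tf-idf-weighted combination of its word vectors
doc_vectors_tfidf = as.matrix(dtm_tfidf %*% word_vectors[common_terms, ])
```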
Thanks for the prompt answer. I am able to run the code; now I'll try to figure out what to make of it ;-)
Hi Dmitriy, I was looking at the results of the method you mentioned. The resulting sentence_vectors are a matrix of n documents by averaged word vectors. The problem is that I'd like a sentence representation, not a document representation, or am I misinterpreting your solution? One thought I had was to split the documents into sentences and then create a dtm. Each sentence then effectively becomes a document, and I can apply the algebra you posted. I guess the dtm will be a lot sparser; I'm not sure what the effect will be. Do you think this is a 'correct' approach? Thanks for your help.
@good-marketing splitting documents into sentences is the way to go; we just change the level of granularity of the analysis. I think this approach is 100% correct, and I would go the same way myself.
Great, thanks for the super-fast response. Would you recommend tokenize_sentences from tokenizers... just wondering, since you're also a package author there ;-)
Yes, sure, you can use it. tokenizers just wraps the `stringi` package and provides a slightly more convenient interface for tokenization.
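To make the sentence-level pipeline discussed above concrete, here is a rough sketch (not an official text2vec recipe) that splits documents into sentences with `tokenizers::tokenize_sentences()`, treats each sentence as a document, and reuses the averaging algebra from earlier. `word_vectors` is assumed to have been trained beforehand (e.g. with GloVe), and all other names are illustrative:

```r
library(text2vec)
library(tokenizers)

# illustrative input: a character vector of documents
docs = c("First document. It has two sentences.",
         "Second document with a single sentence.")

# split each document into sentences and remember which document each came from
sentences_per_doc = tokenize_sentences(docs)
doc_id = rep(seq_along(docs), lengths(sentences_per_doc))
sentences = unlist(sentences_per_doc)

# build a dtm where every sentence is treated as a "document"
it = itoken(sentences, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)
vocab = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(vocab))

# same algebra as before, assuming `word_vectors` already exists
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_averaged = normalize(dtm[, common_terms], "l1")
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
# rows of sentence_vectors are sentence embeddings; doc_id maps them back to documents
```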
I'll take a shot at it next month, will keep you posted!
Paper: http://104.155.136.4:3000/pdf?id=SyK00v5xx
Blog post: http://www.offconvex.org/2016/02/14/word-embeddings-2/
Looks like an interesting idea.
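For completeness, here is a rough, hedged sketch of the paper's baseline itself: smooth inverse frequency (SIF) weighting followed by removal of the first principal component. This is not an existing text2vec model; the object names follow the snippets above (a sentence-level `dtm` and a `word_vectors` matrix), and `a = 1e-3` is the smoothing constant suggested in the paper:

```r
library(text2vec)
library(Matrix)

# assumed inputs (illustrative names): a sentence-level dtm `dtm` and a
# word-vector matrix `word_vectors`, as constructed above
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_s = dtm[, common_terms]

# unigram probabilities p(w) estimated from the corpus
word_counts = colSums(dtm_s)
p_w = word_counts / sum(word_counts)

# smooth inverse frequency weights a / (a + p(w))
a = 1e-3
sif_weights = a / (a + p_w)

# weighted average of word vectors per sentence: (1 / |s|) * sum count(w) * sif(w) * v_w
sent_len = rowSums(dtm_s)
dtm_weighted = Diagonal(x = 1 / pmax(sent_len, 1)) %*% dtm_s %*% Diagonal(x = sif_weights)
sentence_vectors = as.matrix(dtm_weighted %*% word_vectors[common_terms, ])

# remove the projection onto the first principal component (common component removal)
u1 = svd(sentence_vectors, nu = 0, nv = 1)$v[, 1]
sentence_vectors_sif = sentence_vectors - (sentence_vectors %*% u1) %*% t(u1)
```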