A Simple but Tough-to-Beat Baseline for Sentence Embeddings #157
Comments
Thanks! I have been subscribed to the offconvex blog for quite some time :-)
Hi, thanks for creating this super fast package. I use it a lot. I am trying to use the GloVe embeddings to create sentence representations. My first attempt is just to average the word embeddings per sentence. I can figure it out using other packages like cleanNLP, since the cleanNLP tokenizer provides a sentence id, but I would prefer to stay within the text2vec-verse. Do you think it is possible to average the embeddings per sentence using the current functions in the package? Thanks for your help.
@good-marketing, that's easy with a little bit of linear algebra :-) (however, I will probably create a model for this). Below I assume you already have a document-term matrix `dtm` and a word-vector matrix `word_vectors`:

```r
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight the dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
```

Let me know if the code above is not clear.
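As a side note on the tf-idf remark above, here is a minimal sketch of that re-weighting variant, assuming the same `dtm` and `word_vectors` objects already exist; `TfIdf` is text2vec's tf-idf model, and the other object names are illustrative:

```r
library(text2vec)

# assumed inputs: a sparse document-term matrix `dtm` (documents x terms) and a
# dense word-vector matrix `word_vectors` with terms as rownames
common_terms = intersect(colnames(dtm), rownames(word_vectors))

# re-weight term counts by tf-idf instead of simple "l1" averaging
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm[, common_terms], tfidf)

# each document vector becomes a tf-idf-weighted combination of its word vectors
doc_vectors_tfidf = as.matrix(dtm_tfidf %*% word_vectors[common_terms, ])
```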
Thanks for the prompt answer. I am able to run the code; now I'll try to figure out what to make of it ;-)
Hi Dmitriy, I was looking at the results of the method you mentioned. The resulting sentence_vectors are a matrix of n documents by averaged word vectors. The problem is that I'd like a sentence representation, not a document representation, or am I misinterpreting your solution? One thought I had was to split the documents into sentences and then create a dtm. Each sentence then effectively becomes a document, and I can apply the algebra you posted. I guess the dtm will be a lot sparser; I'm not sure what the effect will be. Do you think this is a 'correct' approach? Thanks for your help.
@good-marketing splitting documents into sentences is the way to go; we just change the level of granularity of the analysis. I think this approach is 100% correct, and I would go the same way myself.
Great, thanks for the super-fast response. Would you recommend tokenize_sentences from tokenizers... just wondering, since you're also a package author there ;-)
Yes, sure, you can use it. tokenizers just wraps the `stringi` package and provides a slightly more convenient interface for tokenization.
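To make the sentence-level pipeline discussed above concrete, here is a rough sketch (not an official text2vec recipe) that splits documents into sentences with `tokenizers::tokenize_sentences()`, treats each sentence as a document, and reuses the averaging algebra from earlier. `word_vectors` is assumed to have been trained beforehand (e.g. with GloVe), and all other names are illustrative:

```r
library(text2vec)
library(tokenizers)

# illustrative input: a character vector of documents
docs = c("First document. It has two sentences.",
         "Second document with a single sentence.")

# split each document into sentences and remember which document each came from
sentences_per_doc = tokenize_sentences(docs)
doc_id = rep(seq_along(docs), lengths(sentences_per_doc))
sentences = unlist(sentences_per_doc)

# build a dtm where every sentence is treated as a "document"
it = itoken(sentences, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)
vocab = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(vocab))

# same algebra as before, assuming `word_vectors` already exists
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_averaged = normalize(dtm[, common_terms], "l1")
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
# rows of sentence_vectors are sentence embeddings; doc_id maps them back to documents
```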
I'll take a shot at it next month, will keep you posted!
Paper: http://104.155.136.4:3000/pdf?id=SyK00v5xx
Blog post: http://www.offconvex.org/2016/02/14/word-embeddings-2/
Looks like an interesting idea.
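For completeness, here is a rough, hedged sketch of the paper's baseline itself: smooth inverse frequency (SIF) weighting followed by removal of the first principal component. This is not an existing text2vec model; the object names follow the snippets above (a sentence-level `dtm` and a `word_vectors` matrix), and `a = 1e-3` is the smoothing constant suggested in the paper:

```r
library(text2vec)
library(Matrix)

# assumed inputs (illustrative names): a sentence-level dtm `dtm` and a
# word-vector matrix `word_vectors`, as constructed above
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_s = dtm[, common_terms]

# unigram probabilities p(w) estimated from the corpus
word_counts = colSums(dtm_s)
p_w = word_counts / sum(word_counts)

# smooth inverse frequency weights a / (a + p(w))
a = 1e-3
sif_weights = a / (a + p_w)

# weighted average of word vectors per sentence: (1 / |s|) * sum count(w) * sif(w) * v_w
sent_len = rowSums(dtm_s)
dtm_weighted = Diagonal(x = 1 / pmax(sent_len, 1)) %*% dtm_s %*% Diagonal(x = sif_weights)
sentence_vectors = as.matrix(dtm_weighted %*% word_vectors[common_terms, ])

# remove the projection onto the first principal component (common component removal)
u1 = svd(sentence_vectors, nu = 0, nv = 1)$v[, 1]
sentence_vectors_sif = sentence_vectors - (sentence_vectors %*% u1) %*% t(u1)
```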