list of improvements #20

Open · 7 tasks · jwijffels opened this issue Oct 5, 2023 · 20 comments

jwijffels (Contributor) commented Oct 5, 2023

  • allow passing a list of integers instead of tokens to the word2vec function
  • see how to remove the embedding of </s>
  • abandon the file-based approach
  • speed-ups for Xptr's such as quanteda objects, to avoid copying data?
  • other speed improvements
  • progress bar
  • functionality for downstream processing
    - plotting, or functionality in https://github.com/bnosac/textplot
    - downstream topic modelling such as https://github.com/bnosac/ETM, or as a replacement for SVD in semi-supervised settings
    - embeddings on sentencepiece/tokenisers.bpe-tokenised data
    - pretrained models
    - further input to torch models
    - deeper integration of the similarities, as in https://github.com/bnosac/doc2vec or https://koheiw.github.io/LSX

koheiw (Contributor) commented Nov 9, 2023

I have been thinking about how best to improve the package. To make it possible to pass a list to the package, we duplicated some of the code for lists and files. To develop the package further, we need to simplify.

We need to make two changes:

  1. Choose the format of the input data: either a list or a file. A list is fast and easy to handle, but it does not allow users to pass texts larger than the RAM. A file is a lot more complex, but as fast as a list if memory-mapping is used (as in the current version).
  2. Change the representation of the input data (whether in a list or a file) for word2vec.cpp from characters to integer IDs of tokens. This makes the C++ code a lot simpler (we could remove mapper.hpp and wordReader.hpp). To achieve this, we need to tokenize the texts beforehand. If the input data is a file, the integer IDs would be saved to a file. A minimal sketch of this representation is shown below.
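
For illustration, a minimal sketch of the integer-ID representation in base R, using hypothetical example data:

# hypothetical example: pre-tokenized documents mapped to integer IDs
texts <- c("the cat sat", "the dog sat")
toks  <- strsplit(texts, " ")               # pre-tokenized documents
types <- unique(unlist(toks))               # vocabulary: "the" "cat" "sat" "dog"
ids   <- lapply(toks, match, table = types)
# ids is list(c(1L, 2L, 3L), c(1L, 4L, 3L)); the C++ side would only
# see `ids` plus `types`, so mapper.hpp / wordReader.hpp become unnecessary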

jwijffels (Contributor, Author):

I think I agree with most of what is listed, for my situation.

  1. In my use cases, most of the time the data is directly in R; only about 5% of the time is the data in a .txt file, for which I don't need the list-based approach and the text is pre-tokenised. So that means I'll have to either abandon this 5% or work around it by rewriting the text to integers.

  2. This also means that the tokeniser is handled in R or that other software does the tokenisation, and that we need to keep the mapping between integers and tokens when saving the embeddings; we need to store that information.

Can we already start on 2?

koheiw (Contributor) commented Nov 15, 2023

I wonder in what situations you need to pass documents via files. Is it possible to load the files in R and pass them as a list?

If we choose to pass pre-tokenized documents in a list, we can make the C++ code much simpler. It would be the code-base to which we can add more functions.

jwijffels (Contributor, Author) commented Nov 15, 2023

> I wonder in what situations you need to pass documents via files.

I have log files containing failure events that happen on part-producing machines, for which I build word2vec models to understand how these events are related. Failure events are appended to the log file.
In another setup I have specific extractions of Wikipedia, which are parsed externally from R and are just put into word2vec to find the embeddings of specific elements contained in these Wikipedia entries.

> Is it possible to load the files in R and pass them as a list?

Anything is possible with R.

koheiw (Contributor) commented Nov 15, 2023

Your analysis of machine logs sounds very interesting. I understand that it is easier to load from files, but you can still load the data into R and make a list using the stringi::stri_split_* functions, as in the sketch below. The full corpus of Wikipedia might be too large for list input, but you could do it if the corpus is a subset of it. Why don't you try and see how big the list object becomes with these data?
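
A minimal sketch, assuming a plain-text log with one event sequence per line (the file name is hypothetical):

library(stringi)
# read one document per line and split on whitespace
lines <- readLines("machine-failures.log", encoding = "UTF-8")
toks  <- stri_split_regex(lines, "\\s+", omit_empty = TRUE)
format(object.size(toks), units = "MB")  # check how big the list becomes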

I will write a prototype for list inputs to start.

jwijffels (Contributor, Author):

> I will write a prototype for list inputs to start.

Ok.

koheiw (Contributor) commented Nov 23, 2023

Can I delete the functions to save and load word vectors in C++? I think it is much easier to do this in R.

bool w2vModel_t::save(const std::string &_modelFile) const noexcept {
    try {
        // save trained data in the original word2vec format
        // file header
        std::string fileHeader = std::to_string(m_mapSize)
                                 + " "
                                 + std::to_string(m_vectorSize)
                                 + "\n";
        // calc output size
        // header size
        auto outputSize = static_cast<off_t>(fileHeader.length() * sizeof(char));
        for (auto const &i : m_map) {
            // size of (word + space char + vector size + size of carriage return char)
            outputSize += (i.first.length() + 2) * sizeof(char) + m_vectorSize * sizeof(float);
        }
        // write data to the file
        fileMapper_t output(_modelFile, true, outputSize);
        char sp = ' ';
        char cr = '\n';
        off_t offset = 0;
        // write file header
        std::memcpy(reinterpret_cast<void *>(output.data() + offset),
                    fileHeader.data(), fileHeader.length() * sizeof(char));
        offset += fileHeader.length() * sizeof(char);
        // write words and their vectors
        for (auto const &i : m_map) {
            std::memcpy(reinterpret_cast<void *>(output.data() + offset),
                        i.first.data(), i.first.length() * sizeof(char));
            offset += i.first.length() * sizeof(char);
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), &sp, sizeof(char));
            offset += sizeof(char);
            auto shift = m_vectorSize * sizeof(float);
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), i.second.data(), shift);
            offset += shift;
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), &cr, sizeof(char));
            offset += sizeof(char);
        }
        return true;
    } catch (const std::exception &_e) {
        m_errMsg = _e.what();
    } catch (...) {
        m_errMsg = "unknown error";
    }
    return false;
}

bool w2vModel_t::load(const std::string &_modelFile, bool normalize) noexcept {
    try {
        m_map.clear();
        // map model file, an exception will be thrown on an empty file
        fileMapper_t input(_modelFile);
        // parse header
        off_t offset = 0;
        // get number of words
        std::string nwStr;
        char ch = 0;
        while ((ch = (*(input.data() + offset))) != ' ') {
            nwStr += ch;
            if (++offset >= input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
        }
        // get vector size
        offset++; // skip ' ' char
        std::string vsStr;
        while ((ch = (*(input.data() + offset))) != '\n') {
            vsStr += ch;
            if (++offset >= input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
        }
        try {
            m_mapSize = static_cast<std::size_t>(std::stoll(nwStr));
            m_vectorSize = static_cast<uint16_t>(std::stoi(vsStr));
        } catch (...) {
            throw std::runtime_error(wrongFormatErrMsg);
        }
        // get pairs of word and vector
        offset++; // skip last '\n' char
        std::string word;
        for (std::size_t i = 0; i < m_mapSize; ++i) {
            // get word
            word.clear();
            while ((ch = (*(input.data() + offset))) != ' ') {
                if (ch != '\n') {
                    word += ch;
                }
                // move to the next char and check boundaries
                if (++offset >= input.size()) {
                    throw std::runtime_error(wrongFormatErrMsg);
                }
            }
            // skip last ' ' char and check boundaries
            if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float)) > input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
            // get the word's vector
            auto &v = m_map[word];
            v.resize(m_vectorSize);
            std::memcpy(v.data(), input.data() + offset, m_vectorSize * sizeof(float));
            offset += m_vectorSize * sizeof(float); // vector size
            if (normalize) {
                // normalize vector
                float med = 0.0f;
                for (auto const &j : v) {
                    med += j * j;
                }
                if (med <= 0.0f) {
                    throw std::runtime_error("failed to normalize vectors");
                }
                med = std::sqrt(med / v.size());
                for (auto &j : v) {
                    j /= med;
                }
            }
        }
        return true;
    } catch (const std::exception &_e) {
        m_errMsg = _e.what();
    } catch (...) {
        m_errMsg = "model: unknown error";
    }
    return false;
}
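
A minimal sketch of what the save side could look like on the R side, assuming an embedding matrix with the words as row names (write_word2vec_bin is a hypothetical helper, untested against edge cases such as non-ASCII words):

# hypothetical helper: write an embedding matrix in the original word2vec
# binary format (header line, then "<word> " + 4-byte floats + "\n" per word)
write_word2vec_bin <- function(embedding, file) {
  con <- file(file, open = "wb")
  on.exit(close(con))
  # file header: "<number of words> <vector size>\n"
  writeChar(paste(nrow(embedding), ncol(embedding)), con, eos = NULL)
  writeChar("\n", con, eos = NULL)
  for (i in seq_len(nrow(embedding))) {
    writeChar(paste0(rownames(embedding)[i], " "), con, eos = NULL)
    writeBin(as.numeric(embedding[i, ]), con, size = 4, endian = "little")
    writeChar("\n", con, eos = NULL)
  }
  invisible(file)
}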

jwijffels (Contributor, Author):

Sure, as long as you later put the R version back in read.word2vec and write.word2vec, and the resulting .bin files can still be loaded in other software.

koheiw (Contributor) commented Nov 24, 2023

We probably do not need special methods for saving or loading word2vec objects if they have a matrix for the word vectors (the same matrix as as.matrix.word2vec() returns). We can just use saveRDS() or readRDS() in this case.

jwijffels (Contributor, Author):

word2vec::read.word2vec is used by the R package sentencepiece, which uses word2vec::read.wordvectors to read .bin files with trained word2vec embeddings (https://github.com/bnosac/sentencepiece/blob/master/R/bpemb.R#L208) from https://bpemb.h-its.org/, in order to embed a list of sentencepiece-tokenised texts (https://github.com/bnosac/sentencepiece/blob/master/R/bpemb.R#L335).

koheiw (Contributor) commented Nov 26, 2023

Is it possible to use word vectors in an object loaded using readRDS()? For example,

object <- readRDS(file_word2vec)
embedding <- object$embedding # dense matrix

jwijffels (Contributor, Author):

word2vec::read.word2vec and word2vec::write.word2vec are for reading and writing embeddings in the format which other software packages use (e.g. gensim). It should not be the intention to remove the functionality to read or write such files.
You can rewrite the internals of word2vec::read.word2vec and word2vec::write.word2vec in R by writing/reading the raw bytes, if you like, instead of doing that in C++. A sketch of the read side is below.
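
A minimal sketch of the read side in R (read_word2vec_bin is a hypothetical helper; it assumes the header-plus-little-endian-floats layout of the C++ load shown above, ASCII-safe tokens, and does no normalisation):

# hypothetical helper: read a word2vec .bin file into a matrix with the
# words as row names, parsing the raw bytes directly in R
read_word2vec_bin <- function(file) {
  con <- file(file, open = "rb")
  on.exit(close(con))
  # header line: "<number of words> <vector size>\n", read byte by byte
  header <- ""
  repeat {
    ch <- readChar(con, 1, useBytes = TRUE)
    if (ch == "\n") break
    header <- paste0(header, ch)
  }
  header  <- as.integer(strsplit(header, " ")[[1]])
  n_words <- header[1]
  n_dim   <- header[2]
  emb     <- matrix(0, nrow = n_words, ncol = n_dim)
  words   <- character(n_words)
  for (i in seq_len(n_words)) {
    # the word is terminated by a single space, then n_dim 4-byte floats
    word <- character()
    repeat {
      ch <- readChar(con, 1, useBytes = TRUE)
      if (ch == " ") break
      if (ch != "\n") word <- c(word, ch)
    }
    words[i] <- paste(word, collapse = "")
    emb[i, ] <- readBin(con, what = numeric(), n = n_dim, size = 4,
                        endian = "little")
  }
  rownames(emb) <- words
  emb
}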

koheiw (Contributor) commented Feb 12, 2024

I have managed to make a version that works on token IDs. I am excited that it is producing nice lists of synonyms, so I want to make it possible to use it in my projects rather soon.

Please take a look and tell me if you can incorporate it into your package. I am also happy to keep it in a separate package until we add more functions (it already has enough functions for my main application).

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
library(LSX)

data_corpus_guardian <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
corp <- data_corpus_guardian %>% 
#corp <- data_corpus_inaugural %>% 
    corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
    tokens_remove(stopwords(), padding = TRUE) %>% 
    tokens_tolower()

mod <- word2vec:::w2v_train(toks, types(toks), verbose = TRUE, size = 300, 
                             iterations = 5, minWordFreq = 5, threads = 6)
#> trainWords: 5107745
#> totalWords: 9010496
#> frequency.size(): 122158
#> types.size(): 122158
#> Training done
dim(as.matrix(mod))
#> [1] 122158    300
predict(mod, c("people", "american"), type = "nearest")
#> $people
#>     term1        term2 similarity rank
#> 1  people  australians  0.7796034    1
#> 2  people       adults  0.7099509    2
#> 3  people  individuals  0.7097512    3
#> 4  people        males  0.6976304    4
#> 5  people      syrians  0.6957450    5
#> 6  people     citizens  0.6932976    6
#> 7  people block-booked  0.6895103    7
#> 8  people     people's  0.6870319    8
#> 9  people          men  0.6833943    9
#> 10 people    europeans  0.6821725   10
#> 
#> $american
#>       term1             term2 similarity rank
#> 1  american       males-whose  0.7378401    1
#> 2  american         americans  0.7317790    2
#> 3  american           cougars  0.7291586    3
#> 4  american elephants-usually  0.7047268    4
#> 5  american             kenly  0.6969478    5
#> 6  american           sikwese  0.6918426    6
#> 7  american          kenyatta  0.6788632    7
#> 8  american      africa-based  0.6756147    8
#> 9  american         america's  0.6728815    9
#> 10 american           2,500mw  0.6675329   10
predict(mod, c("good", "bad"), type = "nearest")
#> $good
#>    term1         term2 similarity rank
#> 1   good           bad  0.7463641    1
#> 2   good        decent  0.7173805    2
#> 3   good     fantastic  0.7019140    3
#> 4   good     wonderful  0.6910989    4
#> 5   good     excellent  0.6845824    5
#> 6   good        20.15p  0.6603347    6
#> 7   good       helpful  0.6575145    7
#> 8   good           odd  0.6529718    8
#> 9   good enltrworrying  0.6509216    9
#> 10  good       weighty  0.6508648   10
#> 
#> $bad
#>    term1          term2 similarity rank
#> 1    bad           good  0.7463641    1
#> 2    bad         20.15p  0.7007557    2
#> 3    bad       horrible  0.6884315    3
#> 4    bad       terrible  0.6868218    4
#> 5    bad enltrfantastic  0.6750041    5
#> 6    bad       habitual  0.6748379    6
#> 7    bad        weirdos  0.6694773    7
#> 8    bad          dread  0.6675339    8
#> 9    bad    @nyfed_news  0.6666440    9
#> 10   bad         decent  0.6626057   10

Created on 2024-02-12 with reprex v2.1.0

jwijffels (Contributor, Author) commented Feb 19, 2024

Thanks. I've merged it into the branch dev-list: https://github.com/bnosac/word2vec/tree/dev-list
We will have to see how this can be integrated, given that it drops word2vec::read.word2vec and word2vec::write.word2vec.
I've enabled continuous integration with pull request #22.

koheiw (Contributor) commented Feb 19, 2024

I needed to disable w2v_save_model and w2v_load_model to remove the dependency on word2vec/lib/mapper.cpp. It is easier to save models as R objects using saveRDS() and readRDS(). Could you do that development on the R side?

jwijffels (Contributor, Author):

#20 (comment)

koheiw (Contributor) commented Feb 22, 2024

I am not sure that I want to develop IO functions for packages that I am not using. I am happy with just the word vectors from word2vec:::w2v_train, at least for now.

jwijffels (Contributor, Author):

Ok, I understand.
I'll see later whether it's possible to read word2vec models from other software (e.g. gensim) directly from R and map these to the new integer-based internal structures.

koheiw (Contributor) commented Feb 22, 2024

If you are fine with the plan, I will make word2vec.list work in PR #22.

jwijffels (Contributor, Author):

I agree with the plan, as long as we don't drop the functionality of word2vec::read.word2vec and word2vec::write.word2vec to read .bin files of word2vec models generated by e.g. gensim or by previous versions of the R package word2vec (e.g. the .bin files at https://github.com/bnosac/word2vec/tree/master/inst/models or at https://github.com/bnosac/sentencepiece/tree/master/inst/models).

I had a look yesterday at how it reads in these files; it seems logical that we can switch word2vec::read.word2vec to using word2vec:::w2v_read_binary, map the result to the new internal structures, and factor out w2vModel_t::save so that it can be called directly on an R matrix. For reference, see the usage sketch below.
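
For reference, word2vec::read.wordvectors already returns a plain embedding matrix from such a .bin file; a minimal usage sketch (the file path is hypothetical):

library(word2vec)
# read a .bin file (e.g. trained by gensim) into a plain embedding matrix
emb <- read.wordvectors("path/to/model.bin", type = "bin", normalize = TRUE)
dim(emb)
head(rownames(emb))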
