list of improvements #20

Open · 7 tasks · jwijffels opened this issue Oct 5, 2023 · 20 comments

jwijffels (Contributor) commented Oct 5, 2023

  • allow passing a list of integers instead of tokens to the word2vec function
  • see how to remove the embedding of </s>
  • abandon the file-based approach
  • speed-ups for Xptr's such as quanteda objects, to avoid copying data?
  • other speed improvements
  • progress bar
  • functionality for downstream processing
    - plotting, or functionality in https://github.com/bnosac/textplot
    - downstream topic modelling such as https://github.com/bnosac/ETM, or as a replacement for SVD in semi-supervised settings
    - embeddings on sentencepiece/tokenisers.bpe-tokenised data
    - pretrained models
    - further input to torch models
    - deeper integration of the similarities, as in https://github.com/bnosac/doc2vec or https://koheiw.github.io/LSX

koheiw (Contributor) commented Nov 9, 2023

I have been thinking about how best to improve the package. To make it possible to pass a list to the package, we duplicated some of the code for lists and files. To develop the package further, we need to simplify.

We need to make two changes:

  1. Choose the format of the input data: either a list or a file. A list is fast and easy to handle, but it does not allow users to pass texts larger than the RAM. A file is a lot more complex, but as fast as a list if memory-mapping is used (as in the current version).
  2. Change the representation of the input data (whether in a list or a file) for word2vec.cpp from characters to integer IDs of tokens. This makes the C++ code a lot simpler (we could remove mapper.hpp and wordReader.hpp). To achieve this, we need to tokenize the texts beforehand. If the input data is a file, the integer IDs would be saved to a file. A minimal sketch of this representation is shown below.
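
For illustration, a minimal sketch of the integer-ID representation in base R, using hypothetical example data:

# hypothetical example: pre-tokenized documents mapped to integer IDs
texts <- c("the cat sat", "the dog sat")
toks  <- strsplit(texts, " ")               # pre-tokenized documents
types <- unique(unlist(toks))               # vocabulary: "the" "cat" "sat" "dog"
ids   <- lapply(toks, match, table = types)
# ids is list(c(1L, 2L, 3L), c(1L, 4L, 3L)); the C++ side would only
# see `ids` plus `types`, so mapper.hpp / wordReader.hpp become unnecessary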

jwijffels (Contributor, Author):

I think I agree with most of what is listed, for my situation.

  1. In my use cases, most of the time the data is directly in R; only about 5% of the time is the data in a .txt file, for which I don't need the list-based approach and the text is pre-tokenised. So that means I'll have to either abandon this 5% or work around it by rewriting the text to integers.

  2. This also means that the tokeniser is handled in R or that other software does the tokenisation, and that we need to keep the mapping between integers and tokens when saving the embeddings; we need to store that information.

Can we already start on 2?

koheiw (Contributor) commented Nov 15, 2023

I wonder in what situations you need to pass documents via files. Is it possible to load the files in R and pass them as a list?

If we choose to pass pre-tokenized documents in a list, we can make the C++ code much simpler. It would be the code-base to which we can add more functions.

jwijffels (Contributor, Author) commented Nov 15, 2023

> I wonder in what situations you need to pass documents via files.

I have log files containing failure events that happen on part-producing machines, for which I build word2vec models to understand how these events are related. Failure events are appended to the log file.
In another setup I have specific extractions of Wikipedia, which are parsed externally from R and are just put into word2vec to find the embeddings of specific elements contained in these Wikipedia entries.

> Is it possible to load the files in R and pass them as a list?

Anything is possible with R.

koheiw (Contributor) commented Nov 15, 2023

Your analysis of machine logs sounds very interesting. I understand that it is easier to load from files, but you can still load the data into R and make a list using the stringi::stri_split_* functions, as in the sketch below. The full corpus of Wikipedia might be too large for list input, but you could do it if the corpus is a subset of it. Why don't you try and see how big the list object becomes with these data?
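
A minimal sketch, assuming a plain-text log with one event sequence per line (the file name is hypothetical):

library(stringi)
# read one document per line and split on whitespace
lines <- readLines("machine-failures.log", encoding = "UTF-8")
toks  <- stri_split_regex(lines, "\\s+", omit_empty = TRUE)
format(object.size(toks), units = "MB")  # check how big the list becomes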

I will write a prototype for list inputs to start.

jwijffels (Contributor, Author):

> I will write a prototype for list inputs to start.

Ok.

koheiw (Contributor) commented Nov 23, 2023

Can I delete the functions to save and load word vectors in C++? I think it is much easier to do this in R.

bool w2vModel_t::save(const std::string &_modelFile) const noexcept {
    try {
        // save trained data in the original word2vec format
        // file header
        std::string fileHeader = std::to_string(m_mapSize)
                                 + " "
                                 + std::to_string(m_vectorSize)
                                 + "\n";
        // calc output size
        // header size
        auto outputSize = static_cast<off_t>(fileHeader.length() * sizeof(char));
        for (auto const &i : m_map) {
            // size of (word + space char + vector size + size of carriage return char)
            outputSize += (i.first.length() + 2) * sizeof(char) + m_vectorSize * sizeof(float);
        }
        // write data to the file
        fileMapper_t output(_modelFile, true, outputSize);
        char sp = ' ';
        char cr = '\n';
        off_t offset = 0;
        // write file header
        std::memcpy(reinterpret_cast<void *>(output.data() + offset),
                    fileHeader.data(), fileHeader.length() * sizeof(char));
        offset += fileHeader.length() * sizeof(char);
        // write words and their vectors
        for (auto const &i : m_map) {
            std::memcpy(reinterpret_cast<void *>(output.data() + offset),
                        i.first.data(), i.first.length() * sizeof(char));
            offset += i.first.length() * sizeof(char);
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), &sp, sizeof(char));
            offset += sizeof(char);
            auto shift = m_vectorSize * sizeof(float);
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), i.second.data(), shift);
            offset += shift;
            std::memcpy(reinterpret_cast<void *>(output.data() + offset), &cr, sizeof(char));
            offset += sizeof(char);
        }
        return true;
    } catch (const std::exception &_e) {
        m_errMsg = _e.what();
    } catch (...) {
        m_errMsg = "unknown error";
    }
    return false;
}

bool w2vModel_t::load(const std::string &_modelFile, bool normalize) noexcept {
    try {
        m_map.clear();
        // map model file, an exception will be thrown on an empty file
        fileMapper_t input(_modelFile);
        // parse header
        off_t offset = 0;
        // get number of words
        std::string nwStr;
        char ch = 0;
        while ((ch = (*(input.data() + offset))) != ' ') {
            nwStr += ch;
            if (++offset >= input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
        }
        // get vector size
        offset++; // skip ' ' char
        std::string vsStr;
        while ((ch = (*(input.data() + offset))) != '\n') {
            vsStr += ch;
            if (++offset >= input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
        }
        try {
            m_mapSize = static_cast<std::size_t>(std::stoll(nwStr));
            m_vectorSize = static_cast<uint16_t>(std::stoi(vsStr));
        } catch (...) {
            throw std::runtime_error(wrongFormatErrMsg);
        }
        // get pairs of word and vector
        offset++; // skip last '\n' char
        std::string word;
        for (std::size_t i = 0; i < m_mapSize; ++i) {
            // get word
            word.clear();
            while ((ch = (*(input.data() + offset))) != ' ') {
                if (ch != '\n') {
                    word += ch;
                }
                // move to the next char and check boundaries
                if (++offset >= input.size()) {
                    throw std::runtime_error(wrongFormatErrMsg);
                }
            }
            // skip last ' ' char and check boundaries
            if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float)) > input.size()) {
                throw std::runtime_error(wrongFormatErrMsg);
            }
            // get the word's vector
            auto &v = m_map[word];
            v.resize(m_vectorSize);
            std::memcpy(v.data(), input.data() + offset, m_vectorSize * sizeof(float));
            offset += m_vectorSize * sizeof(float); // vector size
            if (normalize) {
                // normalize vector
                float med = 0.0f;
                for (auto const &j : v) {
                    med += j * j;
                }
                if (med <= 0.0f) {
                    throw std::runtime_error("failed to normalize vectors");
                }
                med = std::sqrt(med / v.size());
                for (auto &j : v) {
                    j /= med;
                }
            }
        }
        return true;
    } catch (const std::exception &_e) {
        m_errMsg = _e.what();
    } catch (...) {
        m_errMsg = "model: unknown error";
    }
    return false;
}
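
A minimal sketch of what the save side could look like on the R side, assuming an embedding matrix with the words as row names (write_word2vec_bin is a hypothetical helper, untested against edge cases such as non-ASCII words):

# hypothetical helper: write an embedding matrix in the original word2vec
# binary format (header line, then "<word> " + 4-byte floats + "\n" per word)
write_word2vec_bin <- function(embedding, file) {
  con <- file(file, open = "wb")
  on.exit(close(con))
  # file header: "<number of words> <vector size>\n"
  writeChar(paste(nrow(embedding), ncol(embedding)), con, eos = NULL)
  writeChar("\n", con, eos = NULL)
  for (i in seq_len(nrow(embedding))) {
    writeChar(paste0(rownames(embedding)[i], " "), con, eos = NULL)
    writeBin(as.numeric(embedding[i, ]), con, size = 4, endian = "little")
    writeChar("\n", con, eos = NULL)
  }
  invisible(file)
}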

jwijffels (Contributor, Author):

Sure, as long as you later put the R version back in read.word2vec and write.word2vec, and the resulting .bin files can still be loaded in other software.

koheiw (Contributor) commented Nov 24, 2023

We probably do not need special methods for saving or loading word2vec objects if they have a matrix for the word vectors (the same matrix as as.matrix.word2vec() returns). We can just use saveRDS() or readRDS() in this case.

jwijffels (Contributor, Author):

word2vec::read.word2vec is used by the R package sentencepiece, which uses word2vec::read.wordvectors to read .bin files with trained word2vec embeddings (https://github.com/bnosac/sentencepiece/blob/master/R/bpemb.R#L208) from https://bpemb.h-its.org/, in order to embed a list of sentencepiece-tokenised texts (https://github.com/bnosac/sentencepiece/blob/master/R/bpemb.R#L335).

koheiw (Contributor) commented Nov 26, 2023

Is it possible to use word vectors in an object loaded using readRDS()? For example,

object <- readRDS(file_word2vec)
embedding <- object$embedding # dense matrix

jwijffels (Contributor, Author):

word2vec::read.word2vec and word2vec::write.word2vec are for reading and writing embeddings in the format which other software packages use (e.g. gensim). It should not be the intention to remove the functionality to read or write such files.
You can rewrite the internals of word2vec::read.word2vec and word2vec::write.word2vec in R by writing/reading the raw bytes, if you like, instead of doing that in C++. A sketch of the read side is below.
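
A minimal sketch of the read side in R (read_word2vec_bin is a hypothetical helper; it assumes the header-plus-little-endian-floats layout of the C++ load shown above, ASCII-safe tokens, and does no normalisation):

# hypothetical helper: read a word2vec .bin file into a matrix with the
# words as row names, parsing the raw bytes directly in R
read_word2vec_bin <- function(file) {
  con <- file(file, open = "rb")
  on.exit(close(con))
  # header line: "<number of words> <vector size>\n", read byte by byte
  header <- ""
  repeat {
    ch <- readChar(con, 1, useBytes = TRUE)
    if (ch == "\n") break
    header <- paste0(header, ch)
  }
  header  <- as.integer(strsplit(header, " ")[[1]])
  n_words <- header[1]
  n_dim   <- header[2]
  emb     <- matrix(0, nrow = n_words, ncol = n_dim)
  words   <- character(n_words)
  for (i in seq_len(n_words)) {
    # the word is terminated by a single space, then n_dim 4-byte floats
    word <- character()
    repeat {
      ch <- readChar(con, 1, useBytes = TRUE)
      if (ch == " ") break
      if (ch != "\n") word <- c(word, ch)
    }
    words[i] <- paste(word, collapse = "")
    emb[i, ] <- readBin(con, what = numeric(), n = n_dim, size = 4,
                        endian = "little")
  }
  rownames(emb) <- words
  emb
}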

koheiw (Contributor) commented Feb 12, 2024

I have managed to make a version that works on token IDs. I am excited that it is producing nice lists of synonyms, so I want to make it possible to use it in my projects rather soon.

Please take a look and tell me if you can incorporate it into your package. I am also happy to keep it in a separate package until we add more functions (it already has enough functions for my main application).

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
library(LSX)

data_corpus_guardian <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
corp <- data_corpus_guardian %>% 
#corp <- data_corpus_inaugural %>% 
    corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
    tokens_remove(stopwords(), padding = TRUE) %>% 
    tokens_tolower()

mod <- word2vec:::w2v_train(toks, types(toks), verbose = TRUE, size = 300, 
                             iterations = 5, minWordFreq = 5, threads = 6)
#> trainWords: 5107745
#> totalWords: 9010496
#> frequency.size(): 122158
#> types.size(): 122158
#> Training done
dim(as.matrix(mod))
#> [1] 122158    300
predict(mod, c("people", "american"), type = "nearest")
#> $people
#>     term1        term2 similarity rank
#> 1  people  australians  0.7796034    1
#> 2  people       adults  0.7099509    2
#> 3  people  individuals  0.7097512    3
#> 4  people        males  0.6976304    4
#> 5  people      syrians  0.6957450    5
#> 6  people     citizens  0.6932976    6
#> 7  people block-booked  0.6895103    7
#> 8  people     people's  0.6870319    8
#> 9  people          men  0.6833943    9
#> 10 people    europeans  0.6821725   10
#> 
#> $american
#>       term1             term2 similarity rank
#> 1  american       males-whose  0.7378401    1
#> 2  american         americans  0.7317790    2
#> 3  american           cougars  0.7291586    3
#> 4  american elephants-usually  0.7047268    4
#> 5  american             kenly  0.6969478    5
#> 6  american           sikwese  0.6918426    6
#> 7  american          kenyatta  0.6788632    7
#> 8  american      africa-based  0.6756147    8
#> 9  american         america's  0.6728815    9
#> 10 american           2,500mw  0.6675329   10
predict(mod, c("good", "bad"), type = "nearest")
#> $good
#>    term1         term2 similarity rank
#> 1   good           bad  0.7463641    1
#> 2   good        decent  0.7173805    2
#> 3   good     fantastic  0.7019140    3
#> 4   good     wonderful  0.6910989    4
#> 5   good     excellent  0.6845824    5
#> 6   good        20.15p  0.6603347    6
#> 7   good       helpful  0.6575145    7
#> 8   good           odd  0.6529718    8
#> 9   good enltrworrying  0.6509216    9
#> 10  good       weighty  0.6508648   10
#> 
#> $bad
#>    term1          term2 similarity rank
#> 1    bad           good  0.7463641    1
#> 2    bad         20.15p  0.7007557    2
#> 3    bad       horrible  0.6884315    3
#> 4    bad       terrible  0.6868218    4
#> 5    bad enltrfantastic  0.6750041    5
#> 6    bad       habitual  0.6748379    6
#> 7    bad        weirdos  0.6694773    7
#> 8    bad          dread  0.6675339    8
#> 9    bad    @nyfed_news  0.6666440    9
#> 10   bad         decent  0.6626057   10

Created on 2024-02-12 with reprex v2.1.0

jwijffels (Contributor, Author) commented Feb 19, 2024

Thanks. I've merged it into the branch dev-list: https://github.com/bnosac/word2vec/tree/dev-list
We will have to see how this can be integrated, given that it drops word2vec::read.word2vec and word2vec::write.word2vec.
I've enabled continuous integration with pull request #22.

koheiw (Contributor) commented Feb 19, 2024

I needed to disable w2v_save_model and w2v_load_model to remove the dependency on word2vec/lib/mapper.cpp. It is easier to save models as R objects using saveRDS() and readRDS(). Could you do that development on the R side?

jwijffels (Contributor, Author):

#20 (comment)

koheiw (Contributor) commented Feb 22, 2024

I am not sure that I want to develop IO functions for packages that I am not using. I am happy with just the word vectors from word2vec:::w2v_train, at least for now.

jwijffels (Contributor, Author):

Ok, I understand.
I'll see later whether it's possible to read word2vec models from other software (e.g. gensim) directly from R and map these to the new integer-based internal structures.

koheiw (Contributor) commented Feb 22, 2024

If you are fine with the plan, I will make word2vec.list work in PR #22.

jwijffels (Contributor, Author):

I agree with the plan, as long as we don't drop the functionality of word2vec::read.word2vec and word2vec::write.word2vec to read .bin files of word2vec models generated by e.g. gensim or by previous versions of the R package word2vec (e.g. the .bin files at https://github.com/bnosac/word2vec/tree/master/inst/models or at https://github.com/bnosac/sentencepiece/tree/master/inst/models).

I had a look yesterday at how it reads in these files; it seems logical that we can switch word2vec::read.word2vec to using word2vec:::w2v_read_binary, map the result to the new internal structures, and factor out w2vModel_t::save so that it can be called directly on an R matrix. For reference, see the usage sketch below.
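
For reference, word2vec::read.wordvectors already returns a plain embedding matrix from such a .bin file; a minimal usage sketch (the file path is hypothetical):

library(word2vec)
# read a .bin file (e.g. trained by gensim) into a plain embedding matrix
emb <- read.wordvectors("path/to/model.bin", type = "bin", normalize = TRUE)
dim(emb)
head(rownames(emb))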
