list of improvements #20
I have been thinking about how best we can improve the package. To make it possible to pass a list to the package, we duplicated some of the code to handle both a list and a file. In order to develop the package further, we need to simplify. We need to make two changes:
|
I think I agree with most of what is listed, at least for my situation.
Can we already start on 2? |
I wonder in what situation you need to pass documents via files. Is it possible to load the files in R and pass them as a list? If we choose to pass pre-tokenized documents in a list, we can make the C++ code much simpler. It would become the code base to which we can add more functions. |
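To illustrate the list input being discussed (a hypothetical example, not code from the package): each element of the list is one document, already split into tokens.

    # Hypothetical pre-tokenized input: one character vector of tokens per document.
    docs <- list(
      doc1 = c("the", "machine", "stopped", "after", "the", "error"),
      doc2 = c("the", "pump", "was", "restarted", "manually")
    )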
I have log files containing failure events from part-producing machines, for which I build word2vec models to understand how the events are related. Failure events are appended to the log files.
Anything is possible with R. |
Your analysis of machine logs sounds very interesting. I understand that it is easier to load from files, but you can still load them into R and build a list (see the sketch below). I will write a prototype for list inputs to start. |
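A minimal sketch of that idea, assuming plain-text log files with one failure event per line (the directory name and the whitespace tokenisation rule are illustrative, not from the thread):

    # Read each log file and split every event line into tokens.
    # The result is a list of character vectors, the format discussed above.
    log_files <- list.files("logs", pattern = "\\.log$", full.names = TRUE)
    events <- unlist(lapply(log_files, readLines), use.names = FALSE)
    docs <- strsplit(events, "\\s+")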
Ok. |
Can I delete the function to save or load word vectors in C++? I think it is much easier to do in R. (See word2vec/src/word2vec/lib/word2vec.cpp, lines 88 to 226 in cafe13f.)
|
Sure, as long as you later put the R version back in read.word2vec and write.word2vec, and the resulting .bin files can still be loaded in other software. |
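As a rough illustration of doing this on the R side, here is a minimal sketch of writing an embedding matrix in the classic word2vec .bin layout. This is not the package's actual implementation, and the exact byte layout expected by other software should be verified against read.word2vec and gensim before relying on it.

    # Sketch: write an embedding matrix (words as row names) in the classic
    # word2vec binary layout: a text header "vocab_size dim\n", then per word
    # the token, a space, 'dim' float32 values, and a newline.
    write_w2v_bin <- function(emb, file) {
      con <- file(file, open = "wb")
      on.exit(close(con))
      writeChar(sprintf("%d %d\n", nrow(emb), ncol(emb)), con, eos = NULL)
      for (i in seq_len(nrow(emb))) {
        writeChar(paste0(rownames(emb)[i], " "), con, eos = NULL)
        writeBin(as.numeric(emb[i, ]), con, size = 4)
        writeChar("\n", con, eos = NULL)
      }
    }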
We probably do not need special methods for saving or loading word2vec objects if they have a matrix for the word vectors (the same matrix as |
Is it possible to use word vectors in an object loaded using readRDS()?
object <- readRDS(file_word2vec)
embedding <- object$embedding # dense matrix |
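A short sketch of that round trip, assuming the model object stores its vectors in an $embedding matrix with words as row names, as in the snippet above (the file name is illustrative):

    # Save the fitted model as a plain R object, reload it later and use the
    # dense embedding matrix directly, without any special IO methods.
    saveRDS(mod, "word2vec_model.rds")
    object <- readRDS("word2vec_model.rds")
    embedding <- object$embedding   # rows = words, columns = dimensions
    embedding["people", ]           # vector for a single word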
I have managed to make a version that works on token IDs. I am excited that it is producing nice lists of synonyms, so I want to make it possible to use it in my projects rather soon. Please take a look and tell me if you can incorporate it into your package. I am also happy to keep it in a separate package until we add more functions (it already has enough functions for my main application).
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
library(LSX)
data_corpus_guardian <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
corp <- data_corpus_guardian %>%
#corp <- data_corpus_inaugural %>%
corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_remove(stopwords(), padding = TRUE) %>%
tokens_tolower()
mod <- word2vec:::w2v_train(toks, types(toks), verbose = TRUE, size = 300,
iterations = 5, minWordFreq = 5, threads = 6)
#> trainWords: 5107745
#> totalWords: 9010496
#> frequency.size(): 122158
#> types.size(): 122158
#> Training done
dim(as.matrix(mod))
#> [1] 122158 300
predict(mod, c("people", "american"), type = "nearest")
#> $people
#> term1 term2 similarity rank
#> 1 people australians 0.7796034 1
#> 2 people adults 0.7099509 2
#> 3 people individuals 0.7097512 3
#> 4 people males 0.6976304 4
#> 5 people syrians 0.6957450 5
#> 6 people citizens 0.6932976 6
#> 7 people block-booked 0.6895103 7
#> 8 people people's 0.6870319 8
#> 9 people men 0.6833943 9
#> 10 people europeans 0.6821725 10
#>
#> $american
#> term1 term2 similarity rank
#> 1 american males-whose 0.7378401 1
#> 2 american americans 0.7317790 2
#> 3 american cougars 0.7291586 3
#> 4 american elephants-usually 0.7047268 4
#> 5 american kenly 0.6969478 5
#> 6 american sikwese 0.6918426 6
#> 7 american kenyatta 0.6788632 7
#> 8 american africa-based 0.6756147 8
#> 9 american america's 0.6728815 9
#> 10 american 2,500mw 0.6675329 10
predict(mod, c("good", "bad"), type = "nearest")
#> $good
#> term1 term2 similarity rank
#> 1 good bad 0.7463641 1
#> 2 good decent 0.7173805 2
#> 3 good fantastic 0.7019140 3
#> 4 good wonderful 0.6910989 4
#> 5 good excellent 0.6845824 5
#> 6 good 20.15p 0.6603347 6
#> 7 good helpful 0.6575145 7
#> 8 good odd 0.6529718 8
#> 9 good enltrworrying 0.6509216 9
#> 10 good weighty 0.6508648 10
#>
#> $bad
#> term1 term2 similarity rank
#> 1 bad good 0.7463641 1
#> 2 bad 20.15p 0.7007557 2
#> 3 bad horrible 0.6884315 3
#> 4 bad terrible 0.6868218 4
#> 5 bad enltrfantastic 0.6750041 5
#> 6 bad habitual 0.6748379 6
#> 7 bad weirdos 0.6694773 7
#> 8 bad dread 0.6675339 8
#> 9 bad @nyfed_news 0.6666440 9
#> 10 bad decent 0.6626057 10
Created on 2024-02-12 with reprex v2.1.0 |
Thanks. I've merged it into branch dev-list - https://github.com/bnosac/word2vec/tree/dev-list |
I needed to disable |
I am not sure if I want to develop IO functions for packages that I am not using. I am happy only with the word vectors from |
Ok, I understand. |
If you are fine with the plan, I will make |
I agree with the plan as long as we don't drop the functionality of word2vec::read.word2vec and word2vec::write.word2vec to read in .bin files of word2vec models generated by e.g. gensim or previous versions of the R package word2vec (e.g. the .bin files at https://github.com/bnosac/word2vec/tree/master/inst/models or at https://github.com/bnosac/sentencepiece/tree/master/inst/models). I had a look yesterday at how it reads in these files; it seems logical that we can switch word2vec::read.word2vec to using word2vec:::w2v_read_binary, map these to the new internal structures, and factor out w2vModel_t::save so that it can be called directly on an R matrix. |
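For reference, the kind of round trip that needs to keep working (a sketch; the example .bin file name under inst/models is an assumption, not confirmed in the thread):

    library(word2vec)
    # Read a .bin written by gensim or an earlier version of the package and
    # work with the vectors as a plain matrix.
    path  <- system.file(package = "word2vec", "models", "example.bin")  # assumed file name
    model <- read.word2vec(path, normalize = TRUE)
    emb   <- as.matrix(model)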
- plotting or functionalities from https://github.com/bnosac/textplot
- downstream topic modelling like https://github.com/bnosac/ETM, or as a replacement for SVDs in semi-supervised approaches
- embeddings on sentencepiece / tokenizers.bpe tokenised data
- pretrained models
- further input to torch models
- deeper integration of the similarities as in https://github.com/bnosac/doc2vec or https://koheiw.github.io/LSX