Supporting a list of tokens #14
Hi Kohei. Would love to have that as well. The main obstacle here is rewriting the C++ backend so that the text is passed in directly instead of going through a file (#11).
Glad that you like the idea. It needs a lot of work, but I think it is worth the effort. We should first try to replace the text files with a list of character vectors, and second try to support a list of integer vectors with a vocabulary vector (like quanteda's tokens object). I think we can do it without adding quanteda to the dependencies. The following are the key parts that need changes:

- the training-thread code in word2vec/src/word2vec/lib/trainThread.cpp (lines 88 to 112 in 12b015e)
- the vocabulary construction in word2vec/src/word2vec/lib/vocabulary.hpp (lines 34 to 52 in 12b015e)
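To illustrate the two input formats (quanteda is used here only for the example; the object names are just placeholders):

```r
library(quanteda)

toks <- tokens(c(d1 = "we the people", d2 = "the people rule"))

# format 1: a list of character vectors, one element per document
lis_chr <- as.list(toks)

# format 2: a list of integer vectors plus a vocabulary (types) vector,
# similar to how quanteda stores tokens internally
voc <- types(toks)
lis_int <- lapply(lis_chr, match, voc)
```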
Yes, that's indeed what's needed. The vocabulary needs to be constructed from that token list and plugged into that Worddata, and the parallel training loop somehow has to receive the words.
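Roughly, the information such a vocabulary needs on the R side (the types and their frequencies, with rare types dropped) could be derived from the token list like this; the actual C++ vocabulary object of course holds more than a named count vector:

```r
lis <- list(d1 = c("we", "the", "people"),
            d2 = c("the", "people", "rule"))
min_count <- 1 # placeholder threshold
freq <- sort(table(unlist(lis, use.names = FALSE)), decreasing = TRUE)
voc <- freq[freq >= min_count] # frequency-sorted vocabulary
```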
I started modifying the code to better understand how it works, and noticed that …
I think this is just a tiny detail among all the changes that are needed; we can always work around this.
I am testing on this branch: https://github.com/koheiw/word2vec/commits/test-nextword. The library is complex because it does a lot of basic things in C++. I know how difficult it is to do tokenization and feature selection in C++...
I also tend to do tokenisation outside this library, and then paste the text back together with a single, specific separator character before running word2vec.
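For example, something along these lines (the separator here is simply a space):

```r
lis <- list(c("we", "the", "people"), c("the", "people", "rule"))
# paste each pre-tokenized document back into a single string so the
# existing text interface of word2vec() can be used
txt <- vapply(lis, paste, character(1), collapse = " ")
# mod <- word2vec::word2vec(txt, dim = 50, iter = 5)
```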
The dev-texts branch works without crashing, at least.

```r
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
corp <- data_corpus_inaugural %>%
corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
lis <- as.list(toks)
txt <- stringi::stri_c_list(lis, " ")
mod_lis <- word2vec(lis, dim = 50, iter = 5, min_count = 5,
verbose = TRUE, threads = 4)
predict(mod_lis, c("people", "American"), type = "nearest")
#> $people
#> term1 term2 similarity rank
#> 1 people Nations 0.9756919 1
#> 2 people institutions 0.9723659 2
#> 3 people pursuits 0.9704540 3
#> 4 people parts 0.9704391 4
#> 5 people relations 0.9698635 5
#> 6 people sovereign 0.9691449 6
#> 7 people sections 0.9678864 7
#> 8 people defend 0.9676723 8
#> 9 people against 0.9676545 9
#> 10 people into 0.9676042 10
#>
#> $American
#> term1 term2 similarity rank
#> 1 American righteousness 0.9931667 1
#> 2 American products 0.9915134 2
#> 3 American preservation 0.9904682 3
#> 4 American scientific 0.9904510 4
#> 5 American foundations 0.9897572 5
#> 6 American cultivate 0.9896089 6
#> 7 American industrial 0.9891499 7
#> 8 American amity 0.9889630 8
#> 9 American development 0.9888043 9
#> 10 American prosperity 0.9887399 10
```

I noticed that the progress bar exceeds 100% on the master branch. Do you know why? I am not sure how it works yet...
You got it to work on a tokenlist, magic :) and the embedding similarities look the same - even better :)
That verbose argument never worked with threads > 1. I think the printing was even the culprit of the crashes, if I recall correctly, as the printing was not thread-safe.
Printing from multiple threads is often risky, but I will look into the original code and RcppProgress. So far, there is no performance gain from passing a token list, due to the token-to-ID conversion in the library. The next step will be passing a list of token IDs.
I've pushed some continuous integration.
Yes, the performance gain will merely come from removing the file operations.
Does this also require modifications to the new C++ internals, or can we just do this from R and keep the mapping between tokens and token IDs in R?
My idea is to convert a list of tokens to a list of IDs in this way. If …

```r
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
voc <- unique(x$token)
lis <- lapply(split(x$token, x$sentence_id), match, voc)
head(lis)
#> $`2345`
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 7 14 15 16 17 18 19 12 13 20 21 16
#> [26] 17 18 22 23 24 23 13 7 25 26 27 28 29 23 7 10 30 31 32 33 34 7 35 28 36
#> [51] 7 37 38 39 28 40 41 21
#>
#> $`2346`
#> [1] 42 23 43 44 45 46 47 48 49 50 47 51 52 53 54 55
#>
#> $`2347`
#> [1] 56 57 58 59 12 22 12 23 60 55
#>
#> $`2348`
#> [1] 61 58 62 53 63 27 64 28 65 12 34 66 45 67 68 7 69 28 70 53 71 23 45 72 73
#> [26] 29 7 37 55
#>
#> $`2349`
#> [1] 74 58 47 75 76 12 77 23 78 12 79 80 81 82 83 84 28 85 86 87 88 72 89 90 55
#>
#> $`2350`
#> [1] 91 92 93 94 23 95 49 96 58 81 40 46 55
```

Created on 2023-09-21 with reprex v2.0.2
The method for lists should look like this, so that both lists of characters and lists of integers are supported. If we ask users to tokenize their texts beforehand, we can drop …

```r
word2vec.list <- function(x, vocabulary = NULL, ...) {
  v <- unique(unlist(x, use.names = FALSE))
  if (is.character(v)) {
    if (!is.null(vocabulary))
      stop("vocabulary is not used when x is a list of characters")
    x <- lapply(x, match, v) # fastmatch::fmatch is faster
  } else if (is.numeric(v)) {
    if (is.null(vocabulary) || min(v) < 0 || length(vocabulary) < max(v)) # 0 will be ignored in C++
      stop("vocabulary does not match token index")
    v <- vocabulary
  }
  model <- w2v_train(x, vocabulary = v, ...)
  return(model)
}

# udpipe
x <- subset(brussels_reviews_anno, language == "fr")
lis <- split(x$token, x$sentence_id)
word2vec(lis)
# quanteda
toks <- tokens(corpus(data_corpus_inaugural))
word2vec(toks, types(toks))
```
That's OK for me if you want to create the vocabulary upfront and pass it on to the C++ layer; it does simplify the construction on the C++ side for word2vec.list.
The most important part of the library, as far as changes to the input format are concerned, is this block. This code is executed by individual threads for the texts between … Within the loop, …

You may be surprised how simple the core of the library is. All the other functions and objects are for file access, multi-threading, tokenization, serialization and feature selection. It is an impressive piece of work, but over-complicated. If we remove file access, do tokenization, serialization and feature selection in R (or using other tools), and implement multi-threading using Intel TBB (RcppParallel), we can make a compact package that we can understand and maintain more easily. We can even enhance it. If we use TBB, we only need to wrap the parallel code in …

I understand that backward compatibility is important, but it should be possible to produce near-identical results with a new library if tokenization is performed in the same way.
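For the R side of that split, a rough sketch of what preprocessing outside the library could look like (quanteda is used purely as an example of "other tools", and the trimming threshold is arbitrary):

```r
library(quanteda)
library(word2vec)

# tokenization and feature selection done entirely in R
corp <- corpus_reshape(corpus(data_corpus_inaugural), to = "sentences")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
# keep only types occurring at least 5 times (arbitrary threshold)
feat <- featnames(dfm_trim(dfm(toks), min_termfreq = 5))
toks <- tokens_keep(toks, feat)

# the C++ core would then only receive clean tokens and train the vectors
# (using the list input discussed above)
mod <- word2vec(as.list(toks), dim = 50, iter = 5, min_count = 1)
```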
I have to give some courses on text analytics with R this week. I'll look into the code the week after that, so that we can integrate it and test whether I can make the embedding matrix completely reproducible on a toy problem where tokenisation is only based on a single space and sentences are split on a dot. Once there is 100% reproducibility, we can certainly flesh out more of the library. It's been a while since I looked into the details of the implementation. I thought the library implemented multithreading by reading in parallel from the file; if I understand you correctly, you would prefer to use RcppParallel instead of multithreaded reading from the file.
It is true that a file-based corpus is more memory-efficient than a memory-based corpus. Still, I have been trying to make tokens objects more efficient by keeping the data in C++ as much as possible (avoiding copying large objects between C++ and R) using the XPtr object. I would develop a file-mapped tokens object in the future if memory usage needs to be even lower. Intel TBB (via RcppParallel) can be used for multi-threading here too. We may not be able to train the algorithm on a Wikipedia dump if it needs to be kept in memory, but I doubt the usefulness of such models in applied research. My approach has been to train word vectors on a subject-specific corpus and use them to perform higher-level analysis (e.g. domain-specific sentiment analysis).
I've had a look at the code changes, and I think that if we incorporate the same logic for the vocabulary in the list-based and the file-based approach, as suggested in koheiw#2, I'm fine with the changes, as it gives the exact same embeddings for both approaches.
I've improved the documentation in that pull request. For me this is fine as is; with that additional pull request the embeddings are the same for the two approaches.
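For reference, the kind of check involved is roughly the following (a sketch only; corpus.txt is a hypothetical pre-tokenized file, and identical results assume deterministic settings such as a single thread):

```r
library(word2vec)

txt <- readLines("corpus.txt")            # hypothetical file, one sentence per line
lis <- strsplit(txt, " ", fixed = TRUE)   # the same corpus as a list of tokens

mod_txt <- word2vec(txt, dim = 25, iter = 5, threads = 1)
mod_lis <- word2vec(lis, dim = 25, iter = 5, threads = 1)

emb_txt <- as.matrix(mod_txt)
emb_lis <- as.matrix(mod_lis)
all.equal(emb_txt, emb_lis[rownames(emb_txt), ])
```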
@jwijffels thanks for preparing a PR with nice examples. I thought I found … We should also consider adding proper CI tests for the functions in the package.
I think the most important proof is that the embeddings are 100% the same for the file-based and list-based approaches, which is the case. I also checked that, without the ordering of the vocabulary indicated in the remarks of #18, the embeddings were still the same as with version 0.3.4 of the package. So that is OK. I completely agree there should be more unit tests on the different settings, though. But I wonder what they should look like; eventually these numbers are only useful in further downstream processing.
Yes, I know; I ignored these for now. Changing them would have made the changes too complex to test at this stage. Better to do it in separate steps. Thanks again for all the work 👍
The unit tests would be there to ensure that the word vectors from this version and the next version are the same, and that methods like …
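A sketch of what such a test could look like (structural checks only; a full regression test against the previous release would additionally need deterministic training and a stored reference file, both hypothetical here):

```r
library(testthat)
library(word2vec)

test_that("list input trains a model with the expected shape and methods", {
  lis <- list(c("we", "the", "people"), c("the", "people", "rule"), c("we", "rule"))
  mod <- word2vec(lis, dim = 10, iter = 2, min_count = 1, threads = 1)
  emb <- as.matrix(mod)
  expect_equal(ncol(emb), 10)
  expect_true(all(c("we", "the", "people", "rule") %in% rownames(emb)))
  nn <- predict(mod, "people", type = "nearest", top_n = 2)
  expect_s3_class(nn$people, "data.frame")
  # comparing emb against vectors saved from the previous release,
  # e.g. readRDS("reference_embeddings.rds"), would complete the regression test
})
```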
I can develop a version for a list of token IDs in the coming months. How quickly do you want to make these changes? It would be nice to have shared mid- to long-term milestones.
I've uploaded the current status of the package to CRAN. Hopefully it will land there without problems. Regarding the pace of change: I'm happy to incorporate any elements you see as improvements; I don't have any specific timing in mind.
Changes are on CRAN now. Many thanks again. Looking forward to seeing further improvements. Closing this issue for now.
Hi @jwijffels
Your word2vec looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder if you have a plan to support a list of tokens as an input.
As far as I understand, I need to convert quanteda's tokens object (a list of token IDs) to character strings to pass them to word2vec(). I think it is more efficient to feed texts to the C++ functions without writing them to temporary files. If you like the idea, I think I can contribute.