Embedding

Mine parallel corpora with embeddings

sentence preprocessing

Tokenizes and normalizes all sentences with language-specific scripts, then lowercases them and applies BPE (byte-pair encoding).

./preprocess/prep.sh -l {language} -f {input_file} | gzip > {output_file}
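To make the normalize, tokenize and lowercase steps concrete, here is a minimal Python sketch. It is not the logic of prep.sh: the function names `normalize_sentence` and `write_gzip` are hypothetical, the tokenization is a crude punctuation split rather than a language-specific script, and BPE is left out entirely.

```python
import gzip
import re
import unicodedata


def normalize_sentence(sentence: str) -> str:
    """Toy stand-in for normalization, tokenization and lowercasing (no BPE)."""
    # Unicode normalization so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", sentence)
    # Surround every punctuation mark with spaces so it becomes its own token.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse whitespace runs, then lowercase.
    return " ".join(text.split()).lower()


def write_gzip(sentences, path):
    """Write one preprocessed sentence per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(normalize_sentence(sentence) + "\n")
```

The gzip output mirrors the `| gzip > {output_file}` part of the command above, so later stages can read the result with zcat.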

converting t7

Loads the given t7 files and converts them into a compressed format that can be read as a stream.

cat {list_of_input_files} | python ./read_t7.py

building index

Builds a faiss index from the input stream. -s sets the maximum number of sentences a single index can hold. When that limit is reached, the current index is written to output_folder and a new index is started.

cd index
mkdir -p build && cd build
cmake .. && make -j 5

zcat {embeddings_file}.gz | ./bin/build_index -o {output_folder} -s {index_size}
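The effect of -s can be sketched in a few lines of Python. This only illustrates splitting a stream into fixed-size shards. It does not build faiss indexes, and `shard_stream` is a hypothetical name, not part of build_index.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def shard_stream(
    vectors: Iterable[Sequence[float]], index_size: int
) -> Iterator[List[Sequence[float]]]:
    """Group a stream of vectors into shards of at most index_size items."""
    if index_size <= 0:
        raise ValueError("index_size must be positive")
    stream = iter(vectors)
    while True:
        # Pull up to index_size vectors without loading the whole stream.
        shard = list(islice(stream, index_size))
        if not shard:
            return
        yield shard
```

In this sketch each yielded shard corresponds to one index written to output_folder. The last shard may hold fewer than index_size vectors.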

querying index

Loads the index from index_file and runs a k-nearest-neighbor search for each input vector. -k sets the number of neighbors to return, and -b sets the batch size, i.e. the number of vectors queried at the same time.

cd index
mkdir -p build && cd build
cmake .. && make -j 5

zcat {embeddings_file}.gz | ./bin/query_index -i {index_file} -k {k-best} -b {batch_size} > {output_file}
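As a rough picture of what a batched k-nearest-neighbor query computes, the following pure-Python sketch performs an exhaustive search. It rests on assumptions: squared L2 distance stands in for whatever metric the index uses, the function names are hypothetical, and the output format of query_index is not reproduced.

```python
import heapq
from itertools import islice


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(index_vectors, query, k):
    """Return the k nearest (distance, position) pairs, closest first."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(index_vectors))
    return heapq.nsmallest(k, scored)


def query_in_batches(index_vectors, queries, k, batch_size):
    """Consume queries batch_size at a time and yield one result per query."""
    stream = iter(queries)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        for query in batch:
            yield knn(index_vectors, query, k)
```

A library such as faiss answers the same question far faster on large collections. The exhaustive loop here is only meant to show the roles of k and the batch size.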
