Replies: 3 comments
-
Train for Python150k

The script:
-
Currently running the following command to apply BPE on the vocabulary:

Getting the following error:

Sizes of generated files like
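Since the exact command is elided above, here is only a hedged sketch using the fastBPE Python bindings; the `codes` and `vocab` file names are assumptions:

```python
import fastBPE

# Hypothetical file names: BPE codes and vocabulary learned beforehand.
bpe = fastBPE.fastBPE("codes", "vocab")
print(bpe.apply(["def foo ( self ) :"]))  # apply BPE splits to tokenized code
```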
-
Train

A folder which contains binaries for training:

A command for training:
-
To use TransCoder it is necessary to understand the data. After Python150k preprocessing we get functions:
And their docstrings:
The following TO-DO list describes how to handle this data with TransCoder:

- `.tok` and `.pth` data formats;
- Consider the BigQuery dataset described here: TransCoder.
Let's consider `src/data/loader.py` for getting into the inner contents of `valid.python.pth`. In the same way, we can consider an example of a Python dataset in TransCoder: `/source-code-summarization/transcoder/transcoder/data/test_dataset/python`
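A minimal inspection sketch, assuming the `.pth` layout described in the binarization section below:

```python
import torch

# Load the binarized split and look at what preprocessing stored in it.
data = torch.load("valid.python.pth")
print(data.keys())             # expected: dico, positions, sentences, unk_words
print(data["sentences"][:20])  # flat array of word indices
```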
After running `pytest preprocessing/test_preprocess.py` I got the following type of data:

In order to run preprocess:
Current problem:

Fixed: store both JSONs in `gzip`-compressed mode.
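A minimal sketch of the fix; the file name and record layout here are hypothetical:

```python
import gzip
import json

# Write the JSON gzip-compressed, then read it back.
records = [{"function": "def f(): pass", "docstring": "Does nothing."}]

with gzip.open("functions.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f)

with gzip.open("functions.json.gz", "rt", encoding="utf-8") as f:
    assert json.load(f) == records
```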
Then after preprocess we get the following `.XLM-syml` folder:

Now transform it to be put onto a single GPU. That should be done at the end; see the issue.
For now just create a symlink from `train.python.0.pth` to `train.python.pth`.
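A sketch of this step in Python (equivalent to `ln -s`):

```python
import os

# Make train.python.pth point at train.python.0.pth so the loader finds
# the file name it expects.
if not os.path.exists("train.python.pth"):
    os.symlink("train.python.0.pth", "train.python.pth")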
A command to pretrain with MLM:
Current problem:

Reduced model size from 77M to 19M via smaller embeddings and `n_heads`, moved `batch_size` from 32 to 16, started training.
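A rough, hypothetical parameter-count sketch; the dimensions below are assumptions, not the actual config behind the 77M and 19M figures, but they show why shrinking the embedding dimension cuts model size fastest:

```python
def transformer_params(vocab, emb_dim, n_layers, ffn_mult=4):
    # The embedding table dominates; each layer adds attention + FFN weights.
    embeddings = vocab * emb_dim
    per_layer = 4 * emb_dim ** 2 + 2 * ffn_mult * emb_dim ** 2
    return embeddings + n_layers * per_layer

print(transformer_params(vocab=64_000, emb_dim=1024, n_layers=6))  # ~141M
print(transformer_params(vocab=64_000, emb_dim=256, n_layers=6))   # ~21M
```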
Current: `Volatile GPU-Util ERR!`. Explanation: `nvidia-smi` is not supported on WSL2 yet.
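A quick sanity check from inside Python (an assumption: CUDA itself works on WSL2 even though `nvidia-smi` cannot report utilization):

```python
import torch

print(torch.cuda.is_available())      # True if the GPU is visible to CUDA
print(torch.cuda.get_device_name(0))  # name of the passed-through GPU
```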
TransCoder Preprocessing

Obtain JSONs:

Apply tokenization:
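As a rough illustration (not TransCoder's exact tokenizer output), the standard-library `tokenize` module splits Python source into the kind of token stream used downstream:

```python
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"
# Keep only non-whitespace token strings for a readable, space-joined form.
tokens = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)
          if t.string.strip()]
print(" ".join(tokens))  # def add ( a , b ) : return a + b
```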
How does preprocess work:
How does binarization work:
Operating with the class `Dictionary` located in `XLM/src/data/dictionary.py`. The static function `Dictionary.index_data` fills the following fields:

Where:

- `dico` is an instance of a `Dictionary` object. It stores `id2word`, `word2id`, `counts` and special token indices such as `bos_index`, `eos_index`, `pad_index`, `unk_index`.
- `positions` is an `np.ndarray` of tuples storing the `(beginning, length)` of sentences;
- `sentences` is an `np.ndarray` storing word indices for every sentence, without padding;
- `unk_words` counts the number of occurrences of each unknown word.

Afterwards `data` is saved with `torch.save`.
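A minimal sketch built only on the fields described above; the `(beginning, length)` interpretation of `positions` follows that description:

```python
import torch

data = torch.load("valid.python.pth")
dico, positions, sentences = data["dico"], data["positions"], data["sentences"]

# Rebuild the first sentence from the flat index array.
beginning, length = positions[0]
token_ids = sentences[beginning:beginning + length]
print(" ".join(dico.id2word[int(i)] for i in token_ids))
```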
A closer look at the file structure after preprocessing:
Main suffixes are `.functions_class` and `.functions_standalone`. Consider samples from both of them:

`.functions_class`:

`.functions_standalone`:

`standalone` refers to functions defined outside of any class; `functions_class` entries are methods of certain classes.
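A hypothetical illustration (not actual dataset samples) of what each suffix refers to:

```python
# Would fall under .functions_standalone: defined at module level.
def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]

# Would fall under .functions_class: a method of a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total
```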