
TrainingWrapper does not support line breaks #23

Open
0x7o opened this issue May 15, 2022 · 8 comments

Comments

@0x7o

0x7o commented May 15, 2022

Notebook
When training RETRO with the standard methods, TrainingWrapper does not preserve line breaks in the dataset. This can have a bad effect on many NLP tasks.

Input *.txt:

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

Second Citizen:
Would you proceed especially against Caius Marcius?

All:
Against him first: he's a very dog to the commonalty.

Model output after training:

some - - on my head, were even so salts to death strike That which may bet with tears I have found to life, which sweeter than now to dony : be known betwixcombed oaths ring yet in Corioli turnseth from him Dear life redeems doth thinkment for faith ; Or shall be slack than death within this face, PETRUCHIO : Now, wind and house or free thee better now. KATHARINA : Now, in mine honourable fellow : in your chat with me to be it, alive, I think, If to use than my wife, if this rebellious earth Have you will break out The strange s of yours cro
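For context (a guess at the mechanism, not something confirmed in this thread): BERT-style tokenizers normalize all whitespace before wordpiece splitting, so newlines become indistinguishable from spaces and are gone by the time text reaches the model. A minimal pure-Python sketch of that normalization step, assuming behavior like HuggingFace's BasicTokenizer:

```python
# Sketch of the whitespace cleanup a BERT-style tokenizer applies before
# wordpiece splitting (assumption: mirrors BasicTokenizer, which treats
# "\n", "\t", etc. as plain whitespace).

def basic_whitespace_tokenize(text: str) -> list[str]:
    # str.split() with no argument splits on ANY whitespace run,
    # so "\n" is indistinguishable from " " after this step.
    return text.split()

sample = "First Citizen:\nWe are accounted poor citizens, the patricians good."
tokens = basic_whitespace_tokenize(sample)
rejoined = " ".join(tokens)  # the line structure is gone
```

Any detokenization built on top of this can only ever emit spaces between tokens, which matches the run-on output above.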
@lucidrains
Owner

@0x7o ohh interesting, this must be some issue with the default BERT tokenizer, i'll take a look next week

@0x7o
Author

0x7o commented May 16, 2022

Judging by the error below, the script processes the file as a whole, not in batches:

Token indices sequence length is longer than the specified maximum sequence length for this model (3449121 > 512). Running this sequence through the model will result in indexing errors
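That warning means the entire file was tokenized as one sequence. A common workaround (a sketch only, not what TrainingWrapper actually does internally) is to window the token stream into model-sized chunks:

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split one long token-id sequence into non-overlapping windows of
    at most max_len ids. (Hypothetical helper: the library's real
    chunking may differ, e.g. adding overlap or special tokens.)"""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

ids = list(range(1100))        # stand-in for a long tokenized file
chunks = chunk_token_ids(ids)  # 512 + 512 + 76 ids
```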

@0x7o
Author

0x7o commented May 18, 2022

@lucidrains, the tokenizer distorts the text. I think the problem is the difference between bert-cased and bert-uncased.
Here is an example of a dataset built from program code:

} \ n else if ( GameManager. _ instance. won & &! GameManager. _ instance. keepPlaying ) { " won " } \ n else { " running " } \ n''') \ n \ n def get _ score ( self ) : \ n return self. execute ('GameManager. _ instance. score') \ n \ n def get _ board ( self ) : \ n # Chrome refuses to serialize the Grid object directly through the debugger. \ n grid = json. loads ( self. execute ('JSON. stringify ( GameManager. _ instance. grid )') ) \ n \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for row in grid ['cells'] : \ n for cell in row : \ n if cell is None : \ n continue \ n pos = cell ['x'], cell ['y'] \ n tval = cell ['value'] \ n board [ pos [ 1 ] ] [ pos [ 0 ] ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n # We use UDLR ordering ; 2048 uses URDL ordering \ n move = [ 0, 2, 3, 1 ] [ move ] \ n self. execute ('GameManager. _ instance. move ( % d )'% move ) \ n \ nclass Keyboard2048Control ( Generic2048Control ) : \ n'''Control 2048 by accessing the DOM and using key events. \ n \ n This is relatively slow, and may be prone to race conditions if your \ n browser is slow. However, it is more generally compatible with various \ n clones of 2048.'''\ n \ n def setup ( self ) : \ n self. execute ( \ n'''\ n var elems = document. getElementsByTagName ('div') ; \ n for ( var i in elems ) \ n if ( elems [ i ]. className = ='tile - container') { \ n tileContainer = elems [ i ] ; \ n break ; \ n } \ n''') \ n \ n def get _ score ( self ) : \ n score = self. execute ('''\ n var scoreContainer = document. querySelector ( ". score - container " ) ; \ n var scoreText ='' ; \ n var scoreChildren = scoreContainer. childNodes ; \ n for ( var i = 0 ; i < scoreChildren. length ; + + i ) { \ n if ( scoreChildren [ i ]. nodeType = = Node. TEXT _ NODE ) { \ n scoreText + = scoreChildren [ i ]. 
textContent ; \ n } \ n } \ n scoreText ; \ n''') \ n \ n return int ( score ) \ n \ n def get _ board ( self ) : \ n res = self. execute ( \ n'''\ n var res = [ ] ; \ n var tiles = tileContainer. children ; \ n for ( var i = 0 ; i < tiles. length ; i + + ) \ n res. push ( tiles [ i ]. className ) ; \ n res \ n''') \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for tile in res : \ n tval = pos = None \ n for k in tile. split ( ) : \ n m = re. match ( r'^ tile - ( \ d + ) $ ', k ) \ n if m : \ n tval = int ( m. group ( 1 ) ) \ n m = re. match ( r'^ tile - position - ( \ d + ) - ( \ d + ) $ ', k ) \ n if m : \ n pos = int ( m. group ( 1 ) ), int ( m. group ( 2 ) ) \ n board [ pos [ 1 ] - 1 ] [ pos [ 0 ] - 1 ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n key = [ 38, 40, 37, 39 ] [ move ] \ n self. send _ key _ event ('keydown ', key ) \ n time. sleep ( 0. 01 ) \ n self. send _ key _ event ('keyup ', key ) \ n time. sleep ( 0. 05 ) \ n \ nclass Hybrid2048Control ( Fast2048Control, Keyboard2048Control ) : \ n'''Control 2048 by hooking the GameManager and using keyboard inputs. \ n \ n This is safe and fast, and correctly generates keyboard events for compatibility. \ n'''\ n \ n setup = Fast2048Control. setup \ n get _ status = Keyboard2048Control. get _ status \ n get _ score = Fast2048Control. get _ score \ n get _ board = Fast2048Control. get _ board \ n execute _ move = Keyboard2048Control. execute _ move \ n # Preprocess cornell movie dialogs dataset \ n \ nfrom multiprocessing import Pool \ nimport argparse \ nimport pickle \ nimport random \ nimport os \ nfrom urllib. request import urlretrieve \ nfrom zipfile import ZipFile \ nfrom pathlib import Path \ nfrom tqdm import tqdm \ nfrom model. utils import Tokenizer, Vocab, PAD _ TOKEN, SOS _ TOKEN, EOS _ TOKEN \ n \ nproject _ dir = Path ( _ _ file _ _ ). resolve ( ). parent \ ndatasets _ dir = project _ dir. 
joinpath ('datasets /') \ ncornell _ dir = datasets _ dir. joinpath ('cornell /') \ n \ n # Tokenizer \ ntokenizer = Tokenizer ('spacy') \ n \ ndef prepare _ cornell _ data ( ) : \ n " " " Download and unpack dialogs " " " \ n \ n zip _ url ='http : / / www. mpi - sws. org / ~ cristian / data / cornell _ movie _ dialogs _ corpus. zip'\ n zipfile _ path = datasets _ dir. joinpath ('cornell. zip') \ n \ n if not datasets _ dir. exists ( ) : \ n datasets _ dir. mkdir ( ) \ n \ n # Prepare Dialog data \ n if not cornell _ dir. exists ( ) : \ n print ( f'Downloading { zip _ url } to { zipfile _ path }') \ n urlretrieve ( zip _ url, zipfile _ path ) \ n print ( f'Successfully downloaded { zipfile _ path }') \ n \ n zip _ ref = ZipFile ( zipfile _ path,'r') \ n zip _ ref. extractall ( datasets _ dir ) \ n zip _ ref. close ( ) \ n \ n datasets _ dir. joinpath ('cornell movie - dialogs corpus'). rename ( cornell _ dir ) \ n \ n else : \ n print ('Cornell Data prepared!') \ n \ n \ ndef loadLines ( fileName, \ n fields = [ " lineID ", " characterID ", " movieID ", " character ", " text " ], \ n delimiter = " + + + $ + + + " ) : \ n " " " \ n Args : \ n fileName ( str ) : file to load \ n field ( set < str > ) : fields to extract \ n Return : \ n dict < dict < str > > : the extracted fields for each line \ n " " " \ n lines = { } \ n \ n with open ( fileName,'r ', encoding ='iso - 8859 - 1') as f : \ n for line in f : \ n values = line. split ( delimiter ) \ n \ n # Extract fields \ n lineObj = { } \ n for i, field in enumerate ( fields ) : \ n lineObj [ field ] = values [ i ] \ n \ n lines [ lineObj ['lineID'] ] = lineObj \ n \ n return lines \ n \ n \ ndef loadConvers

@lucidrains
Owner

@0x7o i think the solution may be to add the newline token https://discuss.huggingface.co/t/feat-tokenizers-how-to-make-models-aware-of-structuring-linebreaks/3711 , however, without training BERT from scratch including the newline token, it may suffer

i can't think of a solution besides finding a model out there that doesn't do away with newlines (i.e. one that doesn't treat them as plain whitespace)
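One way to apply the newline-token idea without retraining BERT from scratch is to swap a visible marker in for `\n` before tokenization and swap it back after decoding. A sketch under that assumption: `[NL]` is a hypothetical marker, and with HuggingFace tokenizers it would additionally need to be registered via `tokenizer.add_special_tokens(...)` plus `model.resize_token_embeddings(len(tokenizer))` so it survives as a single token:

```python
NEWLINE_MARKER = "[NL]"  # hypothetical marker; must be added to the
                         # tokenizer vocab as a special token to be kept
                         # as one unit rather than split into pieces

def protect_newlines(text: str) -> str:
    # Replace each newline with a space-padded marker the tokenizer
    # will see as an ordinary (special) token.
    return text.replace("\n", f" {NEWLINE_MARKER} ")

def restore_newlines(text: str) -> str:
    # Undo the substitution after decoding; the second replace catches
    # markers whose surrounding spaces were eaten by detokenization.
    return text.replace(f" {NEWLINE_MARKER} ", "\n").replace(NEWLINE_MARKER, "\n")

src = "First Citizen:\nWe are accounted poor citizens."
roundtrip = restore_newlines(protect_newlines(src))
```

The newly added embedding row starts untrained, so (as noted above) quality may still suffer until the model is fine-tuned with the marker in place.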

@lucidrains
Owner

the code will also have to be modularized to accept different models and their encoders, as a lot of the logic is specific to BERT-base
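A sketch of what that modularization might look like: a small encoder interface the wrapper could accept, with any concrete backend (BERT, GPT-2 BPE, SentencePiece, ...) behind it. All names here are hypothetical, not part of the current codebase:

```python
from typing import Protocol

class Encoder(Protocol):
    """Hypothetical interface a modularized TrainingWrapper could take
    instead of hard-coding the BERT-base tokenizer."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class WhitespaceEncoder:
    """Toy backend illustrating the interface; a real backend would
    wrap a HuggingFace tokenizer behind the same two methods."""
    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}
        self.inv: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for tok in text.split(" "):
            if tok not in self.vocab:
                self.vocab[tok] = len(self.inv)
                self.inv.append(tok)
            ids.append(self.vocab[tok])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.inv[i] for i in ids)

enc = WhitespaceEncoder()
ids = enc.encode("hello world hello")
```

With an interface like this, the BERT-specific logic becomes just one implementation rather than an assumption baked into the wrapper.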

@aakashgoel12


@0x7o The notebook is not accessible (it says the link doesn't exist). Can you please share a working link? It's very important for me. Thanks

@sdake

sdake commented Jun 30, 2023

@0x7o would you share your notebook from above? If not, that's cool, or if it's long gone, I get that.

Thank you,
-steve

@0x7o
Author

0x7o commented Jun 30, 2023

I appreciate your interest immensely, but I no longer have access to this notebook as it has been irretrievably deleted.


4 participants