
TrainingWrapper does not support line breaks #23

Open
0x7o opened this issue May 15, 2022 · 8 comments

Comments

@0x7o

0x7o commented May 15, 2022

Notebook
When training RETRO with the standard methods, TrainingWrapper does not preserve line breaks in the dataset. This can have a bad effect on many NLP tasks.

Input *.txt:

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

Second Citizen:
Would you proceed especially against Caius Marcius?

All:
Against him first: he's a very dog to the commonalty.

Model output after training:

some - - on my head, were even so salts to death strike That which may bet with tears I have found to life, which sweeter than now to dony : be known betwixcombed oaths ring yet in Corioli turnseth from him Dear life redeems doth thinkment for faith ; Or shall be slack than death within this face, PETRUCHIO : Now, wind and house or free thee better now. KATHARINA : Now, in mine honourable fellow : in your chat with me to be it, alive, I think, If to use than my wife, if this rebellious earth Have you will break out The strange s of yours cro
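For context (a guess at the mechanism, not something confirmed in this thread): BERT-style tokenizers normalize all whitespace before wordpiece splitting, so newlines become indistinguishable from spaces and are gone by the time text reaches the model. A minimal pure-Python sketch of that normalization step, assuming behavior like HuggingFace's BasicTokenizer:

```python
# Sketch of the whitespace cleanup a BERT-style tokenizer applies before
# wordpiece splitting (assumption: mirrors BasicTokenizer, which treats
# "\n", "\t", etc. as plain whitespace).

def basic_whitespace_tokenize(text: str) -> list[str]:
    # str.split() with no argument splits on ANY whitespace run,
    # so "\n" is indistinguishable from " " after this step.
    return text.split()

sample = "First Citizen:\nWe are accounted poor citizens, the patricians good."
tokens = basic_whitespace_tokenize(sample)
rejoined = " ".join(tokens)  # the line structure is gone
```

Any detokenization built on top of this can only ever emit spaces between tokens, which matches the run-on output above.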
@lucidrains
Owner

@0x7o ohh interesting, this must be some issue with the default BERT tokenizer, i'll take a look next week

@0x7o
Author

0x7o commented May 16, 2022

Judging by the error below, the script processes the file as a whole, not in batches:

Token indices sequence length is longer than the specified maximum sequence length for this model (3449121 > 512). Running this sequence through the model will result in indexing errors
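That warning means the entire file was tokenized as one sequence. A common workaround (a sketch only, not what TrainingWrapper actually does internally) is to window the token stream into model-sized chunks:

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split one long token-id sequence into non-overlapping windows of
    at most max_len ids. (Hypothetical helper: the library's real
    chunking may differ, e.g. adding overlap or special tokens.)"""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

ids = list(range(1100))        # stand-in for a long tokenized file
chunks = chunk_token_ids(ids)  # 512 + 512 + 76 ids
```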

@0x7o
Author

0x7o commented May 18, 2022

@lucidrains, the tokenizer distorts the text. I think the problem is the difference between bert-cased and bert-uncased.
Here is an example of a dataset built from program code:

} \ n else if ( GameManager. _ instance. won & &! GameManager. _ instance. keepPlaying ) { " won " } \ n else { " running " } \ n''') \ n \ n def get _ score ( self ) : \ n return self. execute ('GameManager. _ instance. score') \ n \ n def get _ board ( self ) : \ n # Chrome refuses to serialize the Grid object directly through the debugger. \ n grid = json. loads ( self. execute ('JSON. stringify ( GameManager. _ instance. grid )') ) \ n \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for row in grid ['cells'] : \ n for cell in row : \ n if cell is None : \ n continue \ n pos = cell ['x'], cell ['y'] \ n tval = cell ['value'] \ n board [ pos [ 1 ] ] [ pos [ 0 ] ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n # We use UDLR ordering ; 2048 uses URDL ordering \ n move = [ 0, 2, 3, 1 ] [ move ] \ n self. execute ('GameManager. _ instance. move ( % d )'% move ) \ n \ nclass Keyboard2048Control ( Generic2048Control ) : \ n'''Control 2048 by accessing the DOM and using key events. \ n \ n This is relatively slow, and may be prone to race conditions if your \ n browser is slow. However, it is more generally compatible with various \ n clones of 2048.'''\ n \ n def setup ( self ) : \ n self. execute ( \ n'''\ n var elems = document. getElementsByTagName ('div') ; \ n for ( var i in elems ) \ n if ( elems [ i ]. className = ='tile - container') { \ n tileContainer = elems [ i ] ; \ n break ; \ n } \ n''') \ n \ n def get _ score ( self ) : \ n score = self. execute ('''\ n var scoreContainer = document. querySelector ( ". score - container " ) ; \ n var scoreText ='' ; \ n var scoreChildren = scoreContainer. childNodes ; \ n for ( var i = 0 ; i < scoreChildren. length ; + + i ) { \ n if ( scoreChildren [ i ]. nodeType = = Node. TEXT _ NODE ) { \ n scoreText + = scoreChildren [ i ]. 
textContent ; \ n } \ n } \ n scoreText ; \ n''') \ n \ n return int ( score ) \ n \ n def get _ board ( self ) : \ n res = self. execute ( \ n'''\ n var res = [ ] ; \ n var tiles = tileContainer. children ; \ n for ( var i = 0 ; i < tiles. length ; i + + ) \ n res. push ( tiles [ i ]. className ) ; \ n res \ n''') \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for tile in res : \ n tval = pos = None \ n for k in tile. split ( ) : \ n m = re. match ( r'^ tile - ( \ d + ) $ ', k ) \ n if m : \ n tval = int ( m. group ( 1 ) ) \ n m = re. match ( r'^ tile - position - ( \ d + ) - ( \ d + ) $ ', k ) \ n if m : \ n pos = int ( m. group ( 1 ) ), int ( m. group ( 2 ) ) \ n board [ pos [ 1 ] - 1 ] [ pos [ 0 ] - 1 ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n key = [ 38, 40, 37, 39 ] [ move ] \ n self. send _ key _ event ('keydown ', key ) \ n time. sleep ( 0. 01 ) \ n self. send _ key _ event ('keyup ', key ) \ n time. sleep ( 0. 05 ) \ n \ nclass Hybrid2048Control ( Fast2048Control, Keyboard2048Control ) : \ n'''Control 2048 by hooking the GameManager and using keyboard inputs. \ n \ n This is safe and fast, and correctly generates keyboard events for compatibility. \ n'''\ n \ n setup = Fast2048Control. setup \ n get _ status = Keyboard2048Control. get _ status \ n get _ score = Fast2048Control. get _ score \ n get _ board = Fast2048Control. get _ board \ n execute _ move = Keyboard2048Control. execute _ move \ n # Preprocess cornell movie dialogs dataset \ n \ nfrom multiprocessing import Pool \ nimport argparse \ nimport pickle \ nimport random \ nimport os \ nfrom urllib. request import urlretrieve \ nfrom zipfile import ZipFile \ nfrom pathlib import Path \ nfrom tqdm import tqdm \ nfrom model. utils import Tokenizer, Vocab, PAD _ TOKEN, SOS _ TOKEN, EOS _ TOKEN \ n \ nproject _ dir = Path ( _ _ file _ _ ). resolve ( ). parent \ ndatasets _ dir = project _ dir. 
joinpath ('datasets /') \ ncornell _ dir = datasets _ dir. joinpath ('cornell /') \ n \ n # Tokenizer \ ntokenizer = Tokenizer ('spacy') \ n \ ndef prepare _ cornell _ data ( ) : \ n " " " Download and unpack dialogs " " " \ n \ n zip _ url ='http : / / www. mpi - sws. org / ~ cristian / data / cornell _ movie _ dialogs _ corpus. zip'\ n zipfile _ path = datasets _ dir. joinpath ('cornell. zip') \ n \ n if not datasets _ dir. exists ( ) : \ n datasets _ dir. mkdir ( ) \ n \ n # Prepare Dialog data \ n if not cornell _ dir. exists ( ) : \ n print ( f'Downloading { zip _ url } to { zipfile _ path }') \ n urlretrieve ( zip _ url, zipfile _ path ) \ n print ( f'Successfully downloaded { zipfile _ path }') \ n \ n zip _ ref = ZipFile ( zipfile _ path,'r') \ n zip _ ref. extractall ( datasets _ dir ) \ n zip _ ref. close ( ) \ n \ n datasets _ dir. joinpath ('cornell movie - dialogs corpus'). rename ( cornell _ dir ) \ n \ n else : \ n print ('Cornell Data prepared!') \ n \ n \ ndef loadLines ( fileName, \ n fields = [ " lineID ", " characterID ", " movieID ", " character ", " text " ], \ n delimiter = " + + + $ + + + " ) : \ n " " " \ n Args : \ n fileName ( str ) : file to load \ n field ( set < str > ) : fields to extract \ n Return : \ n dict < dict < str > > : the extracted fields for each line \ n " " " \ n lines = { } \ n \ n with open ( fileName,'r ', encoding ='iso - 8859 - 1') as f : \ n for line in f : \ n values = line. split ( delimiter ) \ n \ n # Extract fields \ n lineObj = { } \ n for i, field in enumerate ( fields ) : \ n lineObj [ field ] = values [ i ] \ n \ n lines [ lineObj ['lineID'] ] = lineObj \ n \ n return lines \ n \ n \ ndef loadConvers

@lucidrains
Owner

@0x7o i think the solution may be to add the newline token https://discuss.huggingface.co/t/feat-tokenizers-how-to-make-models-aware-of-structuring-linebreaks/3711 , however, without training BERT from scratch including the newline token, it may suffer

i can't think of a solution besides finding a model out there that doesn't do away with newlines (i.e. one that doesn't treat them as plain whitespace)
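One way to apply the newline-token idea without retraining BERT from scratch is to swap a visible marker in for `\n` before tokenization and swap it back after decoding. A sketch under that assumption: `[NL]` is a hypothetical marker, and with HuggingFace tokenizers it would additionally need to be registered via `tokenizer.add_special_tokens(...)` plus `model.resize_token_embeddings(len(tokenizer))` so it survives as a single token:

```python
NEWLINE_MARKER = "[NL]"  # hypothetical marker; must be added to the
                         # tokenizer vocab as a special token to be kept
                         # as one unit rather than split into pieces

def protect_newlines(text: str) -> str:
    # Replace each newline with a space-padded marker the tokenizer
    # will see as an ordinary (special) token.
    return text.replace("\n", f" {NEWLINE_MARKER} ")

def restore_newlines(text: str) -> str:
    # Undo the substitution after decoding; the second replace catches
    # markers whose surrounding spaces were eaten by detokenization.
    return text.replace(f" {NEWLINE_MARKER} ", "\n").replace(NEWLINE_MARKER, "\n")

src = "First Citizen:\nWe are accounted poor citizens."
roundtrip = restore_newlines(protect_newlines(src))
```

The newly added embedding row starts untrained, so (as noted above) quality may still suffer until the model is fine-tuned with the marker in place.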

@lucidrains
Owner

the code will also have to be modularized to accept different models and their encoders, as a lot of the logic is specific to BERT-base
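A sketch of what that modularization might look like: a small encoder interface the wrapper could accept, with any concrete backend (BERT, GPT-2 BPE, SentencePiece, ...) behind it. All names here are hypothetical, not part of the current codebase:

```python
from typing import Protocol

class Encoder(Protocol):
    """Hypothetical interface a modularized TrainingWrapper could take
    instead of hard-coding the BERT-base tokenizer."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class WhitespaceEncoder:
    """Toy backend illustrating the interface; a real backend would
    wrap a HuggingFace tokenizer behind the same two methods."""
    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}
        self.inv: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for tok in text.split(" "):
            if tok not in self.vocab:
                self.vocab[tok] = len(self.inv)
                self.inv.append(tok)
            ids.append(self.vocab[tok])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.inv[i] for i in ids)

enc = WhitespaceEncoder()
ids = enc.encode("hello world hello")
```

With an interface like this, the BERT-specific logic becomes just one implementation rather than an assumption baked into the wrapper.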

@aakashgoel12


@0x7o The notebook is not accessible (it says the link doesn't exist). Can you please share a working link? It's very important for me. Thanks

@sdake

sdake commented Jun 30, 2023

@0x7o would you share your notebook from above? If not, that's cool, or if it's long gone, I get that.

Thank you,
-steve

@0x7o
Author

0x7o commented Jun 30, 2023

I appreciate your interest immensely, but I no longer have access to this notebook as it has been irretrievably deleted.


4 participants