Questions #1
Hi @sachaarbonel, thanks for taking the time to test the model. I'm thinking about improving it, since I find it sometimes wonky, but for now doing so would cost me around US$ 180.00 on GCP (I now have a much bigger corpus [check out my repo with a Word2Vec model trained on this new corpus] to train on, and perhaps a bigger vocab and more training epochs). Anyway, as for your question about the Ġ, I think it's normal given the tokenizer used (here I'm inferring from other models trained with ByteLevelBPETokenizer, like roberta-base, but I need to explore this further to give you a more precise answer). As for the collaboration, of course I could, and I really appreciate it. For the record, my idea for the model's next step would be to train on POS-tagged data and on my main area of research, which is sentiment analysis. I'm already searching for Portuguese POS-tagged data beyond the UD dataset... and found the following two great repos on GitHub.
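For anyone wondering about the Ġ: byte-level BPE tokenizers (the kind roberta-base uses) fold the leading space of a word into the token and render it as Ġ, so any token that follows a space shows up with that prefix. A quick way to see it with the transformers library (roberta-base is only used here as an example, not BR_BERTo itself):

```python
from transformers import AutoTokenizer

# roberta-base uses a byte-level BPE vocabulary: the space before a word is
# encoded into the token itself and displayed as "Ġ".
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'Ġworld']
```
|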
Nice! I didn't find those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated. |
Hi, Rodolfo. You mention in your model card that you used the HF script for training a model from scratch. How much data did you use? I have done several experiments, and since the script uses LineByLineTextDataset, which loads everything into memory, I could not train on more than about 600 MB of data. Was this your case, or did you modify anything?
Best,
Manu
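For context, this is roughly how the HF run_language_modeling-style scripts build that dataset; LineByLineTextDataset reads and tokenizes the entire file in its constructor, which is why the whole corpus has to fit in RAM (the file path below is just a placeholder):

```python
from transformers import LineByLineTextDataset, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# The whole file is read, split into lines, and tokenized inside __init__,
# so memory usage grows with corpus size before training even starts.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="corpus.txt",  # placeholder path
    block_size=128,
)
```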
|
It corresponds to almost ~4 days of training on a T4 or P100, it depends... but the dataset I'm using now is almost 3x bigger than the one I previously trained on. No worries about the datasets... I'm also going to explore them, as I mentioned above. |
@mrm8488 I did need to change LineByLineTextDataset... I built my own class based on it, since I'm now planning to train on a 2.5 GB dataset. The model that is on the Hugging Face website was trained on a 900 MB corpus... a small corpus. Please check out the code here => https://github.com/rdenadai/BR-BERTo/blob/master/transformer.py#L16 It lazily loads each line of the dataset using pandas' read_csv method...
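A hypothetical sketch of that lazy-loading idea (not the actual code at the link above; class and argument names here are made up): each __getitem__ reads a single line with pandas instead of holding the corpus in memory.

```python
import csv

import pandas as pd
from torch.utils.data import Dataset


class LazyLineDataset(Dataset):
    """Read one line of a plain-text corpus per __getitem__."""

    def __init__(self, tokenizer, file_path, n_lines, block_size=128):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.n_lines = n_lines  # number of lines, counted beforehand
        self.block_size = block_size

    def __len__(self):
        return self.n_lines

    def __getitem__(self, idx):
        # Read only the idx-th line; the unusual separator plus QUOTE_NONE
        # keeps the whole line in a single "text" column.
        row = pd.read_csv(
            self.file_path,
            skiprows=idx,
            nrows=1,
            header=None,
            names=["text"],
            sep="\x01",
            quoting=csv.QUOTE_NONE,
            engine="python",
        )
        enc = self.tokenizer(
            str(row["text"].iloc[0]),
            truncation=True,
            max_length=self.block_size,
            padding="max_length",
            return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}
```
|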
I see, good job. Doesn't your CustomDataset scale?
I need to build one for 200 GB of data. I think I'm going to use HF/NLP, which converts even plain text files to Apache Arrow format.
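A minimal sketch of that route, assuming the library now published as `datasets` (it was called `nlp` at the time); it converts plain-text files to Apache Arrow and memory-maps them, so the corpus doesn't have to fit in RAM:

```python
from datasets import load_dataset

# Each line of the text file becomes one record; the data lives on disk as
# Arrow files and is memory-mapped rather than loaded into RAM.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
print(dataset[0])  # {'text': '...first line of the file...'}
```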
|
I didn't try with that much data... only 2.5 GB... One thing you could try changing in my custom class: instead of loading the file with pandas, change that line to use Dask... that way you can scale up much more, and since Dask uses pyarrow internally, you can also use it to read Parquet files. A simple approach: point pandas at your file and see if it loads... then try Dask... and then adapt the class to better fit your needs.
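A rough sketch of the Dask swap he describes (file paths and block size are placeholders): Dask reads the corpus in partitions instead of all at once, and with pyarrow installed the same API also covers Parquet.

```python
import dask.dataframe as dd

# Read the corpus lazily in ~64 MB partitions instead of one big DataFrame.
ddf = dd.read_csv("corpus/*.csv", blocksize="64MB")

# Or, with pyarrow installed, read Parquet files the same way:
# ddf = dd.read_parquet("corpus/*.parquet", engine="pyarrow")

# Materialize one partition at a time, e.g. inside a Dataset's __getitem__.
first_partition = ddf.get_partition(0).compute()
print(len(first_partition))
```
|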
Thank you so much for your advice
|
And why seq_length = 128 instead of 512? You wanted to try a small model
first?
|
Yeap... I don't have enough compute power (my computer has a GTX 1060) or money (GCP charges in dollars) to make a bigger model for now.
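For context on why seq_length = 128 is so much cheaper: self-attention memory and compute grow roughly quadratically with sequence length, so 128 instead of 512 makes a big difference on a single GPU. A hypothetical "small" configuration might look like this (all numbers are assumptions, not BR_BERTo's actual settings):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,                # assumed vocabulary size
    max_position_embeddings=128 + 2,  # RoBERTa reserves two extra positions
    num_hidden_layers=6,              # fewer layers than roberta-base's 12
    num_attention_heads=12,
    hidden_size=768,
)
```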
|
I can help clean datasets if you guys need. |
@sachaarbonel thanks, all help is appreciated. In case you guys want the dataset I'm using to train BR_BERTo, just say so and I can send a Google Drive link so you can download it. |
I guess creating a sentiment analysis dataset should be feasible by applying a translation model from Hugging Face to an English dataset.
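Something along these lines, assuming an English-to-Portuguese translation checkpoint is available on the Hub (the model name below is a placeholder, and as noted in the next comment the translations would still need a quality check):

```python
from transformers import pipeline

# Placeholder model id: substitute whichever en->pt checkpoint you pick.
translator = pipeline("translation", model="your-en-to-pt-model")

english_examples = [
    ("this movie was fantastic", "positive"),
    ("what a waste of time", "negative"),
]

# Translate the text and carry the original sentiment label over.
translated = [
    (translator(text)[0]["translation_text"], label)
    for text, label in english_examples
]
print(translated)
```
|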
Yeah, that's one way... and I'm thinking of doing it. One problem is checking whether each phrase is correctly translated or not. Also, most datasets only have 2-3 labels (positive/negative/neutral); since Affective Computing is my area of research interest, I'm looking for a wider range of emotions. |
Hi @rdenadai, thanks for your great work! I was playing around with your model on Hugging Face and I got results starting with `Ġ`; is that normal? Plus, I wanted to know if you'd be willing to collaborate on a fine-tuned POS model. My understanding is that we need a CoNLL-U dataset such as UD_Portuguese-GSD and to clean it up to fit the format used by @mrm8488 in his notebook. I started working on a tool to clean up such datasets.
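A minimal sketch of that cleanup step: pulling (token, UPOS) pairs out of a .conllu file such as the UD_Portuguese-GSD training split (the exact output format @mrm8488's notebook expects isn't specified here, and the file name is assumed from the usual UD layout).

```python
def read_conllu(path):
    """Yield sentences as lists of (token, upos) pairs from a CoNLL-U file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
            elif not line.startswith("#"):  # skip sentence metadata lines
                cols = line.split("\t")
                token_id, form, upos = cols[0], cols[1], cols[3]
                # Skip multi-word token ranges ("1-2") and empty nodes ("1.1").
                if "-" not in token_id and "." not in token_id:
                    sentence.append((form, upos))
    if sentence:
        yield sentence


# Example: flatten the training file into "token<TAB>tag" lines.
for sent in read_conllu("pt_gsd-ud-train.conllu"):
    for form, upos in sent:
        print(f"{form}\t{upos}")
    print()
```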