Final Project in BoostCamp AI Tech (2nd cohort) by Team Metamong (Team 2)
κ³ μ°½μ© | λ°λ²μ§ | λ°μν | μλͺ μ² | μ΄κΈ°μ± | μ΄μλΉ | μ μ μ |
---|---|---|---|---|---|---|
Github | Github | Github | Github | Github | Github | Github |
- Project Abstract
- How to use
- Result
- Hardware
- Operating System
- Archive Contents
- Getting Started
- Arguments
- Running Command
- Reference
🔥 Korean Document Title Generator via Abstractive Summarization 🔥
- Dataset (a sketch of a single record follows this list)
  - Types: paper data (162,341 samples) and document data (371,290 samples)
  - train_data: 275,219 samples (Text, Title, Document Type)
  - validation_data: 91,741 samples (Text, Title, Document Type)
  - test_data: 81,739 samples (Text, Title, Document Type)
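A minimal sketch of what a single record looks like, with field names assumed from the text_column / title_column defaults and the doc type option documented in the Arguments section below (the actual column names and type labels may differ):

```python
# Hypothetical sample record; the field names and the doc-type label are assumptions.
sample = {
    "text": "...",        # document body (paper abstract or general document text)
    "title": "...",       # gold title, used as the summarization target
    "doc_type": "paper",  # document type tag (paper vs. general document)
}
```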
import torch
from transformers import AutoConfig, AutoTokenizer
from models.modeling_kobigbird_bart import EncoderDecoderModel

# Load the fine-tuned KoBigBird-BART title generator from the Hugging Face Hub.
config = AutoConfig.from_pretrained('metamong1/bigbird-tapt-ep3')
tokenizer = AutoTokenizer.from_pretrained('metamong1/bigbird-tapt-ep3')
model = EncoderDecoderModel.from_pretrained('metamong1/bigbird-tapt-ep3', config=config)

# Input document: in this example, a Korean paper abstract on finding an optimal
# intermodal transport network for import/export container cargo in the Seoul
# metropolitan area with a genetic algorithm.
text = "..."

# Wrap the encoded text with BOS/EOS tokens and generate a title.
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]
summary_ids = model.generate(torch.tensor([input_ids]))
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)
Ground-truth title
A Study on an Intermodal Transport Optimization Model Using a Genetic Algorithm
Generated title
Optimal Intermodal Transport of Import/Export Container Cargo in the Seoul Metropolitan Area Using a Genetic Algorithm
 | RougeL |
---|---|
Test | 41.687 |
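The score above is ROUGE-L on the test split, computed by the project's utils/rouge.py. For orientation only, here is a minimal sketch of computing ROUGE with the datasets library; it assumes the rouge_score package is installed, and the generic metric tokenizes on English words, so the project's Korean-aware implementation is what produces the reported number.

```python
from datasets import load_metric

rouge = load_metric("rouge")  # backed by the rouge_score package
predictions = ["a generated title"]
references = ["the ground-truth title"]
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"].mid.fmeasure)  # aggregated ROUGE-L F1
```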
- Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
- NVIDIA Tesla V100-SXM2-32GB
- Ubuntu 18.04.5 LTS
- final-project-level3-nlp-02 : directory containing the implementation code, model checkpoints, and model results
final-project-level3-nlp-02/
├── model
│   ├── args
│   │   ├── __init__.py
│   │   ├── DataTrainingArguments.py
│   │   ├── GenerationArguments.py
│   │   ├── LoggingArguments.py
│   │   ├── ModelArguments.py
│   │   └── Seq2SeqTrainingArguments.py
│   ├── models
│   │   ├── modeling_distilbert.py
│   │   ├── modeling_kobigbird_bart.py
│   │   └── modeling_longformer_bart.py
│   ├── utils
│   │   ├── data_collator.py
│   │   ├── data_loader.py
│   │   ├── data_preprocessor.py
│   │   ├── rouge.py
│   │   └── trainer.py
│   ├── optimization
│   │   ├── knowledge_distillation.py
│   │   ├── performance_test.py
│   │   ├── performnaceBenchmark.py
│   │   └── quantization.py
│   ├── predict.py
│   ├── pretrain.py
│   ├── REDAME.md
│   ├── running.sh
│   └── train.py
├── serving
│   ├── app.py
│   ├── GenerationArguments.py
│   ├── postprocessing.py
│   ├── predict.py
│   ├── utils.py
│   └── viz.py
├── .gitignore
└── requirements.sh
utils
utils/data_collator.py
: builds the batches that are fed into the model (see the batching sketch after this list)
utils/data_preprocessor.py
: preprocesses the raw text
utils/processor.py
: integer-encodes the text using the tokenizer
utils/rouge.py
: computes the model's evaluation metric (ROUGE)
utils/trainer.py
: trainer code used to train the model
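As referenced above, here is a minimal sketch of the kind of batching utils/data_collator.py performs, built on the stock DataCollatorForSeq2Seq from transformers. This is an illustrative assumption: the project's own collator also has to handle extras such as doc_type_ids and may differ in detail.

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-base-v1")

# Pads input_ids to the longest example in the batch, builds the attention mask,
# and pads labels with -100 so padded positions are ignored by the loss.
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)

features = [
    {"input_ids": tokenizer("a short document body")["input_ids"],
     "labels": tokenizer("a title")["input_ids"]},
    {"input_ids": tokenizer("a somewhat longer document body for the second example")["input_ids"],
     "labels": tokenizer("another title")["input_ids"]},
]
batch = collator(features)  # dict of padded tensors ready for the model
```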
models
modeling_kobigbird_bart.py
: BigBird-BART model code
modeling_longformer_bart.py
: Longformer-BART model code
optimization
knowledge_distillation.py
performanceBenchmark.py
performance_test.py
quantization.py
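As one illustration of what quantization.py presumably covers, here is a minimal post-training dynamic quantization sketch in PyTorch; this is an assumption, and the actual script may use a different scheme or target model.

```python
import torch
from transformers import BartForConditionalGeneration

# Hypothetical illustration: quantize the Linear layers of a seq2seq model to int8
# for a smaller memory footprint and faster CPU inference; generate() still works.
model = BartForConditionalGeneration.from_pretrained("gogamza/kobart-base-v1")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```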
predict.py
: generates a title for an input document using the model
pretrain.py
: code for pretraining the summarization model
train.py
: code for fine-tuning the summarization model
- python=3.8.5
- transformers==4.11.0
- datasets==1.15.1
- torch==1.10.0
- streamlit==1.1.0
- elasticsearch==7.16.1
- pyvis==0.1.9
- plotly==5.4.0
sh requirements.sh
ModelArguments
argument | description | default |
---|---|---|
model_name_or_path | pretrained model to use | gogamza/kobart-base-v1 |
use_model | model type to use [auto, bigbart, longbart] | auto |
config_name | path to a pretrained model config | None |
tokenizer_name | path to a customized tokenizer | None |
use_fast_tokenizer | whether to use a fast tokenizer | True |
hidden_size | embedding hidden dimension size | 128 |
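The argument groups on this page correspond to the dataclasses under model/args/ and are parsed by train.py. A minimal sketch of the usual HfArgumentParser wiring, with field names taken from the table above; the exact dataclass definitions in the repository may differ.

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser, Seq2SeqTrainingArguments

@dataclass
class ModelArguments:
    # Only two of the fields from the table above are shown for brevity.
    model_name_or_path: str = field(default="gogamza/kobart-base-v1")
    use_model: str = field(default="auto")

# Invoked as e.g.: python train.py --model_name_or_path ... --output_dir checkpoint/...
parser = HfArgumentParser((ModelArguments, Seq2SeqTrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses()
```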
longBART specific
argument | description | default |
---|---|---|
attention_window_size | attention window size | 256 |
attention_head_size | number of attention heads | 4 |
encoder_layer_size | number of encoder layers | 3 |
decoder_layer_size | number of decoder layers | 3 |
DataTrainingArguments
argument | description | default |
---|---|---|
text_column | name of the body-text column in the dataset | text |
title_column | name of the title column in the dataset | title |
overwrite_cache | overwrite the cached training and evaluation sets | False |
preprocessing_num_workers | number of processes used for preprocessing | 1 |
max_source_length | maximum text (source) sequence length | 1024 |
max_target_length | maximum title (target) sequence length | 128 |
pad_to_max_length | pad sequences to the maximum length | False |
num_beams | beam size for beam search during evaluation | None |
use_auth_token_path | path to the private key (auth token) used when loading datasets or models from Hugging Face | ./use_auth_token.env |
num_samples | number of samples drawn from train_dataset (None uses the entire dataset) | None |
relative_eval_steps | number of times evaluation is run during training | 10 |
is_pretrain | whether this is a pretraining run | False |
is_part | use only about 50% of the full dataset | False |
use_preprocessing | whether to apply preprocessing | False |
use_doc_type_ids | whether to use doc_type_embedding | False |
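To make max_source_length and max_target_length concrete, here is a minimal sketch of how a (text, title) pair is typically encoded for a seq2seq model. This is an illustrative assumption; the project's actual preprocessing lives in utils/data_preprocessor.py and utils/processor.py.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-base-v1")
max_source_length = 1024  # defaults from the table above
max_target_length = 128

def preprocess(example):
    # Encode the document body (source) and the title (target), truncating each
    # to its configured maximum length; the title ids become the labels.
    model_inputs = tokenizer(example["text"], max_length=max_source_length, truncation=True)
    labels = tokenizer(example["title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```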
LoggingArguments
argument | description | default |
---|---|---|
wandb_unique_tag | name under which the run is logged to wandb | None |
dotenv_path | path to the file that registers the wandb key | ./wandb.env |
project_name | project name used on wandb | Kobart |
GenerationArguments
argument | description | default |
---|---|---|
max_length | maximum length of the generated sentence | None |
min_length | minimum length of the generated sentence | 1 |
length_penalty | strength of the penalty applied based on sentence length | 1.0 |
early_stopping | whether to stop generation once num_beams complete sentences have been produced | True |
output_scores | whether to output prediction scores | False |
no_repeat_ngram_size | size of n-grams that may not be generated repeatedly | 3 |
num_return_sequences | number of generated sentences to return | 1 |
top_k | K value for top-K filtering | 50 |
top_p | probability threshold used when selecting the next token during generation (nucleus sampling) | 0.95 |
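These options are forwarded to the transformers generate() method at inference time. A minimal sketch, reusing model, tokenizer, and input_ids from the How to use snippet above and filling in illustrative values (num_beams matches the predict command further below; top_k / top_p only take effect when sampling is enabled, so they are omitted here):

```python
import torch

# Beam-search generation with the generation arguments listed above.
summary_ids = model.generate(
    torch.tensor([input_ids]),
    num_beams=3,
    max_length=128,
    min_length=1,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
    num_return_sequences=1,
)
title = tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)
```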
Seq2SeqTrainingArguments
argument | description | default |
---|---|---|
metric_for_best_model | criterion used to select the model saved after training | rougeL |
es_patience | patience value for early stopping | 3 |
is_noam | whether to use the Noam scheduler | False |
use_rdrop | whether to use R-Drop | False |
reg_alpha | weight of the KL loss applied when using R-Drop | 0.7 |
alpha | weight of the CE loss during knowledge distillation | 0.5 |
temperature | temperature value used for distillation | 1.0 |
use_original | whether to use the prediction loss during tiny distillation | False |
teacher_check_point | checkpoint of the teacher model | None |
use_teacher_forcing | whether to apply teacher forcing | False |
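For the distillation options (alpha, temperature, teacher_check_point), the table corresponds to the standard soft-target distillation objective. Below is a minimal sketch of that loss, assuming alpha weights the hard-label CE term as described above; the authoritative version is optimization/knowledge_distillation.py.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    # Hard-label cross entropy on the student's predictions (padding labelled -100 is ignored).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    # Soft-target KL divergence against the teacher, scaled by temperature^2.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # alpha weights the CE term, (1 - alpha) the distillation term (assumed convention).
    return alpha * ce + (1.0 - alpha) * kl
```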
$ python train.py \
--model_name_or_path metamong1/bigbird-tapt-ep3 \
--use_model bigbart \
--do_train \
--output_dir checkpoint/kobigbirdbart_full_tapt_ep3_bs16_pre_noam \
--overwrite_output_dir \
--num_train_epochs 3 \
--use_doc_type_ids \
--max_source_length 2048 \
--max_target_length 128 \
--metric_for_best_model rougeLsum \
--es_patience 3 \
--load_best_model_at_end \
--project_name kobigbirdbart \
--wandb_unique_tag kobigbirdbart_full_tapt_ep5_bs16_pre_noam \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 8 \
--use_preprocessing \
--warmup_steps 1000 \
--evaluation_strategy epoch \
--is_noam \
--learning_rate 0.08767941605644963 \
--save_strategy epoch
$ python predict.py \
--model_name_or_path model/baseV1.0_Kobart_ep2_1210 \
--num_beams 3
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Longformer: The Long-Document Transformer
- Big Bird: Transformers for Longer Sequences
- Scheduled Sampling for Transformers
- On the Effect of Dropping Layers of Pre-trained Transformer Models
- R-Drop: Regularized Dropout for Neural Networks