Korean Title Generator

Final project of the 2nd cohort of BoostCamp AI Tech, by Team Metamong (Team 2)

Demo

Members

고창용 박범진 박상하 안명철 이기성 이예빈 정유석

Content

Project Abstract

🔥 A Korean document title generator based on abstractive summarization 🔥

How to use (to be updated with the final model checkpoint)

import torch
from transformers import AutoConfig, AutoTokenizer
from models.modeling_kobigbird_bart import EncoderDecoderModel

# Load the config, tokenizer, and KoBigBird-BART encoder-decoder model from the Hugging Face Hub
config = AutoConfig.from_pretrained('metamong1/bigbird-tapt-ep3')
tokenizer = AutoTokenizer.from_pretrained('metamong1/bigbird-tapt-ep3')
model = EncoderDecoderModel.from_pretrained('metamong1/bigbird-tapt-ep3', config=config)

text = "본 논문의 목적은 수도권 지역의 수출입 컨테이너 화물에 대한 최적 복합운송 네트워크를 찾는 데 있다. 따라서 이 지역의 컨테이너 화물의 물동량 흐름을 우선적으로 분석하였고, 총 수송비용과 총 수송 시간을 고려한 최적 경로를 구하려 시도하였다. 이를 위해 모형 설정은 0-1 이진변수를 이용한 목적계획법을 사용하였고, 유전알고리즘 기법을 통해 해를 도출하였다. 그 결과, 수도권 지역의 33개 각 시(군)에 대한 내륙 수송비용과 수송 시간을 최소화하는 수송거점 및 운송 수단을 도출함으로써 이 지역의 수출입 컨테이너 화물에 대한 최적 복합운송 네트워크를 발견할 수 있었다. 또한 시나리오별 수송비용 및 수송 시간의 절감 효과를 정량적으로 제시한다."

# Encode the document and wrap it with BOS/EOS tokens
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

# Generate a title and decode it back to text
summary_ids = model.generate(torch.tensor([input_ids]))
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

Ground-truth title

유전알고리즘을 이용한 복합운송최적화모형에관한 연구

Generated title

유전알고리즘을 이용한 수도권의 수출입 컨테이너 화물에 대한 최적 복합운송

Result (to be updated with the final results)

|      | RougeL |
| ---- | ------ |
| Test | 41.687 |
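
For reference, RougeL is the F1 score of the longest common subsequence (LCS) between a generated title and the reference title. The snippet below is a minimal, self-contained illustration of the metric, not the project's own implementation in utils/rouge.py, which may differ in tokenization and aggregation details.

# Minimal Rouge-L (LCS-based F1) sketch over token lists; illustrative only.
def rouge_l_f1(pred_tokens, ref_tokens):
    # Longest common subsequence length via dynamic programming.
    m, n = len(pred_tokens), len(ref_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred_tokens[i - 1] == ref_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

# Example with whitespace tokenization of the titles from the How to use section.
pred = "유전알고리즘을 이용한 수도권의 수출입 컨테이너 화물에 대한 최적 복합운송".split()
ref = "유전알고리즘을 이용한 복합운송최적화모형에관한 연구".split()
print(rouge_l_f1(pred, ref))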

Hardware

  • Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  • NVIDIA Tesla V100-SXM2-32GB

Operating System

  • Ubuntu 18.04.5 LTS

Archive Contents

  • final-project-level3-nlp-02 : directory containing the implementation code, model checkpoints, and model outputs
final-project-level3-nlp-02/
├── model
│   ├── args
│   │   ├── __init__.py
│   │   ├── DataTrainingArguments.py
│   │   ├── GenerationArguments.py
│   │   ├── LoggingArguments.py
│   │   ├── ModelArguments.py
│   │   └── Seq2SeqTrainingArguments.py
│   ├── models
│   │   ├── modeling_distilbert.py
│   │   ├── modeling_kobigbird_bart.py
│   │   └── modeling_longformer_bart.py
│   ├── utils
│   │   ├── data_collator.py
│   │   ├── data_loader.py
│   │   ├── data_preprocessor.py
│   │   ├── rouge.py
│   │   └── trainer.py
│   ├── optimization
│   │   ├── knowledge_distillation.py
│   │   ├── performance_test.py
│   │   ├── performanceBenchmark.py
│   │   └── quantization.py
│   ├── predict.py
│   ├── pretrain.py
│   ├── README.md
│   ├── running.sh
│   └── train.py
├── serving
│   ├── app.py
│   ├── GenerationArguments.py
│   ├── postprocessing.py
│   ├── predict.py
│   ├── utils.py
│   └── viz.py
├── .gitignore
└── requirements.sh

  • utils
    • utils/data_collator.py : builds the batches fed to the model (see the collator sketch after this list)
    • utils/data_preprocessor.py : preprocesses the raw text
    • utils/processor.py : integer-encodes the text with the tokenizer
    • utils/rouge.py : evaluation-metric (Rouge) code for the model
    • utils/trainer.py : trainer used to train the model
  • models
    • modeling_kobigbird_bart.py : BigBird-BART model code
    • modeling_longformer_bart.py : Longformer-BART model code
  • optimization
    • knowledge_distillation.py : distills the teacher model into a smaller student model
    • performanceBenchmark.py : benchmarks model inference performance
    • performance_test.py : tests model performance after optimization
    • quantization.py : quantizes the trained model
  • predict.py : generates a title for an input document with the trained model
  • pretrain.py : pretrains the summarization model
  • train.py : fine-tunes (trains) the summarization model
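
The batching role of utils/data_collator.py is, in spirit, the standard dynamic-padding collation for seq2seq models. Below is a minimal sketch using transformers' built-in DataCollatorForSeq2Seq rather than the project's own class; the token ids are dummy values.

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-base-v1")

# Pads input_ids/attention_mask dynamically per batch and pads labels with -100
# so that the padded positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100)

features = [
    {"input_ids": [0, 100, 200, 2], "labels": [0, 300, 2]},        # dummy ids
    {"input_ids": [0, 100, 2], "labels": [0, 300, 400, 500, 2]},
]
batch = data_collator(features)   # dict of equal-length tensors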

Getting Started

Dependencies

  • python=3.8.5
  • transformers==4.11.0
  • datasets==1.15.1
  • torch==1.10.0
  • streamlit==1.1.0
  • elasticsearch==7.16.1
  • pyvis==0.1.9
  • plotly==5.4.0

Install Requirements

sh requirements.sh

Arguments

Model Arguments

| argument | description | default |
| --- | --- | --- |
| model_name_or_path | Pretrained model to use | gogamza/kobart-base-v1 |
| use_model | Model type to use, one of [auto, bigbart, longbart] | auto |
| config_name | Path to a pretrained model config | None |
| tokenizer_name | Path to a customized tokenizer | None |
| use_fast_tokenizer | Whether to use a fast tokenizer | True |
| hidden_size | Embedding hidden dimension size | 128 |

👇 longBART-specific

| argument | description | default |
| --- | --- | --- |
| attention_window_size | Attention window size | 256 |
| attention_head_size | Number of attention heads | 4 |
| encoder_layer_size | Number of encoder layers | 3 |
| decoder_layer_size | Number of decoder layers | 3 |

DataTrainingArguments

| argument | description | default |
| --- | --- | --- |
| text_column | Name of the body-text column in the dataset | text |
| title_column | Name of the title column in the dataset | title |
| overwrite_cache | Overwrite the cached training and evaluation sets | False |
| preprocessing_num_workers | Number of processes used for preprocessing | 1 |
| max_source_length | Maximum text (input) sequence length | 1024 |
| max_target_length | Maximum title (target) sequence length | 128 |
| pad_to_max_length | Pad all sequences to the maximum length | False |
| num_beams | Beam size used for beam search during evaluation | None |
| use_auth_token_path | Path to the private-key file used when loading datasets or models from the Hugging Face Hub | ./use_auth_token.env |
| num_samples | Number of samples drawn from the training dataset (None uses the entire dataset) | None |
| relative_eval_steps | Number of evaluation rounds | 10 |
| is_pretrain | Whether this is a pretraining run | False |
| is_part | Use roughly 50% of the full dataset | False |
| use_preprocessing | Whether to apply preprocessing | False |
| use_doc_type_ids | Whether to use doc_type embeddings | False |
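
As a rough illustration, max_source_length, max_target_length, and pad_to_max_length typically map onto the tokenizer calls of the preprocessing step as in the standard Hugging Face summarization pattern below; this is a sketch, not necessarily the exact code in utils/.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-base-v1")

def preprocess(examples, max_source_length=1024, max_target_length=128, pad_to_max_length=False):
    padding = "max_length" if pad_to_max_length else False
    # Truncate/pad the document bodies (the `text` column).
    model_inputs = tokenizer(examples["text"], max_length=max_source_length,
                             padding=padding, truncation=True)
    # Tokenize the titles (the `title` column) as decoder targets.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["title"], max_length=max_target_length,
                           padding=padding, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs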

LoggingArguments

| argument | description | default |
| --- | --- | --- |
| wandb_unique_tag | Name of the run/model logged to wandb | None |
| dotenv_path | Path to the file holding the wandb API key | ./wandb.env |
| project_name | Project name logged to wandb | Kobart |
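
In practice, the logging setup amounts to loading the wandb API key from the dotenv file and naming the run. A minimal sketch of that flow, assuming the standard python-dotenv and wandb APIs and a hypothetical WANDB_AUTH_KEY variable inside wandb.env (the actual wiring lives in the training code):

import os
import wandb
from dotenv import load_dotenv

# Read the API key from the file given by --dotenv_path (the variable name is hypothetical).
load_dotenv("./wandb.env")
wandb.login(key=os.environ["WANDB_AUTH_KEY"])

# --project_name and --wandb_unique_tag identify the run on the wandb dashboard.
wandb.init(project="kobigbirdbart", name="kobigbirdbart_full_tapt_ep5_bs16_pre_noam")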

GenerationArguments

| argument | description | default |
| --- | --- | --- |
| max_length | Maximum length of the generated sequence | None |
| min_length | Minimum length of the generated sequence | 1 |
| length_penalty | Penalty applied according to the length of the generated sequence | 1.0 |
| early_stopping | Whether to stop generation once num_beams complete candidates are found | True |
| output_scores | Whether to output prediction scores | False |
| no_repeat_ngram_size | Size of n-grams that may not be repeated in the output | 3 |
| num_return_sequences | Number of generated sequences to return | 1 |
| top_k | K value for top-k filtering | 50 |
| top_p | Cumulative-probability threshold for top-p (nucleus) sampling of the next token | 0.95 |
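
These arguments are forwarded to the model's generate() call. A minimal sketch of the correspondence, reusing the model and input_ids from the How to use example above (top_k and top_p only take effect when sampling is enabled):

# GenerationArguments roughly map onto transformers' generate() keyword arguments;
# the values shown are the defaults from the table above, plus an explicit beam size.
summary_ids = model.generate(
    torch.tensor([input_ids]),
    num_beams=3,                 # e.g. the value passed via --num_beams at predict time
    min_length=1,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
    num_return_sequences=1,
    top_k=50,
    top_p=0.95,
)
tokenizer.decode(summary_ids[0], skip_special_tokens=True)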

Seq2SeqTrainingArguments

| argument | description | default |
| --- | --- | --- |
| metric_for_best_model | Metric used to select the best model saved after training | rougeL |
| es_patience | Patience value for early stopping | 3 |
| is_noam | Whether to use the Noam scheduler | False |
| use_rdrop | Whether to use R-Drop | False |
| reg_alpha | Weight of the KL loss when R-Drop is used | 0.7 |
| alpha | Weight of the CE loss during knowledge distillation | 0.5 |
| temperature | Temperature used for distillation | 1.0 |
| use_original | Whether to use the prediction loss for tiny distillation | False |
| teacher_check_point | Checkpoint of the teacher model | None |
| use_teacher_forcing | Whether to apply teacher forcing | False |
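
The alpha and temperature values control the usual knowledge-distillation objective: the student's cross-entropy loss on the gold titles is blended with a temperature-scaled KL term against the teacher's logits. A minimal sketch of that combination, assuming the standard formulation rather than the exact code in optimization/knowledge_distillation.py:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    # Hard-label cross-entropy of the student against the gold labels.
    ce_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                              labels.view(-1), ignore_index=-100)
    # Soft-label KL divergence against the teacher, scaled by the temperature.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # alpha weights the CE term, (1 - alpha) the distillation term.
    return alpha * ce_loss + (1 - alpha) * kd_loss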

Running Command

Train

$ python train.py \
--model_name_or_path metamong1/bigbird-tapt-ep3 \
--use_model bigbart \
--do_train \
--output_dir checkpoint/kobigbirdbart_full_tapt_ep3_bs16_pre_noam \
--overwrite_output_dir \
--num_train_epochs 3 \
--use_doc_type_ids \
--max_source_length 2048 \
--max_target_length 128 \
--metric_for_best_model rougeLsum \
--es_patience 3 \
--load_best_model_at_end \
--project_name kobigbirdbart \
--wandb_unique_tag kobigbirdbart_full_tapt_ep5_bs16_pre_noam \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 8 \
--use_preprocessing \
--warmup_steps 1000 \
--evaluation_strategy epoch \
--is_noam \
--learning_rate 0.08767941605644963 \
--save_strategy epoch
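
The --is_noam flag switches to the Noam learning-rate schedule from "Attention Is All You Need": the rate warms up for --warmup_steps steps and then decays with the inverse square root of the step. A minimal sketch of that schedule as a LambdaLR multiplier (d_model=768 is an assumed hidden size; the project's trainer may scale it differently, e.g. by the base --learning_rate):

from torch.optim.lr_scheduler import LambdaLR

def noam_lambda(d_model=768, warmup_steps=1000):
    # multiplier(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def lr_lambda(step):
        step = max(step, 1)
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
    return lr_lambda

# scheduler = LambdaLR(optimizer, lr_lambda=noam_lambda(warmup_steps=1000))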

Predict

$ python predict.py \
--model_name_or_path model/baseV1.0_Kobart_ep2_1210 \
--num_beams 3

Reference

  1. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    https://arxiv.org/pdf/1910.13461.pdf

  2. Longformer: The Long-Document Transformer

    https://arxiv.org/pdf/2004.05150.pdf

  3. Big Bird: Transformers for Longer Sequences

    https://arxiv.org/pdf/2007.14062.pdf

  4. Scheduled Sampling for Transformers

    https://arxiv.org/pdf/1906.07651.pdf

  5. On the Effect of Dropping Layers of Pre-trained Transformer Models

    https://arxiv.org/pdf/2004.03844.pdf

  6. R-Drop: Regularized Dropout for Neural Networks

    https://arxiv.org/pdf/2106.14448v2.pdf
