Aspect-based Sentiment Analysis for Vietnamese

Enter your sentence: Lễ tân thân thiện, có thang máy, vị trí ks thuận tiện, view thành phố rất đẹp. Phòng sạch nhưng hơi nhỏ & thiếu bình đun siêu tốc. Sẽ quay lại & giới thiệu bạn bè
=> FACILITIES#DESIGN&FEATURES,positive
=> LOCATION#GENERAL,positive
=> ROOM_AMENITIES#DESIGN&FEATURES,negative
=> ROOMS#CLEANLINESS,positive
=> ROOMS#DESIGN&FEATURES,negative
=> SERVICE#GENERAL,positive


I. Introduction

This work aimed to solve the Aspect-based Sentiment Analysis (ABSA) problem for Vietnamese. Specifically, we focus on 2 sub-tasks of the Aspect Category Sentiment Analysis (ACSA):

  1. Aspect Category Detection (ACD): Detect Aspect#Category pairs in each review (e.g., HOTEL#CLEANLINESS, RESTAURANT#PRICES, SERVICE#GENERAL, etc.)
  2. Sentiment Polarity Classification (SPC): Classify the Sentiment Polarity (Positive, Negative, Neutral) of each Aspect#Category pair.

Here, we proposed 2 End-to-End solutions (ACSA-v1 and ACSA-v2), which used PhoBERT as a Pre-trained language model for Vietnamese to handle the above tasks simultaneously on 2 domains of the VLSP 2018 ABSA Dataset: Hotel and Restaurant.

1. Dataset Overview

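| Domain | Dataset | No. Reviews | No. Aspect#Cate,Polarity | Avg. Length | Vocab Size | No. words in Test/Dev not in Training set |
|---|---|---|---|---|---|---|
| Hotel | Training | 3,000 | 13,948 | 47 | 3,908 | - |
| Hotel | Dev | 2,000 | 7,111 | 23 | 2,745 | 1,059 |
| Hotel | Test | 600 | 2,584 | 30 | 1,631 | 346 |
| Restaurant | Training | 2,961 | 9,034 | 54 | 5,168 | - |
| Restaurant | Dev | 1,290 | 3,408 | 50 | 3,398 | 1,702 |
| Restaurant | Test | 500 | 2,419 | 163 | 3,375 | 1,729 |
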
  • The Hotel domain consists of the following 34 Aspect#Category pairs:
['FACILITIES#CLEANLINESS', 'FACILITIES#COMFORT', 'FACILITIES#DESIGN&FEATURES', 'FACILITIES#GENERAL', 'FACILITIES#MISCELLANEOUS', 'FACILITIES#PRICES', 'FACILITIES#QUALITY', 'FOOD&DRINKS#MISCELLANEOUS', 'FOOD&DRINKS#PRICES', 'FOOD&DRINKS#QUALITY', 'FOOD&DRINKS#STYLE&OPTIONS', 'HOTEL#CLEANLINESS', 'HOTEL#COMFORT', 'HOTEL#DESIGN&FEATURES', 'HOTEL#GENERAL', 'HOTEL#MISCELLANEOUS', 'HOTEL#PRICES', 'HOTEL#QUALITY', 'LOCATION#GENERAL', 'ROOMS#CLEANLINESS', 'ROOMS#COMFORT', 'ROOMS#DESIGN&FEATURES', 'ROOMS#GENERAL', 'ROOMS#MISCELLANEOUS', 'ROOMS#PRICES', 'ROOMS#QUALITY', 'ROOM_AMENITIES#CLEANLINESS', 'ROOM_AMENITIES#COMFORT', 'ROOM_AMENITIES#DESIGN&FEATURES', 'ROOM_AMENITIES#GENERAL', 'ROOM_AMENITIES#MISCELLANEOUS', 'ROOM_AMENITIES#PRICES', 'ROOM_AMENITIES#QUALITY', 'SERVICE#GENERAL']
  • The Restaurant domain consists of the following 12 Aspect#Category pairs:
['AMBIENCE#GENERAL', 'DRINKS#PRICES', 'DRINKS#QUALITY', 'DRINKS#STYLE&OPTIONS', 'FOOD#PRICES', 'FOOD#QUALITY', 'FOOD#STYLE&OPTIONS', 'LOCATION#GENERAL', 'RESTAURANT#GENERAL', 'RESTAURANT#MISCELLANEOUS', 'RESTAURANT#PRICES', 'SERVICE#GENERAL']

2. Constructing *.csv Files for Model Development

To make the dataset easier for models to process, I transformed the original *.txt files into *.csv form using the VLSP2018Parser class in vlsp2018_processor.py. I already provided these *.csv files for both domains in the datasets folder. However, if you want to re-generate them, you can run the following command:

python processors/vlsp2018_processor.py

Each row in the *.csv contains a review and its corresponding Aspect#Category,Polarity labels. The value 1 means that the Aspect#Category exists in the review with a Positive label; likewise, 2 and 3 stand for Negative and Neutral, respectively. Finally, the value 0 indicates that the Aspect#Category does not exist in the review.
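As a rough illustration of this encoding, the sketch below decodes one row back into Aspect#Category,Polarity pairs. The `decode_row` helper and the `Review` column name are hypothetical and not taken from vlsp2018_processor.py; only the 0/1/2/3 convention comes from the description above.

```python
# Convention from the text: 0 = absent, 1 = Positive, 2 = Negative, 3 = Neutral.
POLARITIES = {1: "positive", 2: "negative", 3: "neutral"}

def decode_row(row):
    """Turn one CSV row (as a dict) into a list of (Aspect#Category, polarity) pairs.

    `row` is assumed to hold the review text under a hypothetical 'Review' key
    and one integer-valued column per Aspect#Category.
    """
    labels = []
    for column, value in row.items():
        if column == "Review":
            continue
        value = int(value)  # csv.DictReader yields strings
        if value != 0:
            labels.append((column, POLARITIES[value]))
    return labels

example_row = {
    "Review": "Phòng sạch nhưng hơi nhỏ",
    "ROOMS#CLEANLINESS": "1",
    "ROOMS#DESIGN&FEATURES": "2",
    "SERVICE#GENERAL": "0",
}
print(decode_row(example_row))
```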

III. Vietnamese Preprocessing

👉 I already provided the preprocessed data for this project in the datasets folder.

1. Vietnamese Preprocessing Steps for the VLSP 2018 ABSA Dataset

flowchart LR
    style A fill:#ffccff,stroke:#660066,stroke-width:2px;
    style B fill:#cceeff,stroke:#0066cc,stroke-width:2px;
    style C fill:#ccffcc,stroke:#009933,stroke-width:2px;
    style F fill:#ffcc99,stroke:#ff6600,stroke-width:2px;
    style G fill:#ccccff,stroke:#6600cc,stroke-width:2px;
    style H fill:#ccff99,stroke:#66cc00,stroke-width:2px;
    style I fill:#ffcccc,stroke:#cc0000,stroke-width:2px;

    A[/📄 Input Text/]
    B([🔠 Lowercase])

    subgraph C [VietnameseToneNormalizer]
        direction TB
        C1([🌀 Normalize\nUnicode])
        C2([🖋️ Normalize\nSentence Typing])
        C1 --> C2
    end  

    subgraph E [VietnameseTextCleaner]
        E1{{"<i class='fas fa-code'></i> Remove HTML"}}
        E2{{"<i class='far fa-smile'></i> Remove Emoji"}}
        E3{{"<i class='fas fa-link'></i> Remove URL"}}
        E4{{"<i class='far fa-envelope'></i> Remove Email"}}
        E5{{"<i class='fas fa-phone'></i> Remove Phone Number"}}
        E6{{"<i class='fas fa-hashtag'></i> Remove Hashtags"}}
        E7{{"<i class='fas fa-ban'></i> Remove Unnecessary Characters"}}
        E1 --> E2 --> E3 --> E4 --> E5 --> E6 --> E7 
    end

    F([💬 Normalize\nTeencode])
    G([🛠️ Correct\nVietnamese Errors])
    H([🔪 Word\nSegmentation])
    I[/📄 Preprocessed Text/]

    click G "https://huggingface.co/bmd1905/vietnamese-correction-v2"
    click H "https://github.com/vncorenlp/VnCoreNLP"
    
    A --> B --> C --> E --> E1
    E --> F --> G --> E 
    F --> H --> I

I implemented 3 classes in vietnamese_processor.py to preprocess raw Vietnamese text data. They are my improved version of the work by behitek:

(a) VietnameseTextCleaner: Simple regex-based text cleaning to remove HTML, Emoji, URL, Email, Phone Number, Hashtags, and other unnecessary characters.
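A minimal sketch of what such regex-based cleaning could look like is shown below. The patterns and the `clean_text` function are illustrative assumptions, not the actual implementation of VietnameseTextCleaner, and they cover only part of the steps (HTML, URL, email, hashtags).

```python
import re

# Illustrative patterns only; the real VietnameseTextCleaner may use different ones.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HASHTAG_RE = re.compile(r"#\w+")
SPACES_RE = re.compile(r"\s+")

def clean_text(text):
    """Remove HTML tags, URLs, emails and hashtags, then collapse whitespace."""
    # URLs are removed before hashtags so that '#fragment' parts of links go with them.
    for pattern in (HTML_RE, URL_RE, EMAIL_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()

print(clean_text("<b>Sữa hạt sen ngon</b> #food #yummy xem https://example.com"))
```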

(b) VietnameseToneNormalizer: Normalize Unicode (e.g., a decomposed 'ờ' != a precomposed 'ờ', even though both render identically) and sentence typing (e.g., lựơng => lượng, thỏai mái => thoải mái).
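The Unicode part of this step can be illustrated with Python's standard `unicodedata` module. This is a sketch that assumes NFC is the target form; the actual VietnameseToneNormalizer may normalize differently, and the sentence-typing fix (moving tone marks to the right vowel) is not covered here.

```python
import unicodedata

decomposed = "o\u031b\u0300"  # 'o' + combining horn + combining grave accent
composed = "\u1edd"           # precomposed 'ờ' as a single code point

# Both strings render as the same character but compare as different.
print(decomposed == composed, len(decomposed), len(composed))

# After NFC normalization the two spellings become identical.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```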

(c) VietnameseTextPreprocessor:

Combine the above classes and add the following steps to the pipeline:

  • normalize_teencodes(text: str):
    • Convert teencodes to their original forms.
    • I also provided the extra_teencodes parameter so you can add your own teencode definitions for the dataset at hand. extra_teencodes must be a dict whose keys are the original forms and whose values are lists of teencodes.
    • Be careful with single-word teencode replacements, because they can cause misinterpretation. For example, 'giá': ['price', 'gia'] would replace the word 'gia' in 'gia đình', turning it into 'giá đình'.
  • correct_vietnamese_errors(texts: List):
    • Use the pre-trained model by bmd1905 to correct Vietnamese errors.
    • Inference with this model is quite slow, so this method processes texts in batches. That is why you should pass a list of texts as input.
  • word_segment(text: str):
    • Use VnCoreNLP for Vietnamese word segmentation.
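The teencode pitfall described above can be reproduced with a small whole-word replacement sketch. The function below is a hypothetical stand-in written for illustration; it mirrors the dict format of extra_teencodes but is not the project's normalize_teencodes implementation.

```python
import re

def normalize_teencodes(text, teencodes):
    """Replace whole-word teencodes with their original forms.

    `teencodes` maps each original form to a list of teencodes,
    the same shape as the extra_teencodes parameter.
    """
    lookup = {code: original for original, codes in teencodes.items() for code in codes}
    if not lookup:
        return text
    # Longest codes first so that longer alternatives win over their prefixes.
    alternation = "|".join(re.escape(code) for code in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return pattern.sub(lambda match: lookup[match.group(0)], text)

# Unambiguous teencodes behave as expected.
print(normalize_teencodes("ks sạch, nv thân thiện", {"khách sạn": ["ks"], "nhân viên": ["nv"]}))

# A single-word teencode that is also a real word corrupts the text.
print(normalize_teencodes("gia đình tôi", {"giá": ["gia"]}))
```

Matching on word boundaries avoids replacing substrings inside longer words, but it cannot tell a teencode from a legitimate word with the same spelling, which is why such entries are best left out of the dictionary.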
Example Usage
from processors.vietnamese_processor import VietnameseTextPreprocessor
extra_teencodes = { 
    'khách sạn': ['ks'], 'nhà hàng': ['nhahang'], 'nhân viên': ['nv'],
    'cửa hàng': ['store', 'sop', 'shopE', 'shop'], 
    'sản phẩm': ['sp', 'product'], 'hàng': ['hàg'],
    'giao hàng': ['ship', 'delivery', 'síp'], 'đặt hàng': ['order'], 
    'chuẩn chính hãng': ['authentic', 'aut', 'auth'], 'hạn sử dụng': ['date', 'hsd'],
    'điện thoại': ['dt'],  'facebook': ['fb', 'face'],  
    'nhắn tin': ['nt', 'ib'], 'trả lời': ['tl', 'trl', 'rep'], 
    'feedback': ['fback', 'fedback'], 'sử dụng': ['sd'], 'xài': ['sài'], 
}

preprocessor = VietnameseTextPreprocessor(vncorenlp_dir='./VnCoreNLP', extra_teencodes=extra_teencodes, max_correction_length=512)
sample_texts = [
    'Ga giường không sạch, nhân viên quên dọn phòng một ngày. Chất lựơng "ko" đc thỏai mái 😔',
    'Cám ơn Chudu24 rất nhiềuGia đình tôi có 1 kỳ nghỉ vui vẻ.Resort Bình Minh nằm ở vị trí rất đẹp, theo đúng tiêu chuẩn, còn về ăn sáng thì wa dở, chỉ có 2,3 món để chọn',
    'Giá cả hợp líĂn uống thoả thíchGiữ xe miễn phíKhông gian bờ kè thoáng mát Có phòng máy lạnhMỗi tội lúc quán đông thì đợi hơi lâu',
    'May lần trước ăn mì k hà, hôm nay ăn thử bún bắp bò. Có chả tôm viên ăn lạ lạ. Tôm thì k nhiều, nhưng vẫn có tôm thật ở nhân bên trong. ',
    'Ngồi ăn Cơm nhà *tiền thân là quán Bão* Phần vậy là 59k nha. Trưa từ 10h-14h, chiều từ 16h-19h. À,có sữa hạt sen ngon lắmm. #food #foodpic #foodporn #foodholic #yummy #deliciuous'
]
preprocessed_texts = preprocessor.process_batch(sample_texts, correct_errors=True)
preprocessor.close_vncorenlp()
print(preprocessed_texts)

IV. Model Development

According to the original BERT paper, the best feature-based results were obtained by concatenating the last 4 hidden layers of BERT. So we applied that method to the PhoBERT layer in our model architectures and combined it with the 2 output constructions below, ACSA-v1 and ACSA-v2, to form the final solutions.
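To make the concatenation concrete, here is a framework-free sketch using nested lists in place of tensors. The `concat_last_layers` helper is hypothetical; in practice the per-layer hidden states would come from the PhoBERT model (for example by requesting them with `output_hidden_states=True` in Hugging Face Transformers) and be concatenated along the feature axis.

```python
def concat_last_layers(hidden_states, n_last=4):
    """Concatenate the feature vectors of the last `n_last` layers for every token.

    `hidden_states` is a list of layers, each shaped [seq_len][hidden_size].
    The result is shaped [seq_len][n_last * hidden_size].
    """
    selected = hidden_states[-n_last:]
    seq_len = len(selected[0])
    return [
        [value for layer in selected for value in layer[token]]
        for token in range(seq_len)
    ]

# Toy example: 5 layers, 2 tokens, hidden size 3; layer k is filled with the value k.
toy_states = [[[float(k)] * 3 for _ in range(2)] for k in range(5)]
features = concat_last_layers(toy_states)
print(len(features), len(features[0]))  # 2 tokens, 4 * 3 = 12 features each
```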

ACSA-v1. Multi-task Approach:

👉 Notebook Solutions: Hotel-v1.ipynb, Restaurant-v1.ipynb

1. Output Construction

We transformed the Aspect#Category pairs and their corresponding Polarity labels in each review into a list of C one-hot vectors, where C is the number of Aspect#Category pairs:

  • Each vector has 3 polarity labels (Positive, Negative, Neutral) and 1 None label indicating that the review does not mention this Aspect#Category and therefore has no polarity for it. The label that applies is set to 1 and the others to 0.
  • Therefore, we need C Dense layers with 4 neurons each to predict the polarity of the corresponding Aspect#Category pair.
  • The Softmax function is applied here to get the probability distribution over the 4 polarity classes.

However, we will not simply feedforward the learned feature to each Dense layer one-by-one. Instead, we will concatenate them into a single Dense layer consisting of:

  • 34 Aspect#Categories × 4 Polarities = 136 neurons for the Hotel domain.
  • 12 Aspect#Categories × 4 Polarities = 48 neurons for the Restaurant domain.

Finally, the binary_crossentropy loss function is applied to the final concatenated Dense layer, treating each of its outputs as a binary classification problem.
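The target construction described above can be sketched as follows. The index order inside each 4-neuron block (None first, then Positive, Negative, Neutral) is an assumption chosen to match the 0/1/2/3 values in the *.csv files, and `encode_labels` is a hypothetical helper.

```python
NUM_CLASSES = 4  # assumed order: 0 = None, 1 = Positive, 2 = Negative, 3 = Neutral

def encode_labels(values):
    """One-hot encode each Aspect#Category value and flatten the C blocks."""
    flat = []
    for value in values:
        block = [0] * NUM_CLASSES
        block[value] = 1
        flat.extend(block)
    return flat

# Three hypothetical Aspect#Category columns: Positive, absent, Negative.
encoded = encode_labels([1, 0, 2])
print(encoded)  # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]

# 34 Aspect#Categories give a 136-dimensional target for the Hotel domain.
print(len(encode_labels([0] * 34)))  # 136
```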

2. Why use one-hot encoding and Softmax?

In this ACSA problem, each Aspect#Category,Polarity can represent an independent binary classification task (Is this Aspect#Category Positive or not?, Is this Aspect#Category Negative or not?, etc.).

So you might wonder why, instead of treating each Aspect#Category,Polarity as a separate output neuron with Sigmoid, we one-hot encoded them within a single 4-neuron block per Aspect#Category and used Softmax. The key issue here is that the polarities within an Aspect#Category are not entirely independent. For example:

  • If the Aspect#Category is strongly Positive, it's less likely to be Negative or Neutral.
  • If the Aspect#Category is very Negative, it's less likely to be Positive or Neutral.

Using separate Sigmoids doesn't inherently capture this relationship. You could end up with outputs like: Positive=0.9, Negative=0.8, Neutral=0.7. This doesn't make sense because the polarities should be mutually exclusive and the sum of the probabilities should be 1, which is what Softmax does.
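The difference is easy to check numerically. In this sketch the logits are made-up values chosen so that the Sigmoid outputs land near the 0.9 / 0.8 / 0.7 figures mentioned above; the class order is likewise an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for Positive, Negative, Neutral, None.
logits = [2.2, 1.4, 0.85, -1.0]

independent = [sigmoid(x) for x in logits]  # roughly 0.9, 0.8, 0.7 for the first three
exclusive = softmax(logits)

print(sum(independent))  # well above 1: the scores do not compete with each other
print(sum(exclusive))    # 1 up to floating-point rounding
```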

3. Why concat each Aspect#Category into 1 Dense layer and apply binary_crossentropy?

The Concatenation mixes the independent Aspect#Category,Polarity information and allows the network to learn complex/shared relationships between them. For example, if the model sees that HOTEL#CLEANLINESS is Positive, it might be more likely to predict HOTEL#QUALITY as Positive as well.

When using this Concatenation, the binary_crossentropy will be applied to each output independently and the Softmax constraint is maintained during forward and backward passes for each Aspect#Category. This approach not only allows the model to learn to predict multiple Aspect#Category,Polarity simultaneously as binary classification problems but also maintains the mutual exclusivity of 4 polarities within each Aspect#Category.
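A small numeric sketch of this combination: Softmax is applied per 4-neuron block, the blocks are concatenated, and binary cross-entropy is averaged over all outputs. The helper names, logits, and class order are illustrative assumptions and not code from the notebooks.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blockwise_softmax(logits, block_size=4):
    """Apply Softmax to each consecutive block, then concatenate the results."""
    probs = []
    for start in range(0, len(logits), block_size):
        probs.extend(softmax(logits[start:start + block_size]))
    return probs

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all outputs, with clipping for stability."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)
        total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
    return total / len(y_true)

# Two hypothetical blocks (order: None, Positive, Negative, Neutral):
# the first Aspect#Category is Positive, the second is absent.
y_true = [0, 1, 0, 0, 1, 0, 0, 0]
logits = [0.1, 3.0, -1.0, 0.2, 2.5, 0.0, -0.5, 0.3]
y_pred = blockwise_softmax(logits)

print(sum(y_pred[:4]), sum(y_pred[4:]))  # each block still sums to 1
print(binary_crossentropy(y_true, y_pred))
```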

Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/04/22/MultitaskLearning.html

ACSA-v2. Multi-task with Multi-branch Approach:

👉 Notebook Solutions: Hotel-v2.ipynb, Restaurant-v2.ipynb

The only difference from the approach above is that this one branches into many sub-models by using C Dense layers (34 for Hotel and 12 for Restaurant) without concatenating them into a single layer. Each branch predicts its own task independently, without sharing parameters with the others.

The Softmax function is applied here to get the probability distribution over the 4 polarity classes directly without converting them into one-hot vectors. Therefore, the categorical_crossentropy loss function will be used to treat each Dense layer as a multi-class classification problem.
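The per-branch loss and decoding can be sketched as below. Targets are given as class indices for brevity, the class order is an assumption, and the branch losses are simply summed here; none of the helper names come from the notebooks.

```python
import math

CLASSES = ["none", "positive", "negative", "neutral"]  # assumed index order

def categorical_crossentropy(target_index, probs, eps=1e-7):
    """Cross-entropy for one branch: minus the log-probability of the true class."""
    return -math.log(max(probs[target_index], eps))

def multi_branch_loss(target_indices, branch_probs):
    """Sum the losses of all branches, one branch per Aspect#Category."""
    return sum(categorical_crossentropy(t, p) for t, p in zip(target_indices, branch_probs))

def decode(names, branch_probs):
    """Keep the most likely class of each branch, skipping 'none'."""
    predictions = []
    for name, probs in zip(names, branch_probs):
        best = max(range(len(probs)), key=lambda i: probs[i])
        if CLASSES[best] != "none":
            predictions.append((name, CLASSES[best]))
    return predictions

names = ["ROOMS#CLEANLINESS", "SERVICE#GENERAL"]
branch_probs = [[0.05, 0.85, 0.05, 0.05], [0.70, 0.10, 0.10, 0.10]]

print(decode(names, branch_probs))
print(multi_branch_loss([1, 0], branch_probs))
```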

Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/05/05/MultitaskLearning_MultiBranch.html

V. Experimental Results

1. Evaluation on the VLSP 2018 ABSA Dataset

VLSP has their own Java evaluation script for their ACSA tasks. You have to prepare 2 files:

I already provided a script to run the evaluation for each domain and approach. You can run the following command to get the evaluation results:

source ./evaluators/vlsp_evaluate.sh
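
| Task | Method | Hotel Precision | Hotel Recall | Hotel F1-score | Restaurant Precision | Restaurant Recall | Restaurant F1-score |
|---|---|---|---|---|---|---|---|
| Aspect#Category | VLSP best submission | 76.00 | 66.00 | 70.00 | 79.00 | 76.00 | 77.00 |
| Aspect#Category | Bi-LSTM+CNN | 84.03 | 72.52 | 77.85 | 82.02 | 77.51 | 79.70 |
| Aspect#Category | BERT-based Hierarchical | - | - | 82.06 | - | - | 84.23 |
| Aspect#Category | Multi-task | 87.45 | 78.17 | 82.55 | 81.09 | 85.61 | 83.29 |
| Aspect#Category | Multi-task Multi-branch | 63.21 | 57.86 | 60.42 | 80.81 | 87.39 | 83.97 |
| Aspect#Category,Polarity | VLSP best submission | 66.00 | 57.00 | 61.00 | 62.00 | 60.00 | 61.00 |
| Aspect#Category,Polarity | Bi-LSTM+CNN | 76.53 | 66.04 | 70.90 | 66.66 | 63.00 | 64.78 |
| Aspect#Category,Polarity | BERT-based Hierarchical | - | - | 74.69 | - | - | 71.30 |
| Aspect#Category,Polarity | Multi-task | 81.90 | 73.22 | 77.32 | 69.66 | 73.54 | 71.55 |
| Aspect#Category,Polarity | Multi-task Multi-branch | 57.55 | 52.67 | 55.00 | 68.69 | 74.29 | 71.38 |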
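As a quick sanity check on the table, the F1-score columns are the harmonic mean of the Precision and Recall columns. The snippet below recomputes two of the Aspect#Category entries; it is a standalone check, not the VLSP evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Multi-task, Hotel, Aspect#Category
print(round(f1_score(87.45, 78.17), 2))  # 82.55

# Multi-task, Restaurant, Aspect#Category
print(round(f1_score(81.09, 85.61), 2))  # 83.29
```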

2. Some Notes about the Results

The predictions in the experiments/predictions folder and the evaluation results in the evaluators folder were obtained from older models I trained a couple of years ago.

I finished the paper on this project in 2021, so the results above come from the experiments I conducted at that time, which can be found at commit e8439bc. Some things to note if you want to re-run the notebooks in that commit to reproduce them:

  • You can download the weights for each model here.
  • As the notebooks in this commit are deprecated, you may run into issues when running them. For example, calling the create_model function raises the following error when the input layer is initialized:
<class 'keras.src.backend.common.keras_tensor.KerasTensor'> is not allowed only (<class 'tensorflow.python.framework.tensor.Tensor'> ...)

This error occurs because the PhoBERT model in the current huggingface version does not support the KerasTensor input used by the notebooks' version of TensorFlow/Keras. There are 2 ways to fix this:

  • Downgrade TensorFlow to roughly the version I used for this project, around 2.10.
  • Use TensorFlow's subclassing API by creating your own model class that inherits from keras.Model. This is how I fixed the issue in the latest update.