- For many e-commerce companies (e.g., Amazon), address-related information can be very informative and can be harnessed to build more accurate geocoding, which in turn enables faster and more efficient shipping. Here I work with real-world data provided by Shopee, the leading online shopping platform in Southeast Asia. Shopee is interested in the POI (point of interest) and street name for each customer, but the address-related information they receive is usually unstructured, free-text. Here is an example (note that we are working with Indonesian here).
- To solve this problem, I fine-tuned an Indonesian BERT model with the Hugging Face Transformers library to perform Named Entity Recognition (i.e., token classification). Here I summarize the major steps I took; each step is elaborated in the following sections.
- Used the IOBES annotation scheme to label each word (a sketch of the label set follows this list).
- Tokenized text inputs and aligned labels using a pre-trained BERT tokenizer for Indonesian.
- Added a token classification head and fine-tuned both the body and the head.
- Made predictions on unlabeled, unstructured addresses and reconstructed words from tokens.
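As a minimal sketch, the IOBES label set for the two entity types (POI and street) could look like the snippet below; the exact label strings and their ordering are assumptions here, not taken from the competition data.

```python
# IOBES scheme for the two entity types: B-/I-/E- mark the beginning, inside,
# and end of a multi-word entity, S- marks a single-word entity, and O marks
# words outside any entity. The exact label strings are assumptions.
label_list = [
    "O",
    "B-POI", "I-POI", "E-POI", "S-POI",
    "B-STR", "I-STR", "E-STR", "S-STR",
]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

# Example: the street "jalan tipar cakung" would be labeled B-STR, I-STR, E-STR.
```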
- Given a raw, unstructured address: jalan tipar cakung no 26 depan rusun albo garasi dumtruk
- Tokens: using the pre-trained tokenizer, the raw address was split into sub-word tokens. [CLS] and [SEP] are automatically added to the start and end of the sequence.
- Tokens_ID: integers that map each token to the vocabulary of the pre-trained model.
- Word IDs: specify which word each token belongs to. For example, both the token `tip` and the token `##ar` have word ID 1, meaning that these two tokens come from the same word, and that word is the second in the sequence.
- Labels: specify the named entity of each token. For example, the token `jalan` has the label `B-STR`, meaning it is the beginning of a street entity; the token `##o` has the label `E-POI`, meaning it is the end of a POI entity.
- Labels_id: integer encoding of all labels (a short tokenization and label-alignment sketch follows this list).
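To make the walkthrough above concrete, here is a minimal sketch of the tokenization and label alignment. The checkpoint name `cahya/bert-base-indonesian-522M` is an assumption (any Indonesian BERT with a fast tokenizer behaves the same way), `label2id` comes from the label sketch above, and the alignment rule shown (every sub-word inherits its word's label, special tokens get -100) is one common strategy rather than necessarily the exact one used here.

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; a fast tokenizer is needed for word_ids().
tokenizer = AutoTokenizer.from_pretrained("cahya/bert-base-indonesian-522M")

address = "jalan tipar cakung no 26 depan rusun albo garasi dumtruk"
encoding = tokenizer(address)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # sub-word tokens, incl. [CLS]/[SEP]
print(encoding["input_ids"])                                   # Tokens_ID
print(encoding.word_ids())                                     # word ID of each token (None for [CLS]/[SEP])

def align_labels(word_labels, word_ids, label2id):
    """Map word-level labels to sub-word tokens; -100 is ignored by the loss."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:          # [CLS], [SEP], padding
            aligned.append(-100)
        else:                        # every sub-word inherits its word's label
            aligned.append(label2id[word_labels[word_id]])
    return aligned
```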
- The model was trained with mini-batch gradient descent. Each mini-batch was prefetched and padded to the longest sequence in the batch using a data collator.
- The inputs of the model are `Tokens_ID` and `Labels_id`, along with the attention mask for each sequence.
- The model was trained with Adam and an additional learning-rate decay schedule.
- With more training epochs, the training loss kept dropping but the validation loss started to increase, indicating overfitting. Thus, the model was restored to the weights obtained after the third epoch (a training sketch follows below).
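A sketch of the training setup under the assumptions above: the Hugging Face `Trainer` with `DataCollatorForTokenClassification` for dynamic per-batch padding and its default AdamW optimizer with a linearly decaying learning rate. The checkpoint name, batch size, learning rate, and the `train_dataset`/`val_dataset` variables are assumptions for illustration, not the exact values used.

```python
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Body (pre-trained Indonesian BERT) plus a randomly initialized token classification head.
model = AutoModelForTokenClassification.from_pretrained(
    "cahya/bert-base-indonesian-522M",   # assumed checkpoint
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

# Pads each mini-batch to the longest sequence in that batch (dynamic padding).
data_collator = DataCollatorForTokenClassification(tokenizer)

args = TrainingArguments(
    output_dir="address-ner",
    per_device_train_batch_size=32,      # assumed batch size
    num_train_epochs=3,                  # weights after epoch 3 were kept in the write-up
    learning_rate=5e-5,                  # assumed
    lr_scheduler_type="linear",          # AdamW with a linearly decaying learning rate
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # restore the least over-fit checkpoint
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # tokenized dataset with aligned labels (assumed to exist)
    eval_dataset=val_dataset,
    data_collator=data_collator,
)
trainer.train()
```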
- It is shown below that before fine-tuning, the model makes essentially random predictions for the token labels, with a high loss. After fine-tuning, the model accurately predicts each token's label. Note that the loss is computed from each token's logits, so even when the predicted label is correct the loss may still be non-zero. A minimal inference and word-reconstruction sketch follows.
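The sketch below reuses `tokenizer`, `model`, and `id2label` from the earlier snippets; taking the first sub-word's predicted label for each word and gluing `##` pieces back together are assumed conventions, not necessarily the exact post-processing used.

```python
import torch

def predict_address(text):
    """Run the fine-tuned model on a raw address and map predictions back to words."""
    encoding = tokenizer(text, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**encoding).logits            # shape: (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0].tolist()

    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
    word_ids = encoding.word_ids()

    words, labels = [], []
    prev_word_id = None
    for token, word_id, pred in zip(tokens, word_ids, pred_ids):
        if word_id is None:                          # skip [CLS] / [SEP]
            continue
        if word_id != prev_word_id:                  # first sub-word of a new word
            words.append(token)
            labels.append(id2label[pred])            # keep the first sub-word's label (assumed rule)
        else:                                        # continuation: glue "##xx" pieces back on
            words[-1] += token[2:] if token.startswith("##") else token
        prev_word_id = word_id
    return list(zip(words, labels))

print(predict_address("jalan tipar cakung no 26 depan rusun albo garasi dumtruk"))
```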