- For many e-commerce companies (e.g., Amazon), address-related information can be very informative and can be harnessed to build more accurate geocoding, which in turn enables faster and more efficient shipping. Here I work with real-world data provided by Shopee, the leading online shopping platform in Southeast Asia. Shopee is interested in the POI (point of interest) and street name for each customer, but the address-related information they receive is usually unstructured, free-text. Here is an example (note that we are working with Indonesian here).
- To solve this problem, I fine-tuned an Indonesian BERT model with the Hugging Face Transformers library to perform Named Entity Recognition (i.e., token classification). Here I summarize the major steps I took; each step is elaborated in the following sections.
- Used the IOBES annotation scheme to label each word (a sketch of the label set follows this list).
- Tokenized text inputs and aligned labels using a pre-trained BERT tokenizer for Indonesian.
- Added a token classification head and fine-tuned both the body and the head.
- Made predictions on unlabeled, unstructured addresses and reconstructed words from tokens.
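As a minimal sketch, the IOBES label set for the two entity types (POI and street) could look like the snippet below; the exact label strings and their ordering are assumptions here, not taken from the competition data.

```python
# IOBES scheme for the two entity types: B-/I-/E- mark the beginning, inside,
# and end of a multi-word entity, S- marks a single-word entity, and O marks
# words outside any entity. The exact label strings are assumptions.
label_list = [
    "O",
    "B-POI", "I-POI", "E-POI", "S-POI",
    "B-STR", "I-STR", "E-STR", "S-STR",
]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

# Example: the street "jalan tipar cakung" would be labeled B-STR, I-STR, E-STR.
```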
- Given a raw, unstructured address: jalan tipar cakung no 26 depan rusun albo garasi dumtruk
- Tokens: using the pre-trained tokenizer, the raw address was split into sub-word tokens. [CLS] and [SEP] are automatically added to the start and end of the sequence.
- Tokens_ID: integers that map each token to the vocabulary of the pre-trained model.
- Word IDs: specify which word each token belongs to. For example, both the token `tip` and the token `##ar` have word ID 1, meaning that these two tokens come from the same word, and that word is the second in the sequence.
- Labels: specify the named entity of each token. For example, the token `jalan` has the label `B-STR`, meaning it is the beginning of a street entity; the token `##o` has the label `E-POI`, meaning it is the end of a POI entity.
- Labels_id: integer encoding of all labels (a short tokenization and label-alignment sketch follows this list).
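To make the walkthrough above concrete, here is a minimal sketch of the tokenization and label alignment. The checkpoint name `cahya/bert-base-indonesian-522M` is an assumption (any Indonesian BERT with a fast tokenizer behaves the same way), `label2id` comes from the label sketch above, and the alignment rule shown (every sub-word inherits its word's label, special tokens get -100) is one common strategy rather than necessarily the exact one used here.

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; a fast tokenizer is needed for word_ids().
tokenizer = AutoTokenizer.from_pretrained("cahya/bert-base-indonesian-522M")

address = "jalan tipar cakung no 26 depan rusun albo garasi dumtruk"
encoding = tokenizer(address)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # sub-word tokens, incl. [CLS]/[SEP]
print(encoding["input_ids"])                                   # Tokens_ID
print(encoding.word_ids())                                     # word ID of each token (None for [CLS]/[SEP])

def align_labels(word_labels, word_ids, label2id):
    """Map word-level labels to sub-word tokens; -100 is ignored by the loss."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:          # [CLS], [SEP], padding
            aligned.append(-100)
        else:                        # every sub-word inherits its word's label
            aligned.append(label2id[word_labels[word_id]])
    return aligned
```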
- The model was trained with mini-batch gradient descent. Each mini-batch was prefetched and padded to the longest sequence in the batch using a data collator.
- The inputs of the model are `Tokens_ID` and `Labels_id`, along with the attention mask for each sequence.
- The model was trained with Adam and an additional learning-rate decay schedule.
- With more training epochs, the training loss kept dropping but the validation loss started to increase, indicating overfitting. Thus, the model was restored to the weights obtained after the third epoch (a training sketch follows below).
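A sketch of the training setup under the assumptions above: the Hugging Face `Trainer` with `DataCollatorForTokenClassification` for dynamic per-batch padding and its default AdamW optimizer with a linearly decaying learning rate. The checkpoint name, batch size, learning rate, and the `train_dataset`/`val_dataset` variables are assumptions for illustration, not the exact values used.

```python
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Body (pre-trained Indonesian BERT) plus a randomly initialized token classification head.
model = AutoModelForTokenClassification.from_pretrained(
    "cahya/bert-base-indonesian-522M",   # assumed checkpoint
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

# Pads each mini-batch to the longest sequence in that batch (dynamic padding).
data_collator = DataCollatorForTokenClassification(tokenizer)

args = TrainingArguments(
    output_dir="address-ner",
    per_device_train_batch_size=32,      # assumed batch size
    num_train_epochs=3,                  # weights after epoch 3 were kept in the write-up
    learning_rate=5e-5,                  # assumed
    lr_scheduler_type="linear",          # AdamW with a linearly decaying learning rate
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # restore the least over-fit checkpoint
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # tokenized dataset with aligned labels (assumed to exist)
    eval_dataset=val_dataset,
    data_collator=data_collator,
)
trainer.train()
```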
- It is shown below that before fine-tuning, the model makes essentially random predictions for the token labels, with a high loss. After fine-tuning, the model accurately predicts each token's label. Note that the loss is computed from each token's logits, so even when the predicted label is correct the loss may still be non-zero. A minimal inference and word-reconstruction sketch follows.
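The sketch below reuses `tokenizer`, `model`, and `id2label` from the earlier snippets; taking the first sub-word's predicted label for each word and gluing `##` pieces back together are assumed conventions, not necessarily the exact post-processing used.

```python
import torch

def predict_address(text):
    """Run the fine-tuned model on a raw address and map predictions back to words."""
    encoding = tokenizer(text, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**encoding).logits            # shape: (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0].tolist()

    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
    word_ids = encoding.word_ids()

    words, labels = [], []
    prev_word_id = None
    for token, word_id, pred in zip(tokens, word_ids, pred_ids):
        if word_id is None:                          # skip [CLS] / [SEP]
            continue
        if word_id != prev_word_id:                  # first sub-word of a new word
            words.append(token)
            labels.append(id2label[pred])            # keep the first sub-word's label (assumed rule)
        else:                                        # continuation: glue "##xx" pieces back on
            words[-1] += token[2:] if token.startswith("##") else token
        prev_word_id = word_id
    return list(zip(words, labels))

print(predict_address("jalan tipar cakung no 26 depan rusun albo garasi dumtruk"))
```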