English | 简体中文
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performance. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement throughout the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction, document image classification, and document question answering datasets.
This work was accepted by EMNLP 2022 (Findings). To broaden the commercial applications of document intelligence, we release the multilingual ERNIE-Layout model through PaddleNLP.
🧾 A HuggingFace web demo is available here
- Invoice VQA
- Poster VQA
- WebPage VQA
- Table VQA
- Exam Paper VQA
- English invoice VQA with multilingual (Chinese, English, Japanese, Thai, Spanish, Russian) prompts
- Chinese invoice VQA with multilingual (Simplified Chinese, Traditional Chinese, English, Japanese, French) prompts
- Demo images are available here
- Input Format
[
{"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]},
{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]
PaddleOCR is used by default; you can also supply your own OCR results via word_boxes. The expected data format is List[str, List[float, float, float, float]].
[
{"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
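For illustration, here is a minimal sketch of supplying custom OCR results. The file name, words, and coordinates are hypothetical, and each box is assumed to follow the [x1, y1, x2, y2] convention of the four floats described above.

```python
from paddlenlp import Taskflow

# Create the Taskflow as in the usage examples below.
docprompt = Taskflow("document_intelligence", lang="en")

# Hypothetical OCR output: one [word, box] pair per recognized token.
# Each box is assumed to be [x1, y1, x2, y2] coordinates.
word_boxes = [
    ["Invoice", [72.0, 40.0, 180.0, 68.0]],
    ["Total:", [72.0, 320.0, 130.0, 344.0]],
    ["$128.00", [140.0, 320.0, 230.0, 344.0]],
]

docprompt([
    {"doc": "./invoice.png",
     "prompt": ["What is the total amount?"],
     "word_boxes": word_boxes},
])
```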
- Supports single and batch input
- Image from http link
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence", lang="en")
>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}]))
[{'prompt': "What is the name of the author of 'The Adventure Zone: The "
            'Crystal Kingdom’?',
  'result': [{'end': 39,
              'prob': 0.99,
              'start': 22,
              'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis '
                       'McElroy'}]},
 {'prompt': 'What type of book cover does The Adventure Zone: The Crystal '
            'Kingdom have?',
  'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]},
 {'prompt': 'For Rage, who is the author listed as?',
  'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}]
- Image from local path
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence")
>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
[{'prompt': '五百丁本次想要担任的是什么职位?',
  'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
 {'prompt': '五百丁是在哪里上的大学?',
  'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
 {'prompt': '大学学的是什么专业?',
  'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]
- Parameter Description
  - batch_size: the number of inputs per batch; defaults to 1.
  - lang: the PaddleOCR language; en works better for English images; defaults to ch.
  - topn: return the n results with the highest probability; defaults to 1.
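Putting the parameters together, a fully configured Taskflow instance might look like the following (the values are illustrative, not recommendations):

```python
from paddlenlp import Taskflow

docprompt = Taskflow(
    "document_intelligence",
    batch_size=4,  # process 4 inputs per batch (default: 1)
    lang="en",     # use PaddleOCR's English model (default: "ch")
    topn=3,        # return the 3 highest-probability answers (default: 1)
)
```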
- Dataset

| Dataset | Task | Language | Note |
| --- | --- | --- | --- |
| FUNSD | Key Information Extraction | English | - |
| XFUND-ZH | Key Information Extraction | Chinese | - |
| DocVQA-ZH | Document Question Answering | Chinese | Submission to the DocVQA-ZH competition is now closed, so we split the original dataset into three parts for model evaluation: 4,187 training images, 500 validation images, and 500 test images. |
| RVL-CDIP (sampled) | Document Image Classification | English | The original RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because the full dataset is large and slow to train on, we downsampled it; the sampled dataset consists of 6,400 training images, 800 validation images, and 800 test images. |
- Results

| Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
| --- | --- | --- | --- | --- |
| LayoutXLM-Base | 86.72 | 90.88 | 86.24 | 66.01 |
| ERNIE-LayoutX-Base | 89.31 | 90.29 | 88.58 | 69.57 |
- Evaluation Methods
  - Hyperparameters for all the tasks above were tuned with grid search. The evaluation step interval for FUNSD and XFUND-ZH is 100 and the metric is F1 score; for RVL-CDIP the interval is 2,000 and the metric is accuracy; for DocVQA-ZH the interval is 10,000 and the metric is ANLS.
- Hyper Parameter search ranges

| Hyper Parameter | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
| --- | --- | --- | --- | --- |
| learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 |
| batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 |
| warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 |

The lr_scheduler_type for FUNSD and XFUND-ZH is constant, so warmup_ratio is not searched for them. max_steps is used for fine-tuning on FUNSD and XFUND-ZH (10,000 and 20,000 steps respectively), while num_train_epochs is set to 6 for DocVQA-ZH and 20 for RVL-CDIP.
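As a sketch of what the grid search amounts to, the snippet below enumerates the FUNSD grid from the table above; the printed launch line is abbreviated and illustrative, and the real runs use the full run_ner.py command shown in the FUNSD Train section below.

```python
from itertools import product

# FUNSD search grid from the table above. lr_scheduler_type is
# constant for FUNSD, so warmup_ratio is not part of the grid.
learning_rates = [5e-6, 1e-5, 2e-5, 5e-5]
batch_sizes = [1, 2, 4]

for lr, bs in product(learning_rates, batch_sizes):
    # Abbreviated, illustrative launch line; see the full command below.
    print(f"python -u run_ner.py --learning_rate {lr} "
          f"--per_device_train_batch_size {bs} --max_steps 10000 ...")
```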
- Best Hyper Parameters (each cell lists learning_rate, batch_size, warmup_ratio)

| Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
| --- | --- | --- | --- | --- |
| LayoutXLM-Base | 1e-5, 2, - | 1e-5, 8, 0.1 | 1e-5, 2, - | 2e-5, 8, 0.1 |
| ERNIE-LayoutX-Base | 2e-5, 4, - | 1e-5, 8, 0.05 | 1e-5, 4, - | 2e-5, 8, 0.05 |
- Installation
pip install -r requirements.txt
- FUNSD Train
python -u run_ner.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
--dataset_name funsd \
--do_train \
--do_eval \
--max_steps 10000 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern ner-bio \
--preprocessing_num_workers 4 \
--overwrite_cache false \
--use_segment_box \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--gradient_accumulation_steps 1 \
--seed 1000 \
--metric_for_best_model eval_f1 \
--greater_is_better true \
--overwrite_output_dir
- XFUND-ZH Train
python -u run_ner.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \
--dataset_name xfund_zh \
--do_train \
--do_eval \
--lang "ch" \
--max_steps 20000 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern ner-bio \
--preprocessing_num_workers 4 \
--overwrite_cache false \
--use_segment_box \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 1e-5 \
--lr_scheduler_type constant \
--gradient_accumulation_steps 1 \
--seed 1000 \
--metric_for_best_model eval_f1 \
--greater_is_better true \
--overwrite_output_dir
- DocVQA-ZH Train
python3 -u run_mrc.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \
--dataset_name docvqa_zh \
--do_train \
--do_eval \
--lang "ch" \
--num_train_epochs 6 \
--lr_scheduler_type linear \
--warmup_ratio 0.05 \
--weight_decay 0 \
--eval_steps 10000 \
--save_steps 10000 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern "mrc" \
--use_segment_box false \
--return_entity_level_metrics false \
--overwrite_cache false \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 2e-5 \
--preprocessing_num_workers 32 \
--train_nshard 16 \
--seed 1000 \
--metric_for_best_model anls \
--greater_is_better true \
--overwrite_output_dir
- RVL-CDIP Train
python3 -u run_cls.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \
--dataset_name rvl_cdip_sampled \
--do_train \
--do_eval \
--num_train_epochs 20 \
--lr_scheduler_type linear \
--max_seq_length 512 \
--warmup_ratio 0.05 \
--weight_decay 0 \
--eval_steps 2000 \
--save_steps 2000 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern "cls" \
--use_segment_box \
--return_entity_level_metrics false \
--overwrite_cache false \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-5 \
--preprocessing_num_workers 32 \
--train_nshard 16 \
--seed 1000 \
--metric_for_best_model acc \
--greater_is_better true \
--overwrite_output_dir
After fine-tuning, you can also export the inference model via the Model Export Script; the inference model will be saved in the output_path you specify.
- Export the model fine-tuned on FUNSD
python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export
- Export the model fine-tuned on DocVQA-ZH
python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export
- Export the model fine-tuned on RVL-CDIP(sampled)
python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export
- Parameter Description
  - model_path: the save directory of the dynamic graph model parameters; defaults to "./checkpoint/".
  - output_path: the save directory of the static graph model parameters; defaults to "./export".
- Directory

export/
├── inference.pdiparams
├── inference.pdiparams.info
└── inference.pdmodel
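As a rough sketch, the exported static graph can be loaded with the Paddle Inference Python API as follows. The paths assume the NER export above; task-specific pre- and post-processing are omitted and are covered by the deploy guide referenced below.

```python
import paddle.inference as paddle_infer

# Load the exported static-graph model (paths from the NER export above).
config = paddle_infer.Config(
    "./ner_export/inference.pdmodel",
    "./ner_export/inference.pdiparams",
)
predictor = paddle_infer.create_predictor(config)

# Input/output handles; feeding real inputs requires the same
# preprocessing used at training time (see the deploy guide).
input_names = predictor.get_input_names()
output_names = predictor.get_output_names()
```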
We provide deployment examples for Key Information Extraction, Document Question Answering, and Document Image Classification; please follow the ERNIE-Layout Python Deploy Guide.