[Paper List-2] Add 10 textrecog papers #1652

Open · wants to merge 1 commit into dev-1.x
@@ -0,0 +1,79 @@
Title: 'A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding'
Abbreviation: Qiao et al
Tasks:
- TextRecog
Venue: ICFHR
Year: 2022
Lab/Company:
- Tomorrow Advancing Life, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-21648-0_14'
Arxiv: N/A # no arXiv version found; a Google Books copy is at https://books.google.fr/books?hl=zh-CN&lr=&id=hvmdEAAAQBAJ&oi=fnd&pg=PA198&ots=Gg_BaAnXLm&sig=gpJ2h9NjKz1PjLWSfwDpyd8eLZE&redir_esc=y#v=onepage&q&f=false
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, the vision Transformer (ViT) has attracted more and more attention,
and many works introduce the ViT into concrete vision tasks and achieve impressive
performance. However, there are only a few works focused on the applications of
the ViT for scene text recognition. This paper takes a further step and proposes
a strong scene text recognizer with a fully ViT-based architecture.
Specifically, we introduce multi-grained features into both the encoder and
decoder. For the encoder, we adopt a two-stage ViT with different grained
patches, where the first stage extracts extent visual features with 2D
fine-grained patches and the second stage aims at the sequence of contextual
features with 1D coarse-grained patches. The decoder integrates Connectionist
Temporal Classification (CTC)-based and attention-based decoding, where the
two decoding schemes introduce different grained features into the decoder and
benefit from each other with a deep interaction. To improve the extraction of
fine-grained features, we additionally explore self-supervised learning for
text recognition with masked autoencoders. Furthermore, a focusing mechanism is
proposed to let the model target the pixel reconstruction of the text area. Our
proposed method achieves state-of-the-art or comparable accuracies on benchmarks
of scene text recognition with a faster inference speed and nearly 50% reduction
of parameters compared with other recent works.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210053998-385587ef-2b0e-4c9b-a8b8-d6171261c621.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.5
IIIT5K:
WAICS: 96.1
SVT:
WAICS: 92.3
IC13:
WAICS: 95.0
IC15:
WAICS: 86.0
SVTP:
WAICS: 87.0
CUTE:
WAICS: 86.8
Bibtex: '@inproceedings{qiao2022vision,
title={A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding},
author={Qiao, Zhi and Ji, Zhilong and Yuan, Ye and Bai, Jinfeng},
booktitle={International Conference on Frontiers in Handwriting Recognition},
pages={198--212},
year={2022},
organization={Springer}
}'
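
As a reading aid for this entry, here is a minimal PyTorch sketch of the two-stage, multi-grained encoding the abstract describes: a first ViT stage over 2D fine-grained patches, then a second stage over 1D coarse-grained column tokens. All module sizes, patch sizes, and names are illustrative assumptions, not the authors' implementation; the CTC and attention decoders that would consume the output are omitted.

```python
# Hedged sketch of two-stage multi-grained encoding (illustrative only).
import torch
import torch.nn as nn

class TwoStageViTEncoder(nn.Module):
    def __init__(self, dim=256, img_size=(32, 128)):
        super().__init__()
        h, _ = img_size
        # Stage 1: 2D fine-grained 4x4 patches keep local visual detail.
        self.fine_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.stage1 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 3)
        # Stage 2: fold each column of the feature map into one 1D
        # coarse-grained token, roughly a character-wide vertical slice.
        self.coarse_proj = nn.Linear(dim * (h // 4), dim)
        self.stage2 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 3)

    def forward(self, x):                          # x: (B, 3, 32, 128)
        f = self.fine_embed(x)                     # (B, D, 8, 32)
        B, D, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, D) 2D tokens
        tokens = self.stage1(tokens)
        f = tokens.transpose(1, 2).reshape(B, D, H, W)
        cols = f.permute(0, 3, 1, 2).reshape(B, W, D * H)  # 1D columns
        return self.stage2(self.coarse_proj(cols))         # (B, W, D)

enc = TwoStageViTEncoder()
feats = enc(torch.randn(2, 3, 32, 128))
# A CTC head and an attention decoder would both read `feats`, sharing
# the multi-grained features as the abstract describes.
print(feats.shape)  # torch.Size([2, 32, 256])
```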
72 changes: 72 additions & 0 deletions paper_zoo/textrecog/Levenshtein OCR.yaml
@@ -0,0 +1,72 @@
Title: 'Levenshtein OCR'
Abbreviation: Lev-OCR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_19'
Arxiv: 'https://arxiv.org/abs/2209.03594'
Paper Reading URL: 'https://mp.weixin.qq.com/s/Nuc8j3V5YeaXpY64SsIeCw'
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'A novel scene text recognizer based on Vision-Language Transformer
(VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the
proposed method (named Levenshtein OCR, and LevOCR for short) explores an
alternative way for automatically transcribing textual content from cropped
natural images. Specifically, we cast the problem of scene text recognition as
an iterative sequence refinement process. The initial prediction sequence
produced by a pure vision model is encoded and fed into a cross-modal
transformer to interact and fuse with the visual features, to progressively
approximate the ground truth. The refinement process is accomplished via two
basic character-level operations: deletion and insertion, which are learned with
imitation learning and allow for parallel decoding, dynamic length change and
good interpretability. The quantitative experiments clearly demonstrate that
LevOCR achieves state-of-the-art performances on standard benchmarks and the
qualitative analyses verify the effectiveness and advantage of the proposed
LevOCR algorithm. Code will be released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163468-bb6c14ba-134a-4dd5-881e-a7adb4058dcd.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.1
IIIT5K:
WAICS: 96.6
SVT:
WAICS: 92.9
IC13:
WAICS: 96.9
IC15:
WAICS: 86.4
SVTP:
WAICS: 88.1
CUTE:
WAICS: 91.7
Bibtex: '@inproceedings{da2022levenshtein,
title={Levenshtein OCR},
author={Da, Cheng and Wang, Peng and Yao, Cong},
booktitle={European Conference on Computer Vision},
year={2022},
organization={Springer}
}'
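
To make the refinement process in the LevOCR abstract concrete, here is a hedged, framework-free sketch of the deletion/insertion loop. The two policies below are toy stand-ins for the heads LevOCR learns with imitation learning on its cross-modal transformer; every name here is illustrative.

```python
# Hedged sketch of Levenshtein-style iterative sequence refinement:
# alternate deletion and insertion passes until the sequence stops changing.
from typing import Callable, List

def refine(seq: List[str],
           delete_policy: Callable[[List[str]], List[bool]],
           insert_policy: Callable[[List[str]], List[str]],
           max_iters: int = 10) -> List[str]:
    for _ in range(max_iters):
        prev = list(seq)
        # Deletion pass: keep only characters the policy marks as correct.
        keep = delete_policy(seq)
        seq = [c for c, k in zip(seq, keep) if k]
        # Insertion pass: gaps[i] is inserted before seq[i]
        # (gaps[len(seq)] goes at the end); '' means insert nothing.
        gaps = insert_policy(seq)
        out: List[str] = []
        for i, c in enumerate(seq):
            out += ([gaps[i]] if gaps[i] else []) + [c]
        if gaps[len(seq)]:
            out.append(gaps[len(seq)])
        seq = out
        if seq == prev:  # no deletion or insertion fired: converged
            break
    return seq

# Toy policies: delete any 'x', insert 'e' between 'h' and 'l'.
fixed = refine(list("hxllo"),
               lambda s: [c != 'x' for c in s],
               lambda s: ['e' if 0 < i < len(s) and s[i - 1] == 'h'
                          and s[i] == 'l' else ''
                          for i in range(len(s) + 1)])
print(''.join(fixed))  # -> hello
```

Because both passes act on all positions at once, decoding stays parallel, and the edit trace (what was deleted or inserted at each step) gives the interpretability the abstract mentions.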
@@ -0,0 +1,74 @@
Title: 'Multi-Granularity Prediction for Scene Text Recognition'
Abbreviation: MGP-STR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_20'
Arxiv: 'https://arxiv.org/abs/2209.03592'
Paper Reading URL: N/A
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has been an active research topic in
computer vision for years. To tackle this challenging problem, numerous
innovative methods have been successively proposed and incorporating linguistic
knowledge into STR models has recently become a prominent trend. In this work,
we first draw inspiration from the recent progress in Vision Transformer (ViT)
to construct a conceptually simple yet powerful vision STR model, which is built
upon ViT and outperforms previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e., subword representations (BPE and WordPiece)
widely-used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language model
(LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the
performance envelope of STR to an even higher level. Specifically, it achieves
an average recognition accuracy of 93.35% on standard benchmarks. Code will be
released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163378-fc11a79b-fb7d-4a3f-947e-a8f6dfd14dd2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.8
IIIT5K:
WAICS: 96.4
SVT:
WAICS: 94.7
IC13:
WAICS: 97.3
IC15:
WAICS: 87.2
SVTP:
WAICS: 91.0
CUTE:
WAICS: 90.3
Bibtex: '@inproceedings{wang2022multi,
title={Multi-granularity Prediction for Scene Text Recognition},
author={Wang, Peng and Da, Cheng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={339--355},
year={2022},
organization={Springer}
}'
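
A rough sketch of the multi-granularity prediction idea for this entry: three parallel heads read the same ViT features at character, BPE, and WordPiece granularity, and a simple confidence rule picks the final answer. The vocabulary sizes, the learned-query aggregation, and the fusion rule are all assumptions for illustration; the paper's actual aggregation and fusion details differ.

```python
# Hedged sketch of multi-granularity prediction (illustrative only).
import torch
import torch.nn as nn

class MGPHead(nn.Module):
    def __init__(self, dim: int, vocab: int, max_len: int):
        super().__init__()
        # The paper's token aggregation is simplified to learned queries.
        self.queries = nn.Parameter(torch.randn(max_len, dim))
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls = nn.Linear(dim, vocab)

    def forward(self, feats):                      # feats: (B, N, D)
        q = self.queries.expand(feats.size(0), -1, -1)
        agg, _ = self.attn(q, feats, feats)        # (B, max_len, D)
        return self.cls(agg).log_softmax(-1)       # per-position log-probs

dim, feats = 256, torch.randn(2, 257, 256)         # assumed ViT tokens
heads = {g: MGPHead(dim, v, 27) for g, v in
         [('char', 38), ('bpe', 50257), ('wordpiece', 30522)]}
logps = {g: h(feats) for g, h in heads.items()}
# Fuse by picking the granularity whose greedy path is most confident.
conf = {g: lp.max(-1).values.mean() for g, lp in logps.items()}
best = max(conf, key=lambda g: conf[g].item())
print(best, logps[best].argmax(-1).shape)          # chosen head, (2, 27)
```

The subword heads are what inject linguistic knowledge: BPE and WordPiece outputs can only form plausible subwords, so no separate language model is needed.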
@@ -0,0 +1,77 @@
Title: 'On Vocabulary Reliance in Scene Text Recognition'
Abbreviation: Wan et al
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- Megvii
- China University of Mining and Technology
- University of Rochester
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Wan_On_Vocabulary_Reliance_in_Scene_Text_Recognition_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2005.03959'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The pursuit of high performance on public benchmarks has been the
driving force for research in scene text recognition, and notable progress has
been achieved. However, a close investigation reveals a startling fact that the
state-of-the-art methods perform well on images with words within vocabulary but
generalize poorly to images with words outside vocabulary. We call this
phenomenon “vocabulary reliance”. In this paper, we establish an analytical
framework to conduct an in-depth study on the problem of vocabulary reliance
in scene text recognition. Key findings include: (1) Vocabulary reliance is
ubiquitous, i.e., all existing algorithms more or less exhibit such a
characteristic; (2) Attention-based decoders prove weak in generalizing to
words outside vocabulary and segmentation-based decoders perform well in
utilizing visual features; (3) Context modeling is highly coupled with the
prediction layers. These findings provide new insights and can benefit future
research in scene text recognition. Furthermore, we propose a simple yet
effective mutual learning strategy to allow models of two families
(attention-based and segmentation-based) to learn collaboratively. This remedy
alleviates the problem of vocabulary reliance and improves the overall scene
text recognition performance.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210054683-5d5f3117-4bee-43d6-a36c-8e645d47c2b1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{wan2020vocabulary,
title={On vocabulary reliance in scene text recognition},
author={Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11425--11434},
year={2020}
}'
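
The mutual learning strategy in this abstract lends itself to a short sketch: an attention-based recognizer and a segmentation-based recognizer each minimize their own cross-entropy plus a KL term toward the other's detached predictions. The loss shape and weighting below are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of mutual learning between two recognizer families.
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_att, logits_seg, targets, alpha=1.0):
    # logits_*: (B, T, C) per-character scores; targets: (B, T) class ids.
    ce_att = F.cross_entropy(logits_att.flatten(0, 1), targets.flatten())
    ce_seg = F.cross_entropy(logits_seg.flatten(0, 1), targets.flatten())
    p_att = F.log_softmax(logits_att, dim=-1)
    p_seg = F.log_softmax(logits_seg, dim=-1)
    # Each branch imitates the other's frozen distribution (stop-gradient),
    # so the segmentation branch's visual grounding can temper the
    # attention branch's vocabulary reliance, and vice versa.
    kl_att = F.kl_div(p_att, p_seg.detach(), log_target=True,
                      reduction='batchmean')
    kl_seg = F.kl_div(p_seg, p_att.detach(), log_target=True,
                      reduction='batchmean')
    return ce_att + ce_seg + alpha * (kl_att + kl_seg)

loss = mutual_learning_loss(torch.randn(4, 25, 37, requires_grad=True),
                            torch.randn(4, 25, 37, requires_grad=True),
                            torch.randint(0, 37, (4, 25)))
loss.backward()
```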
@@ -0,0 +1,79 @@
Title: 'Parallel and Robust Text Rectifier for Scene Text Recognition'
Abbreviation: PRTR
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) is to recognize text appearing in images.
Current state-of-the-art STR methods usually adopt a multi-stage framework which
uses a rectifier to iteratively rectify errors from the previous stage. However, the
rectifiers of those models are not proficient in addressing the misalignment
problem. To alleviate this problem, we propose a novel network named Parallel
and Robust Text Rectifier (PRTR), which consists of a bi-directional position
attention initial decoder and a sequence of stacked Robust Visual Semantic
Rectifiers (RVSRs). In essence, PRTR is creatively designed as a coarse-to-fine
architecture that exploits a sequence of rectifiers for repeatedly refining the
prediction in a stage-wise manner. RVSR is a core component in the proposed
model, which comprises two key modules: a Dual-Path Semantic Alignment (DPSA)
module and a Visual-Linguistic Alignment (VLA) module. DPSA can rectify the linguistic
misalignment issues via the global semantic features that are derived from the
recognized characters as a whole, while VLA re-aligns the linguistic features
with visual features by an attention model to avoid the overfitting of
linguistic features. All parts of PRTR are non-autoregressive (parallel), and
each RVSR re-aligns its output according to the linguistic features and the
visual features, so it is robust to misalignment errors. Extensive experiments
on mainstream benchmarks demonstrate that the proposed model can alleviate
the misalignment problem to a large extent and outperforms state-of-the-art
models.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052800-ab1f29d1-de7c-43bd-8297-b13cd83e28d3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- SA
- MJ
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.0
SVT:
WAICS: 94.4
IC13:
WAICS: 95.8
IC15:
WAICS: 86.1
SVTP:
WAICS: 89.8
CUTE:
WAICS: 96.5
Bibtex: '@inproceedings{prtr2022parallel,
title={Parallel and Robust Text Rectifier for Scene Text Recognition},
booktitle={British Machine Vision Conference},
year={2022}
}'
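
As a sketch of the coarse-to-fine, fully parallel design this abstract describes: an initial non-autoregressive prediction is refined by a stack of rectifier stages, each cross-attending the current text hypothesis back onto the visual features. The single cross-attention block below is a stand-in for the DPSA and VLA modules; every size and name is an assumption for illustration.

```python
# Hedged sketch of stacked, parallel rectification (illustrative only).
import torch
import torch.nn as nn

class Rectifier(nn.Module):
    """One refinement stage: re-align linguistic tokens to visual features."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls = nn.Linear(dim, vocab)

    def forward(self, prev_ids, visual):            # prev_ids: (B, T)
        q = self.embed(prev_ids)                    # linguistic features
        aligned, _ = self.cross(q, visual, visual)  # re-align to vision
        return self.cls(aligned)                    # refined logits (B, T, V)

dim, vocab, T = 256, 37, 25
visual = torch.randn(2, 128, dim)                   # encoder output
init_logits = torch.randn(2, T, vocab)              # parallel initial decoder
ids = init_logits.argmax(-1)
for stage in [Rectifier(dim, vocab) for _ in range(3)]:  # stacked stages
    ids = stage(ids, visual).argmax(-1)             # each stage is parallel
print(ids.shape)                                    # (2, 25) final prediction
```

Because every stage predicts all positions at once, the depth of the stack (not the sequence length) bounds latency, which is the point of the parallel design.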