[Paper List-2] Add 10 textrecog papers #1652

Open · wants to merge 1 commit into dev-1.x
@@ -0,0 +1,79 @@
Title: 'A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding'
Abbreviation: Qiao et al
Tasks:
- TextRecog
Venue: ICFHR
Year: 2022
Lab/Company:
- Tomorrow Advancing Life, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-21648-0_14'
Arxiv: N/A # no arXiv version found; a Google Books copy is at https://books.google.fr/books?hl=zh-CN&lr=&id=hvmdEAAAQBAJ&oi=fnd&pg=PA198&ots=Gg_BaAnXLm&sig=gpJ2h9NjKz1PjLWSfwDpyd8eLZE&redir_esc=y#v=onepage&q&f=false
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, the vision Transformer (ViT) has attracted more and more attention,
and many works introduce the ViT into concrete vision tasks and achieve impressive
performance. However, there are only a few works focused on the applications of
the ViT for scene text recognition. This paper takes a further step and proposes
a strong scene text recognizer with a fully ViT-based architecture.
Specifically, we introduce multi-grained features into both the encoder and
decoder. For the encoder, we adopt a two-stage ViT with different grained
patches, where the first stage extracts extent visual features with 2D
fine-grained patches and the second stage aims at the sequence of contextual
features with 1D coarse-grained patches. The decoder integrates Connectionist
Temporal Classification (CTC)-based and attention-based decoding, where the
two decoding schemes introduce different grained features into the decoder and
benefit from each other with a deep interaction. To improve the extraction of
fine-grained features, we additionally explore self-supervised learning for
text recognition with masked autoencoders. Furthermore, a focusing mechanism is
proposed to let the model target the pixel reconstruction of the text area. Our
proposed method achieves state-of-the-art or comparable accuracies on benchmarks
of scene text recognition with a faster inference speed and nearly 50% reduction
of parameters compared with other recent works.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210053998-385587ef-2b0e-4c9b-a8b8-d6171261c621.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.5
IIIT5K:
WAICS: 96.1
SVT:
WAICS: 92.3
IC13:
WAICS: 95.0
IC15:
WAICS: 86.0
SVTP:
WAICS: 87.0
CUTE:
WAICS: 86.8
Bibtex: '@inproceedings{qiao2022vision,
title={A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding},
author={Qiao, Zhi and Ji, Zhilong and Yuan, Ye and Bai, Jinfeng},
booktitle={International Conference on Frontiers in Handwriting Recognition},
pages={198--212},
year={2022},
organization={Springer}
}'
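
As a reading aid for this entry, here is a minimal PyTorch sketch of the two-stage, multi-grained encoding the abstract describes: a first ViT stage over 2D fine-grained patches, then a second stage over 1D coarse-grained column tokens. All module sizes, patch sizes, and names are illustrative assumptions, not the authors' implementation; the CTC and attention decoders that would consume the output are omitted.

```python
# Hedged sketch of two-stage multi-grained encoding (illustrative only).
import torch
import torch.nn as nn

class TwoStageViTEncoder(nn.Module):
    def __init__(self, dim=256, img_size=(32, 128)):
        super().__init__()
        h, _ = img_size
        # Stage 1: 2D fine-grained 4x4 patches keep local visual detail.
        self.fine_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.stage1 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 3)
        # Stage 2: fold each column of the feature map into one 1D
        # coarse-grained token, roughly a character-wide vertical slice.
        self.coarse_proj = nn.Linear(dim * (h // 4), dim)
        self.stage2 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 3)

    def forward(self, x):                          # x: (B, 3, 32, 128)
        f = self.fine_embed(x)                     # (B, D, 8, 32)
        B, D, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, D) 2D tokens
        tokens = self.stage1(tokens)
        f = tokens.transpose(1, 2).reshape(B, D, H, W)
        cols = f.permute(0, 3, 1, 2).reshape(B, W, D * H)  # 1D columns
        return self.stage2(self.coarse_proj(cols))         # (B, W, D)

enc = TwoStageViTEncoder()
feats = enc(torch.randn(2, 3, 32, 128))
# A CTC head and an attention decoder would both read `feats`, sharing
# the multi-grained features as the abstract describes.
print(feats.shape)  # torch.Size([2, 32, 256])
```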
72 changes: 72 additions & 0 deletions paper_zoo/textrecog/Levenshtein OCR.yaml
@@ -0,0 +1,72 @@
Title: 'Levenshtein OCR'
Abbreviation: Lev-OCR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_19'
Arxiv: 'https://arxiv.org/abs/2209.03594'
Paper Reading URL: 'https://mp.weixin.qq.com/s/Nuc8j3V5YeaXpY64SsIeCw'
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'A novel scene text recognizer based on Vision-Language Transformer
(VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the
proposed method (named Levenshtein OCR, and LevOCR for short) explores an
alternative way for automatically transcribing textual content from cropped
natural images. Specifically, we cast the problem of scene text recognition as
an iterative sequence refinement process. The initial prediction sequence
produced by a pure vision model is encoded and fed into a cross-modal
transformer to interact and fuse with the visual features, to progressively
approximate the ground truth. The refinement process is accomplished via two
basic character-level operations: deletion and insertion, which are learned with
imitation learning and allow for parallel decoding, dynamic length change and
good interpretability. The quantitative experiments clearly demonstrate that
LevOCR achieves state-of-the-art performances on standard benchmarks and the
qualitative analyses verify the effectiveness and advantage of the proposed
LevOCR algorithm. Code will be released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163468-bb6c14ba-134a-4dd5-881e-a7adb4058dcd.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.1
IIIT5K:
WAICS: 96.6
SVT:
WAICS: 92.9
IC13:
WAICS: 96.9
IC15:
WAICS: 86.4
SVTP:
WAICS: 88.1
CUTE:
WAICS: 91.7
Bibtex: '@inproceedings{da2022levenshtein,
title={Levenshtein OCR},
author={Da, Cheng and Wang, Peng and Yao, Cong},
booktitle={European Conference on Computer Vision},
year={2022},
organization={Springer}
}'
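
To make the refinement process in the LevOCR abstract concrete, here is a hedged, framework-free sketch of the deletion/insertion loop. The two policies below are toy stand-ins for the heads LevOCR learns with imitation learning on its cross-modal transformer; every name here is illustrative.

```python
# Hedged sketch of Levenshtein-style iterative sequence refinement:
# alternate deletion and insertion passes until the sequence stops changing.
from typing import Callable, List

def refine(seq: List[str],
           delete_policy: Callable[[List[str]], List[bool]],
           insert_policy: Callable[[List[str]], List[str]],
           max_iters: int = 10) -> List[str]:
    for _ in range(max_iters):
        prev = list(seq)
        # Deletion pass: keep only characters the policy marks as correct.
        keep = delete_policy(seq)
        seq = [c for c, k in zip(seq, keep) if k]
        # Insertion pass: gaps[i] is inserted before seq[i]
        # (gaps[len(seq)] goes at the end); '' means insert nothing.
        gaps = insert_policy(seq)
        out: List[str] = []
        for i, c in enumerate(seq):
            out += ([gaps[i]] if gaps[i] else []) + [c]
        if gaps[len(seq)]:
            out.append(gaps[len(seq)])
        seq = out
        if seq == prev:  # no deletion or insertion fired: converged
            break
    return seq

# Toy policies: delete any 'x', insert 'e' between 'h' and 'l'.
fixed = refine(list("hxllo"),
               lambda s: [c != 'x' for c in s],
               lambda s: ['e' if 0 < i < len(s) and s[i - 1] == 'h'
                          and s[i] == 'l' else ''
                          for i in range(len(s) + 1)])
print(''.join(fixed))  # -> hello
```

Because both passes act on all positions at once, decoding stays parallel, and the edit trace (what was deleted or inserted at each step) gives the interpretability the abstract mentions.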
@@ -0,0 +1,74 @@
Title: 'Multi-Granularity Prediction for Scene Text Recognition'
Abbreviation: MGP-STR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_20'
Arxiv: 'https://arxiv.org/abs/2209.03592'
Paper Reading URL: N/A
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has been an active research topic in
computer vision for years. To tackle this challenging problem, numerous
innovative methods have been successively proposed and incorporating linguistic
knowledge into STR models has recently become a prominent trend. In this work,
we first draw inspiration from the recent progress in Vision Transformer (ViT)
to construct a conceptually simple yet powerful vision STR model, which is built
upon ViT and outperforms previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e., subword representations (BPE and WordPiece)
widely-used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language model
(LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the
performance envelope of STR to an even higher level. Specifically, it achieves
an average recognition accuracy of 93.35% on standard benchmarks. Code will be
released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163378-fc11a79b-fb7d-4a3f-947e-a8f6dfd14dd2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.8
IIIT5K:
WAICS: 96.4
SVT:
WAICS: 94.7
IC13:
WAICS: 97.3
IC15:
WAICS: 87.2
SVTP:
WAICS: 91.0
CUTE:
WAICS: 90.3
Bibtex: '@inproceedings{wang2022multi,
title={Multi-granularity Prediction for Scene Text Recognition},
author={Wang, Peng and Da, Cheng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={339--355},
year={2022},
organization={Springer}
}'
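
A rough sketch of the multi-granularity prediction idea for this entry: three parallel heads read the same ViT features at character, BPE, and WordPiece granularity, and a simple confidence rule picks the final answer. The vocabulary sizes, the learned-query aggregation, and the fusion rule are all assumptions for illustration; the paper's actual aggregation and fusion details differ.

```python
# Hedged sketch of multi-granularity prediction (illustrative only).
import torch
import torch.nn as nn

class MGPHead(nn.Module):
    def __init__(self, dim: int, vocab: int, max_len: int):
        super().__init__()
        # The paper's token aggregation is simplified to learned queries.
        self.queries = nn.Parameter(torch.randn(max_len, dim))
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls = nn.Linear(dim, vocab)

    def forward(self, feats):                      # feats: (B, N, D)
        q = self.queries.expand(feats.size(0), -1, -1)
        agg, _ = self.attn(q, feats, feats)        # (B, max_len, D)
        return self.cls(agg).log_softmax(-1)       # per-position log-probs

dim, feats = 256, torch.randn(2, 257, 256)         # assumed ViT tokens
heads = {g: MGPHead(dim, v, 27) for g, v in
         [('char', 38), ('bpe', 50257), ('wordpiece', 30522)]}
logps = {g: h(feats) for g, h in heads.items()}
# Fuse by picking the granularity whose greedy path is most confident.
conf = {g: lp.max(-1).values.mean() for g, lp in logps.items()}
best = max(conf, key=lambda g: conf[g].item())
print(best, logps[best].argmax(-1).shape)          # chosen head, (2, 27)
```

The subword heads are what inject linguistic knowledge: BPE and WordPiece outputs can only form plausible subwords, so no separate language model is needed.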
@@ -0,0 +1,77 @@
Title: 'On Vocabulary Reliance in Scene Text Recognition'
Abbreviation: Wan et al
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- Megvii
- China University of Mining and Technology
- University of Rochester
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Wan_On_Vocabulary_Reliance_in_Scene_Text_Recognition_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2005.03959'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The pursuit of high performance on public benchmarks has been the
driving force for research in scene text recognition, and notable progress has
been achieved. However, a close investigation reveals a startling fact that the
state-of-the-art methods perform well on images with words within vocabulary but
generalize poorly to images with words outside vocabulary. We call this
phenomenon “vocabulary reliance”. In this paper, we establish an analytical
framework to conduct an in-depth study on the problem of vocabulary reliance
in scene text recognition. Key findings include: (1) Vocabulary reliance is
ubiquitous, i.e., all existing algorithms more or less exhibit such a
characteristic; (2) Attention-based decoders prove weak in generalizing to
words outside vocabulary and segmentation-based decoders perform well in
utilizing visual features; (3) Context modeling is highly coupled with the
prediction layers. These findings provide new insights and can benefit future
research in scene text recognition. Furthermore, we propose a simple yet
effective mutual learning strategy to allow models of two families
(attention-based and segmentation-based) to learn collaboratively. This remedy
alleviates the problem of vocabulary reliance and improves the overall scene
text recognition performance.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210054683-5d5f3117-4bee-43d6-a36c-8e645d47c2b1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{wan2020vocabulary,
title={On vocabulary reliance in scene text recognition},
author={Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11425--11434},
year={2020}
}'
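
The mutual learning strategy in this abstract lends itself to a short sketch: an attention-based recognizer and a segmentation-based recognizer each minimize their own cross-entropy plus a KL term toward the other's detached predictions. The loss shape and weighting below are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of mutual learning between two recognizer families.
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_att, logits_seg, targets, alpha=1.0):
    # logits_*: (B, T, C) per-character scores; targets: (B, T) class ids.
    ce_att = F.cross_entropy(logits_att.flatten(0, 1), targets.flatten())
    ce_seg = F.cross_entropy(logits_seg.flatten(0, 1), targets.flatten())
    p_att = F.log_softmax(logits_att, dim=-1)
    p_seg = F.log_softmax(logits_seg, dim=-1)
    # Each branch imitates the other's frozen distribution (stop-gradient),
    # so the segmentation branch's visual grounding can temper the
    # attention branch's vocabulary reliance, and vice versa.
    kl_att = F.kl_div(p_att, p_seg.detach(), log_target=True,
                      reduction='batchmean')
    kl_seg = F.kl_div(p_seg, p_att.detach(), log_target=True,
                      reduction='batchmean')
    return ce_att + ce_seg + alpha * (kl_att + kl_seg)

loss = mutual_learning_loss(torch.randn(4, 25, 37, requires_grad=True),
                            torch.randn(4, 25, 37, requires_grad=True),
                            torch.randint(0, 37, (4, 25)))
loss.backward()
```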
@@ -0,0 +1,79 @@
Title: 'Parallel and Robust Text Rectifier for Scene Text Recognition'
Abbreviation: PRTR
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) is to recognize text appearing in images.
Current state-of-the-art STR methods usually adopt a multi-stage framework which
uses a rectifier to iteratively rectify errors from the previous stage. However, the
rectifiers of those models are not proficient in addressing the misalignment
problem. To alleviate this problem, we propose a novel network named Parallel
and Robust Text Rectifier (PRTR), which consists of a bi-directional position
attention initial decoder and a sequence of stacked Robust Visual Semantic
Rectifiers (RVSRs). In essence, PRTR is creatively designed as a coarse-to-fine
architecture that exploits a sequence of rectifiers for repeatedly refining the
prediction in a stage-wise manner. RVSR is a core component in the proposed
model, which comprises two key modules: a Dual-Path Semantic Alignment (DPSA)
module and a Visual-Linguistic Alignment (VLA) module. DPSA can rectify the linguistic
misalignment issues via the global semantic features that are derived from the
recognized characters as a whole, while VLA re-aligns the linguistic features
with visual features by an attention model to avoid the overfitting of
linguistic features. All parts of PRTR are non-autoregressive (parallel), and
each RVSR re-aligns its output according to the linguistic features and the
visual features, so it is robust to misalignment errors. Extensive experiments
on mainstream benchmarks demonstrate that the proposed model can alleviate
the misalignment problem to a large extent and outperforms state-of-the-art
models.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052800-ab1f29d1-de7c-43bd-8297-b13cd83e28d3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- SA
- MJ
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.0
SVT:
WAICS: 94.4
IC13:
WAICS: 95.8
IC15:
WAICS: 86.1
SVTP:
WAICS: 89.8
CUTE:
WAICS: 96.5
Bibtex: '@inproceedings{prtr2022parallel,
title={Parallel and Robust Text Rectifier for Scene Text Recognition},
booktitle={British Machine Vision Conference},
year={2022}
}'
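
As a sketch of the coarse-to-fine, fully parallel design this abstract describes: an initial non-autoregressive prediction is refined by a stack of rectifier stages, each cross-attending the current text hypothesis back onto the visual features. The single cross-attention block below is a stand-in for the DPSA and VLA modules; every size and name is an assumption for illustration.

```python
# Hedged sketch of stacked, parallel rectification (illustrative only).
import torch
import torch.nn as nn

class Rectifier(nn.Module):
    """One refinement stage: re-align linguistic tokens to visual features."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls = nn.Linear(dim, vocab)

    def forward(self, prev_ids, visual):            # prev_ids: (B, T)
        q = self.embed(prev_ids)                    # linguistic features
        aligned, _ = self.cross(q, visual, visual)  # re-align to vision
        return self.cls(aligned)                    # refined logits (B, T, V)

dim, vocab, T = 256, 37, 25
visual = torch.randn(2, 128, dim)                   # encoder output
init_logits = torch.randn(2, T, vocab)              # parallel initial decoder
ids = init_logits.argmax(-1)
for stage in [Rectifier(dim, vocab) for _ in range(3)]:  # stacked stages
    ids = stage(ids, visual).argmax(-1)             # each stage is parallel
print(ids.shape)                                    # (2, 25) final prediction
```

Because every stage predicts all positions at once, the depth of the stack (not the sequence length) bounds latency, which is the point of the parallel design.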