This is an undergraduate research project at the University of Hong Kong, supervised by Prof. Kenneth K.Y. Wong, in which we achieved SOTA performance in terms of MS-SSIM (0.512578) and Local Distortion (7.581896). We tried different Content+3D combinations for multimodal learning and found that RGB+D is the best combination for this task.
Please refer to DewarpNet for how to get started, and make sure the data is loaded correctly. We provide models, loaders, training scripts, and inference scripts for various combinations of Content+3D, as well as our joint training code.
You can download our best model here and our results here. We evaluate the results using the same code as DocUNet, with MATLAB R2022a. Note that in the benchmark set, the 64th sample is upside down; please rotate it back before evaluation (see the snippet below).
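If helpful, a one-off fix in Python (the file path is hypothetical; adapt it to however your results are stored):

```python
from PIL import Image

# Hypothetical path to the dewarped output for benchmark sample 64.
path = "results/64.png"
img = Image.open(path)
img.rotate(180).save(path)  # a 180-degree rotation puts it right side up
```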
- We achieved SOTA performance compared with methods using the same pipeline. Specifically, we improved on the SOTA method by 0.32% and 1.93% in terms of MS-SSIM and LD respectively, using only about 1/3 of the parameters and 79.51% of the GPU memory.
- In document dewarping, we are the first to combine RGB and 3D information for multimodal learning.
- We propose the Adjoint Loss and the Identical Loss so that the model can distinguish between 3D and RGB information.
- For the semantic segmentation task, we use cross-entropy loss.
- For the depth prediction task, we use L1 loss and the ground-truth masked image.
- For the BM prediction model, we input the ground-truth depth and masked image (see the sketch after this list).
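A minimal PyTorch sketch of this per-stage supervision; all tensor shapes and variable names are illustrative, not the repo's actual ones:

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()  # semantic segmentation
l1_loss = nn.L1Loss()            # depth (and later BM) regression

# Illustrative shapes: batch of 4, 128x128 inputs.
seg_logits = torch.randn(4, 2, 128, 128)     # per-pixel document/background scores
seg_gt = torch.randint(0, 2, (4, 128, 128))  # ground-truth document mask
loss_seg = ce_loss(seg_logits, seg_gt)

depth_pred = torch.randn(4, 1, 128, 128)
depth_gt = torch.randn(4, 1, 128, 128)
loss_depth = l1_loss(depth_pred, depth_gt)   # depth supervised with L1

# When the BM model is trained on its own, its input is the ground-truth
# masked image concatenated with the ground-truth depth:
masked_img_gt = torch.randn(4, 3, 128, 128)
bm_input = torch.cat([masked_img_gt, depth_gt], dim=1)  # 4-channel input
```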
We trained the BM model for 81 epochs, with batch size 200 and learning rate 0.0001. We halve the learning rate when the validation loss does not decrease for 5 consecutive epochs. We do not use the auxiliary loss here because we found it hurts performance at this stage.
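A sketch of that schedule using PyTorch's `ReduceLROnPlateau`; the placeholder network and the stand-in validation loss below are not from the repo:

```python
import torch
import torch.nn as nn

# Placeholder BM network; the real architecture lives in the repo's model files.
bm_model = nn.Conv2d(4, 2, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(bm_model.parameters(), lr=1e-4)
# Halve the learning rate after 5 epochs without validation improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(81):
    # ... run one training epoch with batch size 200 ...
    val_loss = 0.0  # stand-in; pass the real validation loss here
    scheduler.step(val_loss)
```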
We then do joint training of the 3 models, i.e., the latter 2 models take the previous models' outputs as their inputs. We minimize all losses together: the cross-entropy loss for semantic segmentation, the L1 loss for depth prediction, the L1 loss for BM prediction, and the auxiliary losses.
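A minimal sketch of one joint step under stated assumptions (single-layer placeholder networks, equal loss weights, and a hard mask between stages, none of which are confirmed to match the repo):

```python
import torch
import torch.nn as nn

# Placeholder single-layer "networks"; the real architectures live in the repo.
seg_net = nn.Conv2d(3, 2, 3, padding=1)    # RGB -> 2-class mask logits
depth_net = nn.Conv2d(3, 1, 3, padding=1)  # masked RGB -> depth
bm_net = nn.Conv2d(4, 2, 3, padding=1)     # masked RGB + depth -> backward map

ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def joint_step(img, seg_gt, depth_gt, bm_gt):
    seg_logits = seg_net(img)
    # argmax is non-differentiable, so in this sketch the segmentation net
    # learns only from its own cross-entropy term, not from downstream losses.
    mask = seg_logits.argmax(1, keepdim=True).float()
    masked_img = img * mask                         # feed downstream stages
    depth = depth_net(masked_img)
    bm = bm_net(torch.cat([masked_img, depth], dim=1))
    # Equal weights are an assumption; the auxiliary (Adjoint/Identical)
    # losses would be added to this sum as well.
    return ce(seg_logits, seg_gt) + l1(depth, depth_gt) + l1(bm, bm_gt)

loss = joint_step(torch.randn(2, 3, 128, 128),
                  torch.randint(0, 2, (2, 128, 128)),
                  torch.randn(2, 1, 128, 128),
                  torch.randn(2, 2, 128, 128))
loss.backward()
```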
Our framework is based on DewarpNet.