question: How to Diagnose Overfitting and Underfitting of Tesseract Models? #200
Comments
@Shreeshrii Not sure about
You can always “finish” a checkpoint and convert it into a working Tesseract model via
I have recently tried splitting the training data into three sets, e.g. using 80% for training, keeping 10% for eval during training, and 10% for a validation test with the traineddata files built from checkpoints. I find that there are differences between the training CER for a checkpoint and the validation-set CER for the traineddata built from the same checkpoint. The CER from the eval set used during training is not easily available. Earlier I thought that the model with the lowest training CER was the best. However, after running tests with the validation set, that does not seem to be true, because that model might have been overfitted to the training set. Hence my question ...
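For illustration, here is a minimal sketch of such an 80/10/10 split over tesstrain-style ground-truth line files. The directory path, the .gt.txt/.lstmf naming and the list-file names are assumptions to adapt to your own setup.

```python
# Sketch: split ground-truth line pairs into train/eval/validation lists (80/10/10).
# Assumes a tesstrain-style layout: one image per line with a matching .gt.txt
# transcription, and lstmtraining list files that reference the compiled .lstmf files.
import glob
import random

GT_DIR = "data/MODEL_NAME-ground-truth"   # hypothetical path -- adjust
random.seed(42)                           # reproducible split

lines = sorted(glob.glob(f"{GT_DIR}/*.gt.txt"))
random.shuffle(lines)

n = len(lines)
n_train = int(0.8 * n)
n_eval = int(0.1 * n)

splits = {
    "list.train": lines[:n_train],                   # 80% used for training
    "list.eval": lines[n_train:n_train + n_eval],    # 10% evaluated during training
    "list.validate": lines[n_train + n_eval:],       # 10% held out for lstmeval
}

for name, subset in splits.items():
    with open(name, "w") as f:
        # write one .lstmf path per line, as expected by --train_listfile / --eval_listfile
        f.write("\n".join(p.replace(".gt.txt", ".lstmf") for p in subset) + "\n")
```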
I am not familiar with this. Please elaborate or provide a link with further info.
Here are my results from a validation run:
I was able to extract it from the training log file.
The eval iterations lag behind the training iterations.
I was able to pull out the training CER and eval CER data from the training log file. Here is the script to extract the data and plot it.
plot.py
EDIT: This incorrectly uses WER for one of the data columns, so the plot is not accurate. Looks like my problem might be using too many lines of training data (about 150,000+) and then killing the training process without allowing for enough epochs. EDIT: Plot with correct CER data
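For reference, a minimal sketch of what such a log-parsing and plotting script can look like (the actual plot.py is attached above). The regex patterns assume lstmtraining log lines of the form shown in the code comments; the exact wording differs between Tesseract versions (see tesseract-ocr/tesseract#3644), so treat them as assumptions.

```python
# Sketch: pull training CER and eval CER out of an lstmtraining log and plot them.
# Assumed log line shapes:
#   "At iteration 425/1400/1400, Mean rms=..., char train=8.5%, ..."
#   "At iteration 1400, stage 0, Eval Char error rate=12.3, Word error rate=34.5"
import re
import matplotlib.pyplot as plt

LOG = "training.log"  # hypothetical log file name -- adjust

# The three slash-separated numbers in the training line are different iteration
# counters; pick the one that lines up with the iteration number of the eval lines.
train_pat = re.compile(r"At iteration (\d+)/\d+/\d+.*?char train=([\d.]+)%")
eval_pat = re.compile(r"At iteration (\d+), stage \d+, Eval Char error rate=([\d.]+)")

train_pts, eval_pts = [], []
with open(LOG) as f:
    for line in f:
        if (m := train_pat.search(line)):
            train_pts.append((int(m.group(1)), float(m.group(2))))
        elif (m := eval_pat.search(line)):
            eval_pts.append((int(m.group(1)), float(m.group(2))))

if train_pts:
    plt.plot(*zip(*train_pts), label="Training CER")
if eval_pts:
    plt.plot(*zip(*eval_pts), "o-", label="Eval CER")
plt.xlabel("iterations")
plt.ylabel("character error rate (%)")
plt.legend()
plt.savefig("cer_plot.png")
```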
Wow, that's a great tool. Could you add it to the repo? We could think about utilizing it in the Makefile.
The difference between test and eval is remarkable and could be an indicator of having lines in the evaluation set which are very different from what is seen during training. Why is the number of data points for Evaluation CER so small?
Calamari writes a number of models during its training process (not only the best, as Tesseract does). In their papers, @chreul and colleagues propose a strategy called voting -- which is more or less a majority decision over the outputs of multiple models during the recognition stage -- and use all the models created during training as voters. Tesseract uses a different kind of combination of multiple OCR models: it returns the character (or character sequence) which received the highest probability from one of the models. Usually you use e.g. a Greek, a Latin and a Fraktur model when running Tesseract. What one could try instead/in addition is the combination of Tesseract models from different stages of the training process (i.e. checkpoints).
Yes, if I remember correctly, @stweil had recommended using the last three checkpoint models to get better results.
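A hedged sketch of that idea: convert the last few checkpoints into traineddata files with lstmtraining --stop_training and combine them at recognition time via -l model1+model2+... All paths, model names and the choice of "last three checkpoints" below are placeholders.

```python
# Sketch: turn several checkpoints into .traineddata files and let Tesseract
# combine them at recognition time; it returns the result with the highest
# probability among the listed models.
import subprocess

CHECKPOINTS = [
    "data/foo_1.234_567_8900.checkpoint",   # hypothetical checkpoint files
    "data/foo_1.210_570_9000.checkpoint",
    "data/foo_1.198_572_9100.checkpoint",
]
BASE_TRAINEDDATA = "data/foo/foo.traineddata"  # starter traineddata used for training

models = []
for i, ckpt in enumerate(CHECKPOINTS):
    out = f"tessdata/foo_ckpt{i}.traineddata"
    # "finish" the checkpoint into a working Tesseract model
    subprocess.run([
        "lstmtraining", "--stop_training",
        "--continue_from", ckpt,
        "--traineddata", BASE_TRAINEDDATA,
        "--model_output", out,
    ], check=True)
    models.append(f"foo_ckpt{i}")

# Recognize with all converted models combined via "+"
subprocess.run([
    "tesseract", "page.png", "page",
    "--tessdata-dir", "tessdata",
    "-l", "+".join(models),
], check=True)
```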
The CER is from lstmtraining and the eval that it runs during lstmtraining using the eval list. Edit: I had accidentally used WER instead of CER. The corrected plot does not show that much difference.
I do not know the algorithm that Tesseract uses internally to decide when to run the eval. @stweil may be able to explain it. I have not run lstmeval on the validation set yet. I will do that and add it to the plot.
@wrznr Thanks for pointing out the difference between the evaluation and training error rates. I had accidentally used the WER from the log file instead of CER for the evaluation run. I have fixed it now. I can submit the Python script as a PR, but it has hardcoded values pertaining to my data; it will need to be generalized for use with the Makefile. Here are the plots from two different runs of training for Sanskrit, which has support for Devanagari, English and IAST (diacritics to support Sanskrit). In the old run displayed above, I restarted training a few times, which may account for sharp changes in the eval values. It also has the plot from the validation run.
@Shreeshrii Great stuff! Since the evaluation error is still on par with or even below the training error, you can rule out overfitting and go on to train more iterations.
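As a rough rule-of-thumb sketch of that diagnosis (the 20% relative gap used here is an arbitrary assumption, not anything defined by Tesseract):

```python
# Sketch: flag possible overfitting when the eval CER drifts well above the
# training CER at comparable iteration counts.
def diagnose(train_pts, eval_pts, rel_gap=0.20):
    """train_pts/eval_pts: lists of (iteration, CER) as extracted from the log."""
    for it_eval, cer_eval in eval_pts:
        # compare against the training data point closest in iteration count
        it_train, cer_train = min(train_pts, key=lambda p: abs(p[0] - it_eval))
        if cer_eval > cer_train * (1 + rel_gap):
            print(f"iter ~{it_eval}: eval CER {cer_eval:.2f} >> train CER {cer_train:.2f} "
                  "-- possible overfitting")
        else:
            print(f"iter ~{it_eval}: eval CER tracks train CER -- keep training")
```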
New improved version of plotting in PR #218 (comment)
Hi @Shreeshrii, I'm specifying the eval_listfile, but I'm not getting the eval char error rate during training like you do.
Meanwhile, it turned out that the error values estimated during the training process are not to be trusted: #261
This has been mitigated to some extent by tesseract-ocr/tesseract#3644, which changed the descriptions given during training to reflect more honestly what is calculated. Regarding the OP's question and @Shreeshrii's proposal for a plotting facility, #377 is the latest incarnation of this.
Is there a way to diagnose Overfitting and Underfitting of Tesseract Models?
@stweil had suggested in a different thread that one should evaluate all models to find the best fit for the eval data.
Is there a way to extract the checkpoint details (number of iterations and training CER) and the CER from lstmeval, to find the best model and graph the results, similar to the output shown in this article for Keras?
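One possible approach is sketched below: run lstmeval on every checkpoint against the held-out validation list and collect (iteration, CER) pairs for plotting. The output-parsing regex and the assumption that the iteration count is the trailing number in the checkpoint file name are both guesses to adapt to your Tesseract version and naming scheme.

```python
# Sketch: evaluate all checkpoints with lstmeval and collect (iteration, CER) pairs.
import glob
import re
import subprocess

BASE_TRAINEDDATA = "data/foo/foo.traineddata"   # starter traineddata (assumption)
EVAL_LIST = "list.validate"                     # held-out .lstmf list (assumption)

# Newer Tesseract prints "BCER eval=...", older versions "Eval Char error rate=..."
cer_re = re.compile(r"(?:BCER eval|Eval Char error rate)\s*=\s*([\d.]+)")
# Assumes the iteration count is the last number in the checkpoint file name
iter_re = re.compile(r"_(\d+)\.checkpoint$")

results = []
for ckpt in sorted(glob.glob("data/checkpoints/*.checkpoint")):
    proc = subprocess.run(
        ["lstmeval", "--model", ckpt,
         "--traineddata", BASE_TRAINEDDATA,
         "--eval_listfile", EVAL_LIST],
        capture_output=True, text=True, check=True,
    )
    m_cer = cer_re.search(proc.stdout + proc.stderr)
    m_it = iter_re.search(ckpt)
    if m_cer and m_it:
        results.append((int(m_it.group(1)), float(m_cer.group(1))))

for iteration, cer in sorted(results):
    print(f"{iteration}\t{cer}")   # ready to feed into the plotting script above
```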