-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plotting – again #377
Plotting – again #377
Conversation
- avoid creating extra subdirectories (would necessitate mkdir and extra deps) - produce all necessary outputs in OUTPUT_DIR - make temporary files .INTERMEDIATE (so they will be removed automatically afterwards) - define file names close to where they are actually used and in the correct order (no need for the user to see/customise them) - use fixed name for training log file and plot file (derived from OUTPUT_DIR and MODEL_NAME) - simplify extraction pipelines, make dependencies explicit - make checkpoint evaluations directly dependent on checkpoint models - provide phony (aliased aggregate) target lstmeval for these - provide phony (aliased aggregate) target plot for the actual plots - never cut off ranges from y scale (CER) - make naming and coloring more consistent
With the last commit I did re-instante @Shreeshrii's LOG_FILE variable. The big pro is that thus you can opt in to plotting even older logs, e.g.
|
I just make quick test on openSUSE (15.5) and here are a few suggestions:
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
mkdir data
unzip ocrd-testset.zip -d data/ocrd-ground-truth
...
# install needed requirements
...
nohup make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 > plot/TESSTRAIN.LOG &
make plot MODEL_NAME=ocrd
python plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv
/bin/bash: python: command not found What about using
Traceback (most recent call last):
File "/home/podobny/Projekty/tesstrain/plot_cer.py", line 6, in <module>
import matplotlib
ModuleNotFoundError: No module named 'matplotlib' It would be good to mentioned that user should install matplotlib and pandas ( |
Both are mentioned in |
@zdenop very good points – thanks! |
Signed-off-by: Stefan Weil <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bertsky, some of the commits could already be applied to the main branch. Would it be okay if I cherry-pick them? You (or I) would have to rebase your plotting branch after that. I think we can improve plotting faster by getting it integrated like that.
Co-authored-by: Stefan Weil <[email protected]>
Co-authored-by: Stefan Weil <[email protected]>
Co-authored-by: Stefan Weil <[email protected]>
So? |
In light of tesseract-ocr/tesseract#3763 (comment) I tend to prefer changing from |
I fail to see the ambiguity. The manual says they think l can be easily confused with 1. I really don't care for such subjective stances.
Not my code originally, so I don't care. But I also don't believe in code stylers. Since @stweil seems to be an avid user of these, I don't think my help is needed in that respect. I'll resolve the conflict on the readme arising from your concurrent edits in the plotting section, then I'll be done here, I think. |
@@ -126,19 +128,21 @@ help: | |||
@echo " unicharset Create unicharset" | |||
@echo " charfreq Show character histogram" | |||
@echo " lists Create lists of lstmf filenames for training and eval" | |||
@echo " training Start training" | |||
@echo " training Start training (i.e. create .checkpoint files)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echo " training Start training (i.e. create .checkpoint files)" | |
@echo " training Start training" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer a less technical description here and avoid details which might change any time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said above, the description of traineddata
already mentions .checkpoint files, too. And so does evaluation
. So as a matter of clarity and consistency, it should be mentioned here as well.
Since this entire repo is just a small makefile wrapper around the actual training tools, which are only properly documented in tessdoc along these technical details, I prefer keeping it the way it is.
@echo " traineddata Create best and fast .traineddata files from each .checkpoint file" | ||
@echo " proto-model Build the proto model" | ||
@echo " tesseract-langdata Download stock unicharsets" | ||
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval" | |
@echo " evaluation Evaluate intermediate models on evaluation dataset" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use help text with less technical details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't your help text even wrong? I though that lstmeval is running on .traineddata files, not on .checkpoint files.
If you agree, I'd commit my suggestions and apply your pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't agree, hence my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't your help text even wrong? I though that lstmeval is running on .traineddata files, not on .checkpoint files.
No, it's running on .traineddata extracted from .checkpoint files. The help texts only mention the former when relevant to the user (traineddata
target).
@echo " TESSDATA_REPO Tesseract model repo to use (_fast or _best). Default: $(TESSDATA_REPO)" | ||
@echo " MAX_ITERATIONS Max iterations. Default: $(MAX_ITERATIONS)" | ||
@echo " EPOCHS Set max iterations based on the number of lines for the training. Default: none" | ||
@echo " DEBUG_INTERVAL Debug Interval. Default: $(DEBUG_INTERVAL)" | ||
@echo " LEARNING_RATE Learning rate. Default: $(LEARNING_RATE)" | ||
@echo " NET_SPEC Network specification. Default: $(NET_SPEC)" | ||
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)" | |
@echo " NET_SPEC Network specification (in VGSL), only used for training without START_MODEL. Default: $(NET_SPEC)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it more helpful if the relevant terms fine-tune
and scratch
were already mentioned in their respective variable's short help string. (Since START_MODEL
already mentions the former, NET_SPEC
now mentions the latter.) Also, scratch
already precludes a start model by definition.
@echo " LANG_TYPE Language Type - Indic, RTL or blank. Default: '$(LANG_TYPE)'" | ||
@echo " PSM Page segmentation mode. Default: $(PSM)" | ||
@echo " RANDOM_SEED Random seed for shuffling of the training data. Default: $(RANDOM_SEED)" | ||
@echo " RATIO_TRAIN Ratio of train / eval training data. Default: $(RATIO_TRAIN)" | ||
@echo " TARGET_ERROR_RATE Default Target Error Rate. Default: $(TARGET_ERROR_RATE)" | ||
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)" | |
@echo " LOG_FILE File copy of the training protocol (also used for plotting). Default: $(LOG_FILE)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like your formulation. LOG_FILE is not a protocol of the training procedure (like Tensorboard), it's just the output of the training recipe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer "log" instead of "protocol", that would be fine for me, too. The output of the training is much more than the printed messages: the most important output is the new model file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A LOG_FILE is a log file and not a result file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Log file of training process (also used as input for plotting)" would also be fine for me.
Plotting can even be done while training is still running, and will depict the training status | ||
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plotting can even be done while training is still running, and will depict the training status | |
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.) | |
Plotting can even be done while training is still running, and will show the training status | |
up to that point. (It can be re-run any time the `LOG_FILE` is changed or new checkpoints are written.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like depict
better than show
here, and has changed
better than is changed
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deepl and I disagree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deepl does not know anything about tesstrain, and I am pretty confident in my assessment of what the English language allows me to express.
Besides, IMO it should not be your role as a maintainer to micromanage individual contributions to the letter. This is becoming extremely tedious – I thought this addition is long overdue and would be welcomed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bertsky, I still have several change requests for the help text in Makefile. Those changes then must be made in README.md as well.
After several attempts by @Shreeshrii to share her excellent plotting scripts, each of which was unfortunately thwarted by bad circumstances (other big changes occurring at the same time), here comes a plotting facility again.
I based this on the ocrddata branch of her fork, cherry-picking only the two relevant changesets, resolving conflicts and then refactoring to make this better fit our makefileization.
Usage is simply
make plot
, which will only work aftermake training
. (I could also make this dependency explicit, but that would causemake plot
to start the training if it did not happen already for that combination of variables.)The output files will be created in
$OUTPUT_DIR/$MODEL_NAME.plot_log.png
, e.g.and
$OUTPUT_DIR/$MODEL_NAME.plot_cer.png
, e.g.All intermediate files (except for the lstmeval log files generated under
$OUTPUT_DIR/eval/*.log
because they are valuable in their own right) are marked as such and therefore removed by make.Perhaps we should discuss how both plots could be combined into a single one (which is probably what @Shreeshrii tried to do already) – I can see that there's a problem by the granularity these data points are recorded (training iterations for validation during lstmtraining vs. learning iterations for validation afterwards via external lstmeval). But IIUC we have everything it takes to be able to combine them (twin y plot with synced x axes)...