Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plotting – again #377

Merged
merged 30 commits into from
Apr 10, 2024
Merged

Plotting – again #377

merged 30 commits into from
Apr 10, 2024

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Mar 4, 2024

After several attempts by @Shreeshrii to share her excellent plotting scripts, each of which was unfortunately thwarted by bad circumstances (other big changes occurring at the same time), here comes a plotting facility again.

I based this on the ocrddata branch of her fork, cherry-picking only the two relevant changesets, resolving conflicts and then refactoring to make this better fit our makefileization.

Usage is simply make plot, which will only work after make training. (I could also make this dependency explicit, but that would cause make plot to start the training if it did not happen already for that combination of variables.)

The output files will be created in $OUTPUT_DIR/$MODEL_NAME.plot_log.png, e.g.
herrnhut-kurrent tess finetuned-htrbin plot_log

and $OUTPUT_DIR/$MODEL_NAME.plot_cer.png, e.g.
herrnhut-kurrent tess finetuned-htrbin plot_cer

All intermediate files (except for the lstmeval log files generated under $OUTPUT_DIR/eval/*.log because they are valuable in their own right) are marked as such and therefore removed by make.

Perhaps we should discuss how both plots could be combined into a single one (which is probably what @Shreeshrii tried to do already) – I can see that there's a problem by the granularity these data points are recorded (training iterations for validation during lstmtraining vs. learning iterations for validation afterwards via external lstmeval). But IIUC we have everything it takes to be able to combine them (twin y plot with synced x axes)...

bertsky and others added 8 commits March 3, 2024 23:33
- avoid creating extra subdirectories (would necessitate mkdir and extra deps)
- produce all necessary outputs in OUTPUT_DIR
- make temporary files .INTERMEDIATE (so they will be removed automatically afterwards)
- define file names close to where they are actually used and in the correct order
  (no need for the user to see/customise them)
- use fixed name for training log file and plot file (derived from OUTPUT_DIR and MODEL_NAME)
- simplify extraction pipelines, make dependencies explicit
- make checkpoint evaluations directly dependent on checkpoint models
- provide phony (aliased aggregate) target lstmeval for these
- provide phony (aliased aggregate) target plot for the actual plots
- never cut off ranges from y scale (CER)
- make naming and coloring more consistent
@bertsky
Copy link
Collaborator Author

bertsky commented Mar 5, 2024

With the last commit I did re-instante @Shreeshrii's LOG_FILE variable.

The big pro is that thus you can opt in to plotting even older logs, e.g.

make plot LOG_FILE=nohup.out

@zdenop
Copy link
Contributor

zdenop commented Mar 7, 2024

I just make quick test on openSUSE (15.5) and here are a few suggestions:

  • it would be nice to have a short example of how to make example plot on example data:
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
mkdir data
unzip ocrd-testset.zip -d data/ocrd-ground-truth
...
# install needed requirements
...
nohup make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 > plot/TESSTRAIN.LOG &
make plot MODEL_NAME=ocrd
  • I removed python2 from openSUSE and I got this error:
python plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv
/bin/bash: python: command not found

What about using PY_CMD (as rest of Makefile)?

  • When I run manually python3 plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv I got error:
Traceback (most recent call last):
  File "/home/podobny/Projekty/tesstrain/plot_cer.py", line 6, in <module>
    import matplotlib
ModuleNotFoundError: No module named 'matplotlib'

It would be good to mentioned that user should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.

@stweil
Copy link
Member

stweil commented Mar 7, 2024

It would be good to mention that user should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.

Both are mentioned in requirements.txt, so running pip3 install -r requirements.txt is sufficient.

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 7, 2024

@zdenop very good points – thanks!
I'll address these in a few follow-up commits.

Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Copy link
Member

@stweil stweil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bertsky, some of the commits could already be applied to the main branch. Would it be okay if I cherry-pick them? You (or I) would have to rebase your plotting branch after that. I think we can improve plotting faster by getting it integrated like that.

.gitignore Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
plot_cer.py Fixed Show fixed Hide fixed
plot_log.py Fixed Show fixed Hide fixed
@bertsky bertsky requested a review from stweil March 7, 2024 22:32
@bertsky
Copy link
Collaborator Author

bertsky commented Mar 20, 2024

So?

@zdenop
Copy link
Contributor

zdenop commented Mar 25, 2024

@bertsky : please reformat the Python scripts with blue (or black) there is plenty of formal formatting issues (missing spaces).

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 30, 2024

In light of tesseract-ocr/tesseract#3763 (comment) I tend to prefer changing from fast to best models for evaluation.

@bertsky
Copy link
Collaborator Author

bertsky commented Apr 5, 2024

@zdenop

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.

I fail to see the ambiguity. The manual says they think l can be easily confused with 1. I really don't care for such subjective stances.

please reformat the Python scripts with blue (or black) there is plenty of formal formatting issues (missing spaces).

Not my code originally, so I don't care. But I also don't believe in code stylers. Since @stweil seems to be an avid user of these, I don't think my help is needed in that respect.

I'll resolve the conflict on the readme arising from your concurrent edits in the plotting section, then I'll be done here, I think.

.gitignore Outdated Show resolved Hide resolved
@@ -126,19 +128,21 @@ help:
@echo " unicharset Create unicharset"
@echo " charfreq Show character histogram"
@echo " lists Create lists of lstmf filenames for training and eval"
@echo " training Start training"
@echo " training Start training (i.e. create .checkpoint files)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@echo " training Start training (i.e. create .checkpoint files)"
@echo " training Start training"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a less technical description here and avoid details which might change any time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said above, the description of traineddata already mentions .checkpoint files, too. And so does evaluation. So as a matter of clarity and consistency, it should be mentioned here as well.

Since this entire repo is just a small makefile wrapper around the actual training tools, which are only properly documented in tessdoc along these technical details, I prefer keeping it the way it is.

@echo " traineddata Create best and fast .traineddata files from each .checkpoint file"
@echo " proto-model Build the proto model"
@echo " tesseract-langdata Download stock unicharsets"
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval"
@echo " evaluation Evaluate intermediate models on evaluation dataset"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use help text with less technical details

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above.

Copy link
Member

@stweil stweil Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't your help text even wrong? I though that lstmeval is running on .traineddata files, not on .checkpoint files.

If you agree, I'd commit my suggestions and apply your pull request.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't agree, hence my comment above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't your help text even wrong? I though that lstmeval is running on .traineddata files, not on .checkpoint files.

No, it's running on .traineddata extracted from .checkpoint files. The help texts only mention the former when relevant to the user (traineddata target).

@echo " TESSDATA_REPO Tesseract model repo to use (_fast or _best). Default: $(TESSDATA_REPO)"
@echo " MAX_ITERATIONS Max iterations. Default: $(MAX_ITERATIONS)"
@echo " EPOCHS Set max iterations based on the number of lines for the training. Default: none"
@echo " DEBUG_INTERVAL Debug Interval. Default: $(DEBUG_INTERVAL)"
@echo " LEARNING_RATE Learning rate. Default: $(LEARNING_RATE)"
@echo " NET_SPEC Network specification. Default: $(NET_SPEC)"
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)"
@echo " NET_SPEC Network specification (in VGSL), only used for training without START_MODEL. Default: $(NET_SPEC)"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it more helpful if the relevant terms fine-tune and scratch were already mentioned in their respective variable's short help string. (Since START_MODEL already mentions the former, NET_SPEC now mentions the latter.) Also, scratch already precludes a start model by definition.

@echo " LANG_TYPE Language Type - Indic, RTL or blank. Default: '$(LANG_TYPE)'"
@echo " PSM Page segmentation mode. Default: $(PSM)"
@echo " RANDOM_SEED Random seed for shuffling of the training data. Default: $(RANDOM_SEED)"
@echo " RATIO_TRAIN Ratio of train / eval training data. Default: $(RATIO_TRAIN)"
@echo " TARGET_ERROR_RATE Default Target Error Rate. Default: $(TARGET_ERROR_RATE)"
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)"
@echo " LOG_FILE File copy of the training protocol (also used for plotting). Default: $(LOG_FILE)"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like your formulation. LOG_FILE is not a protocol of the training procedure (like Tensorboard), it's just the output of the training recipe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer "log" instead of "protocol", that would be fine for me, too. The output of the training is much more than the printed messages: the most important output is the new model file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A LOG_FILE is a log file and not a result file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Log file of training process (also used as input for plotting)" would also be fine for me.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Comment on lines +215 to +216
Plotting can even be done while training is still running, and will depict the training status
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Plotting can even be done while training is still running, and will depict the training status
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.)
Plotting can even be done while training is still running, and will show the training status
up to that point. (It can be re-run any time the `LOG_FILE` is changed or new checkpoints are written.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like depict better than show here, and has changed better than is changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deepl and I disagree.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deepl does not know anything about tesstrain, and I am pretty confident in my assessment of what the English language allows me to express.

Besides, IMO it should not be your role as a maintainer to micromanage individual contributions to the letter. This is becoming extremely tedious – I thought this addition is long overdue and would be welcomed.

Copy link
Member

@stweil stweil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bertsky, I still have several change requests for the help text in Makefile. Those changes then must be made in README.md as well.

@stweil stweil merged commit 4770fee into tesseract-ocr:main Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants