Draft for slim containers #386
Conversation
Early stage, dry review – still trying to make it run for me.
```python
import subprocess

# Later the address (or rather the port) should be dynamic
processing_server_address = "http://localhost:8000"
```
Why not the exposed `OCRD_PS_PORT` here? Or even better: instead of the host side (`localhost`), we should be fine with the Docker network's DNS:

```diff
- processing_server_address = "http://localhost:8000"
+ processing_server_address = "http://ocrd-processing-server:8000"
```
How does that work (I mean the thing with the hostname; I think I know how to set the port via env)? When I am on the host, it cannot resolve `ocrd-processing-server` out of the box. What do I have to change additionally so that I can query the container from the host via its service/host name?
Not from the host (network), but from the virtual Docker network, i.e. from inside another container. See the Compose documentation. (The port then is the container-internal port, BTW.)
After trying some things I think I cannot solve this properly. First, the hostname cannot be the service name because the delegator is run from the host. And I cannot read the port from `.env` because I don't know the working dir the delegator is executed from. So I decided to set the port dynamically in the Makefile, like the processor name. The proposed suggestion at least does not work. Commit: 7736969
> First the hostname cannot be the service name because the delegator is run from the host.

Oh, right! Sorry, weak thinking on my side.

> And I cannot read the port from .env because I don't know the working dir where the delegator is executed from.

Ok, good point. However, we could write all the `.env` values into the venv's `activate` by make, dynamically (i.e. setting the exact values used for the servers):

```make
%/bin/activate:
	$(PYTHON) -m venv $(subst /bin/activate,,$@)
	. $@ && pip install --upgrade pip setuptools wheel
	@echo ". $(CURDIR)/.env" >> $@
```

IMO the `.env` should be the central source for configuration, so you should be able to modify it to your needs after it was generated.
I see a problem with that approach: it only works with the venv activated in bash (not csh or fish), and invoking executables directly would not work either (e.g. `venv/bin/ocrd-some-processor ...`). Btw, in the example code `@echo ". $(CURDIR)/.env" >> $@` does not work as-is; exporting is needed for the variables to be available to a started Python process, e.g. the processor.
But I think it is a good idea to use the `.env` for the config. For now I have implemented another approach: in the Makefile, the env-path is passed into the delegator, which then reads the port from it. I know it is not ideal, so I think we should talk about this again and agree on a proper solution.
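For illustration, the interim "read the port from the env file" step could look roughly like this (a sketch only; `read_env` and the handling of `export` prefixes are assumptions, not the actual delegator code):

```python
def read_env(path):
    """Parse simple KEY=VALUE lines from a generated .env file (sketch)."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # skip blanks, comments, and anything that is not an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.startswith("export "):
                key = key[len("export "):]
            values[key.strip()] = value.strip().strip('"')
    return values

# e.g. in the delegator: port = read_env(env_path).get("OCRD_PS_PORT", "8000")
```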
Have reverted the last commit regarding this and decided to use the proposed approach.
```python
cmd = [
    "ocrd", "network", "client", "processing", "processor",
    processor_name, "--address", processing_server_address
]
subprocess.run(cmd + args[1:])
```
The Processing Server API is asynchronous, and so is its network client. So this CLI will not give the same user experience as the native CLI (for example, you cannot script these calls).
Thus, either we add some callback mechanism here and block until the job is done, or we switch to the Processor Server API which is synchronous (but has no client CLI yet).
It could be as simple as passing `"--callback-url", "http://localhost:" + PORT + "/"` here. The `PORT` should be generated beforehand by an internal background task like so:

```python
PORT = None
async def serve_callback():
    nonlocal PORT
    # set up web server
    def get(path):
        raise Exception("done")
    ...
    # run web server on arbitrary available port
    PORT = ...
server = asyncio.create_task(serve_callback())
```
Then after the `subprocess.run`, you can simply block by doing `await server`. The whole thing must be in a function that you `asyncio.run(...)`.
I have added a waiting mechanism with the Python built-in `HTTPServer`. I tried using your proposed async approach but discarded it before I could completely finish. Async is probably possible, but it is way more complicated (that's why I finally opted out), and I think a non-async HTTP server has the same effect; the waiting cost of a "normal" HTTP server vs. an async one is negligible.
I cannot say I have much more than an idea of how it works internally, but I think a Python process with an async server gets scheduled by the operating system as well anyway. The async stuff with polling and so on is "only" Python-internal. It would only gain time if we had another async task running alongside which could run while the server is waiting.
So I don't think async is helpful here, which is why I use the synchronous server for waiting.
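As a sketch, such a blocking wait with the stdlib `http.server` might look like this (hypothetical names; the actual delegator code may differ):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    """Accepts the job-done callback from the Processing Server (sketch)."""
    def do_POST(self):
        self.send_response(200)
        self.end_headers()
        self.server.callback_received = True
    def log_message(self, *args):
        pass  # keep the delegator's output clean

def wait_for_callback(server):
    """Serve requests one by one until the callback has arrived."""
    server.callback_received = False
    while not server.callback_received:
        server.handle_request()  # blocks until one request is handled

# Bind to port 0 so the OS picks a free port to pass as --callback-url
server = HTTPServer(("localhost", 0), CallbackHandler)
port = server.server_address[1]
callback_url = f"http://localhost:{port}/"
```

The delegator would then pass `callback_url` via `--callback-url` and call `wait_for_callback(server)` after the `subprocess.run`.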
```yaml
depends_on:
  - ocrd-processing-server
  - ocrd-mongodb
  - ocrd-rabbitmq
# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
```
If timing is an issue, I suggest to change the dependency type:

```diff
-depends_on:
-  - ocrd-processing-server
-  - ocrd-mongodb
-  - ocrd-rabbitmq
-# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
+depends_on:
+  - ocrd-processing-server
+  ocrd-mongodb:
+    condition: service_started
+  ocrd-rabbitmq:
+    condition: service_started
```
The underlying problem is the following: the container for the queue is started and running, but it needs 1-3 seconds until queue creation is possible, while the processing worker tries to create its queue right away.
This suggestion (`service_started`) does not work, because docker-compose considers the queue service started while in reality it is not yet ready to be used. I have a similar issue in another project and have already tried a few things to solve this. I think the only solution from the Docker side would be to implement a (manual) health-check for the rabbitmq container, but for that I'd have to extend the rabbitmq image, which I do not want.
For this PR to function, some extension to core is needed anyway. There I want to add an optional queue-creation timeout to the worker startup, so that it waits a few seconds before adding its queue, or retries a few times. But this restart-fix was the fastest way to get there, which is why it is here, and I agree that it should be removed eventually. I will mark this as resolved as soon as the needed changes to core are made (which require one change to this PR as well).
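The retry variant mentioned for the worker startup could be as simple as this sketch (a hypothetical helper under assumed names, not actual core code):

```python
import time

def create_queue_with_retry(create_queue, attempts=5, delay=3.0):
    """Call create_queue(), retrying while RabbitMQ is still starting up."""
    for attempt in range(attempts):
        try:
            return create_queue()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # give RabbitMQ a few seconds to come up
```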
```yaml
{{ processor_name }}:
  extends:
    file: slim-containers-files/{{ processor_group_name }}/docker-compose.yaml
```
So IIUC in the final setup, when we have correct Dockerfiles and compose files in all modules, this will simply become `{{ module_name }}/docker-compose.yaml`?
Makefile-slim (outdated)
```make
export PYTHON ?= python3
VIRTUAL_ENV = $(CURDIR)/venv2
BIN = $(VIRTUAL_ENV)/bin
ACTIVATE_VENV = $(BIN)/activate
OCRD_MODULES = OCRD_CIS OCRD_TESSEROCR
OCRD_CIS = ocrd-cis-ocropy-binarize ocrd-cis-ocropy-dewarp
OCRD_TESSEROCR = ocrd-tesserocr-recognize ocrd-tesserocr-segment-region
PROCESSORS = $(foreach mod,$(OCRD_MODULES),$(foreach proc,$($(mod)), $(proc) ))
DELEGATORS = $(foreach proc,$(PROCESSORS),$(BIN)/$(proc))

slim-venv: docker-compose.yaml .env $(DELEGATORS) | $(VIRTUAL_ENV)
```
None of this would be needed if you added to the existing `Makefile` directly. We already have (sensible + configurable) definitions for `VIRTUAL_ENV`, `OCRD_MODULES` (not a variable but the true submodule name) and `OCRD_EXECUTABLES` (your `PROCESSORS`). There is even an existing delegator mechanism (used for sub-venvs on some modules).
Looks good so far but I need a bit more explanation of all the parts involved, what gets generated and what is actually run.
Changes not yet merged to core are needed for this PR to work. This must be reset later
I have started the process of addressing all comments, but this will take a while. I think I'll address them step by step. I will report back when that is finished.
Force-pushed from 1de5dbb to 10725b9.
Co-authored-by: Robert Sachunsky <[email protected]>
```python
processing_server_address = "http://localhost:{{ OCRD_PS_PORT }}"
env_path = "{{ OCRD_SLIM_ENV_PATH }}"
```
Maybe use https://github.com/theskumar/python-dotenv for that, so you don't have to parse `sh` variable assignments?
This reverts commit cfd1a79. Goal is to read .env when activating the venv and use the env-variable to set the processing server port.
```yaml
context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
  ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
```
```diff
-  ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
+  ocrd network processing-worker ocrd-tesserocr-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
```
```yaml
context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
  ocrd network processing-worker ocrd-tesserocr-segment-region --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
```
We should now use the `ocrd-tesserocr-segment-region worker` syntax instead to get instance caching, right?
superseded by #449
Description:
All processors run in their own Docker container as a processing worker. Additionally, the containers processing-server, rabbitmq and mongodb are running. Executables for processors are delegators to `ocrd network client`.
Related issue: #69 (comment)
Example usage:
1. Clone this repo: `git clone git@github.com:joschrew/ocrd_all.git`
2. Change to the repository: `cd ocrd_all`
3. Initialize the submodules core, ocrd_cis and ocrd_tesserocr, which are needed for the example run: `git submodule update --init core/ ocrd_cis/ ocrd_tesserocr/`
4. Create the venv and docker-compose.yaml: `make -f Makefile-slim slim-venv`
5. Create the data dir (necessary to get the workspaces to the containers): `mkdir data`
6. Start the containers: `docker-compose up -d`
7. Get a workspace for testing.
8. Activate the venv: `. venv2/bin/activate`
9. Run a processor on the workspace: `ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN -m mets.xml`