
Scheduler IoW Notes

Current Challenges

  • build files (implnet_{jobs,ops}_*.py) are tracked in the repo, making the git history verbose and PRs more work to review
  • multiple organizations store their configurations in the repo, placing a higher burden on maintainers
  • the build is driven by environment variables and multiple coupled components instead of a single build script, making it harder to debug, test, and refactor

Current understanding for alignment

Steps

  1. Build the gleanerconfig.yaml

    • This config builds on the gleanerconfigPREFIX.yaml file that serves as the base template (see the first sketch after this list)
  2. Source the nabuconfig.yaml, which specifies the configuration and context for how to retrieve triple data and how to store it in minio

  3. Generate the jobs/, ops/, sch/, and repositories/ directories, which contain the Python files that describe when to run each job (see the second sketch after this list)

  4. Generate the workspace.yaml file

    • Some configurations of the workspace.yaml file include a grpc_server key. Others just describe the relative path of the Python file which contains references to all the jobs
    • This might be eliminated or condensed into another config when refactoring
  5. Set up the docker swarm configuration using dagster_setup_docker.sh

    • Create the docker network
    • Create the volume and read in gleanerconfig.yaml, workspace.yaml, and nabuconfig.yaml
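
A minimal sketch of what step 1 could look like in a consolidated Python build program, assuming the base template carries the shared minio/context settings and a separate per-organization file supplies the source list. The sources.yaml file name and the sources key are placeholders, not the repo's actual schema.

```python
# Minimal sketch of step 1: overlay per-source entries onto the base template.
# Requires PyYAML; file names and the "sources" key are assumptions for illustration.
import yaml

def build_gleaner_config(base_path: str, sources_path: str, out_path: str) -> None:
    """Merge the gleanerconfigPREFIX.yaml base template with a source list."""
    with open(base_path) as f:
        config = yaml.safe_load(f)
    with open(sources_path) as f:
        sources = yaml.safe_load(f)

    # Assumed layout: the base template holds shared settings plus an empty
    # (or partial) "sources" list that the per-organization file fills in.
    config.setdefault("sources", [])
    config["sources"].extend(sources.get("sources", []))

    with open(out_path, "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)

if __name__ == "__main__":
    build_gleaner_config("gleanerconfigPREFIX.yaml", "sources.yaml", "gleanerconfig.yaml")
```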

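A sketch of the general shape the generated files in jobs/, ops/, and sch/ appear to take, based on the Dagster concepts involved; the op body, names, and cron string are placeholders rather than the exact generated code. As noted in step 4, workspace.yaml then either lists a grpc_server or the relative path of the Python file that references all of these jobs.

```python
# Illustrative shape of a generated implnet_{ops,jobs,sch}_<source> trio.
# The real generated op shells out to gleaner/nabu for one source; this is a placeholder.
from dagster import RunRequest, job, op, schedule

@op
def implnet_op_example_source(context):
    context.log.info("harvest example_source")

@job
def implnet_job_example_source():
    implnet_op_example_source()

@schedule(cron_schedule="0 6 * * *", job=implnet_job_example_source)
def implnet_sch_example_source(context):
    return RunRequest(run_key=None)
```
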
NOTE: After this point the configuration and docker compose setup have a significant number of env vars, configuration options, and merged configurations that make the following steps a bit unclear

  1. Run the docker compose project
    • Source the .env file and pass its variables into the compose project
    • Ensure all the config files are contained inside the container
    • Check if there is a compose override .yml file and, if so, pass it in (see the sketch after this list)
  2. This docker compose project will manage:
    • traefik for a proxy to access container resources
    • dagster for scheduling crawls. This in turn manages the following:
      • postgres appears to be just for storing internal data
      • dagit appears to be the config for the actual crawl itself (i.e. it uses the GLEANERIO_* env vars)
      • daemon appears to source the base config for dagster
      • code-tasks and code-project seem to be grpc endpoints for interacting with dagster (NOTE: I am a bit unclear on their usage)
    • the s3 provider (minio in this case), gleaner, and nabu for crawling / storing data
  3. Once crawling is scheduled and completed, I am assuming that the resulting triples will be output to the specified s3 bucket
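
A sketch of how step 1 of this list could be driven from Python rather than a shell script, using the standard docker compose CLI flags; the docker-compose.override.yml file name is an assumption, not necessarily what the repo checks for.

```python
# Sketch: point compose at the .env file, check for an optional override file,
# and pass both to "docker compose up". The override file name is an assumption.
import subprocess
from pathlib import Path

def run_compose(project_dir: str = ".") -> None:
    cmd = ["docker", "compose", "--env-file", ".env", "-f", "docker-compose.yml"]
    override = Path(project_dir) / "docker-compose.override.yml"
    if override.exists():
        cmd += ["-f", str(override)]
    cmd += ["up", "-d"]
    subprocess.run(cmd, cwd=project_dir, check=True)
```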

Ideas for improvement

  • Condense code into one central Python build program
    • Use https://github.com/docker/docker-py to control the containers instead of shell scripts (having the whole data pipeline in one language makes it easier to test and debug)
    • By using a CLI library like https://typer.tiangolo.com/ we can validate arguments and fail early, instead of reading them in and failing only after containers are spun up (see the first sketch below)
  • Move all build files to the root of the repo to make it more clear for end users
    • (i.e. makefiles, build/ directory, etc.)
  • Refactor such that individual organizations store their configuration outside the repo.
    • The Python build program should be able to read the configuration files from an arbitrary path that the user specifies
  • Add types and doc strings for easier maintenance long term
  • Use jinja templating instead of writing raw text to the output files (see the second sketch below)
  • Currently, jobs are generated by outputting literal function names templated inside a Python file
    • It is unclear whether this scales to huge datasets; it is probably best to use a generator so we do not need to load everything into the AST
  • Create a clearer documentation website
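
A minimal sketch of the typer + docker-py idea above: validate arguments up front, fail before any containers exist, and only then talk to the Docker daemon. The command name, options, and the minio invocation are illustrative, not a proposed final interface.

```python
# Sketch of a single build CLI: typer validates arguments, docker-py controls containers.
from pathlib import Path

import docker
import typer

app = typer.Typer()

@app.command()
def up(
    config_dir: Path = typer.Option(
        ..., exists=True, file_okay=False,
        help="Directory holding gleanerconfig.yaml, nabuconfig.yaml, and workspace.yaml",
    ),
):
    """Validate arguments first, then start the supporting containers."""
    client = docker.from_env()
    client.ping()  # fail early if the Docker daemon is unreachable
    # Illustrative: start minio only; the real program would start the full stack.
    client.containers.run("minio/minio", "server /data", name="minio", detach=True)
    typer.echo(f"started containers using configs in {config_dir}")

if __name__ == "__main__":
    app()
```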
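
And a sketch of the jinja idea: render the generated job files from a template instead of concatenating raw strings. The template body is illustrative and mirrors the placeholder job shape sketched earlier.

```python
# Sketch: render a generated job file from a Jinja template rather than raw text.
from jinja2 import Environment

TEMPLATE = """\
from dagster import job, op

@op
def implnet_op_{{ source }}(context):
    context.log.info("harvest {{ source }}")

@job
def implnet_job_{{ source }}():
    implnet_op_{{ source }}()
"""

def render_job_file(source: str) -> str:
    return Environment().from_string(TEMPLATE).render(source=source)

if __name__ == "__main__":
    print(render_job_file("example_source"))
```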