
Add data flow diagram for various ETL steps in pipelines #4465

Merged
9 commits merged into main from docs/etl-diagram on Jun 22, 2024

Conversation

AetherUnbound
Collaborator

Fixes

Fixes #4455 by @AetherUnbound

Description

This PR adds a document which visualizes and describes the current and proposed data flows for the project. It also enumerates the ETL steps at each stage to make it easy to understand which operations are happening at which step.

I didn't include any transformations that happen between the client and the frontend because I wasn't aware of them, but please feel free to point them out and I can add them!

Testing Instructions

Check out the document preview and make sure it makes sense.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner June 7, 2024 22:25
@openverse-bot openverse-bot added 🧱 stack: documentation Related to Sphinx documentation 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository labels Jun 7, 2024

github-actions bot commented Jun 7, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4465

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.


@sarayourfriend
Collaborator

Excited to spend more time reviewing this!

A tiny error detail I noticed to begin with: the API pulls from ES and does dead link filtering on the ES hits (meaning it only has ES's data) plus Redis (which is both read from and written to during this process), not the API DB. The API only pulls the API DB results just before passing them to the serializer, after all the search_controller routines have finished working on the ES hits. The difference is that any dead link filtering that happens only has ES and Redis to work off of (currently), and incorporating the API DB would require a new API DB query for each ES query. Right now, there's only ever one single API DB request, and 1 ... n ES queries (limited by the max iteration depth of search_controller._post_process_results).

The change to the graph would be to say that the API pulls from ES, filters and processes the ES results, and then only pulls the final presented subset from the API DB. So ES + Redis -> dead link filtering (filter) -> API DB -> API serializers (transform) -> requester. Then, if the frontend is the requester, the frontend also does some additional processing (like generating attributions), but that is not a necessary (as in, constant) part of the data lifecycle. It might be self-evident, but it's worth keeping in mind that the frontend is not the only API consumer, and transformations made at the frontend level don't reach anyone else. That's fine if they're just for the frontend to use, but not fine if those transformations could also matter to other clients (like the WP media inserter or the Openverse Slack Reaction app).
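To make the ordering concrete, here's a rough Python sketch of that flow; the helper names are hypothetical, not the real search_controller functions:

```python
# Illustrative only: a sketch of the flow described above, not the actual
# search_controller code. The helpers (query_elasticsearch, filter_dead_links,
# fetch_from_api_db, serialize) are hypothetical stand-ins.
def search(query: dict, page_size: int) -> list[dict]:
    hits = query_elasticsearch(query, page_size)   # extract: ES hits only, no API DB data yet
    live_hits = filter_dead_links(hits)            # filter: works off ES hits + Redis cache
    identifiers = [hit["identifier"] for hit in live_hits]
    records = fetch_from_api_db(identifiers)       # the one and only API DB query
    return serialize(records)                      # transform: serializers -> requester
```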

There's a cyclical relationship between the API DB and ES in that one informs the way the other is used: the API DB determines what data is loaded into ES to begin with, and the ES hits determine which results are presented from the API DB, with the API DB records (not the ES hits) being what is actually presented.

Additionally, it would be good to add the fact that Redis is both written to and read from during dead link filtering (which is basically an ETL process itself, I guess; we're at the very least generating data, if not extracting it, though we do save the specific status code, not just a dead/not-dead boolean).
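A minimal sketch of that read/write relationship, assuming a redis-py client (the key names, TTL, and liveness check here are illustrative, not the actual implementation):

```python
# Illustrative only: dead link filtering with Redis acting as both a read cache
# (previously recorded status codes) and a write target (newly observed codes).
import requests


def filter_dead_links(hits: list[dict], redis) -> list[dict]:
    alive = []
    for hit in hits:
        key = f"link-status:{hit['identifier']}"  # hypothetical key scheme
        status = redis.get(key)  # read: status code recorded during a previous search
        if status is None:
            status = requests.head(hit["url"], timeout=2).status_code
            redis.set(key, status, ex=60 * 60 * 24)  # write: save the specific status code
        if int(status) < 400:
            alive.append(hit)
    return alive
```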

Really awesome, though, very excited to dig into this more, and to get a more cohesive view of how each part of our data lifecycle works with the other parts.

Collaborator

@sarayourfriend sarayourfriend left a comment


Okay! This LGTM aside from the issue I commented on before (the slight inaccuracy of leaving Redis out of the API picture, and clarifying where results come from, + adding serialization as a pre-client transformation step). Some of that is being a bit nit-picky, but I think it's worth especially clarifying what data is sent in API responses, including what transformations the API itself may do to them in serialization. No need to outline the specifics of those transformations here. As you've done with the other sections, just mentioning that it happens with a link to the relevant code (serializers) is more than enough and will prevent documentation drift 👍

With documentation drift in mind, though, I suppose we'll need to update this document once the three mentioned projects are complete!

Otherwise, this is great. It does not answer every question I have, but it's also purely descriptive, which I think is good for now. I personally would really like a prescriptive version of this document in the future, a "vision" of the data lifecycle, but I think that will come with time as we all discuss and refine what our goal is with the data lifecycle and each part of it.

Staci and I had an interesting conversation earlier this week about approaching problems in Openverse (generally, not just with data eng. stuff) with a "keep the changes minimal" mindset. While that has some benefits (e.g., it makes some things go faster), it has the effect of us treating the way things are as the way they should be, and eliminating chances for us to talk about what we want (our vision) for these systems in concrete terms. That, in turn, stops us from making moves towards that vision. It also, I believe, introduces significant opportunities for miscommunication and misalignment. It is probably the case that all of us have some ideas of how things should be (and each of us to varying degrees for each area depending on our areas of focus in Openverse), but because we very rarely have that clearly defined, we end up talking past each other, or getting confused about one choice or another. I often feel like we (meaning the whole team as individuals) end up ever so slightly (or sometimes even more) misaligned in our visions, leading to spinning wheels in some places. Here, for example, we see a convergence of three understandings of the intention of each piece of our data lifecycle that were planned independently of each other, with only offhand mentions of the others in planning, rather than true integration. Ideally, these three projects would all work together to move our data engineering and data lifecycle towards our shared vision. But that presupposes the shared vision is defined, and indeed shared between us all!

Anyway, just musing on the fact that this document is a step in that direction, and I'm really glad for that. The conclusion Staci and I came to with respect to how we've planned projects to aim for minimal changes is that if we had a grander vision overall, we could make decisions in project planning that set us up for better situations in the future. Especially with the ingestion server removal project, we're pulling things more in line with treating Airflow as the coordinator of movement in our data infrastructure. Using the EC2 workers as workers is standard practice (I've learned this recently), but something like the ingestion server never was! It was doing Airflow's job, and way more tediously. Moving towards a more Airflow-y way of coordinating our Catalog -> API DB -> ES ETL was not an explicit intention (we instead wanted to simplify data refresh to make it easier to work on, without ever explicitly saying that the reason it would be simpler is that it would be more in line with how Airflow should be used to coordinate these things!), but it is a significant and meaningful realignment and readjustment of how the ETL pipelines are thought about. That's excellent, because it means we are making prescriptive changes towards a vision of "how things should be for Openverse", by saying "Airflow is the place for this".

That's powerful! I'm excited for the possibilities that brings up, especially with respect to other things we can use Airflow for aside from data engineering task management. My view of Airflow has personally expanded by learning more about it and speaking with Staci this week.

TL;DR: This is great, looking forward to more discussions about these things, and I'm very excited to learn more about all of this before and through those discussions with y'all. Always exciting to see what we come up with when we give ourselves permission to think beyond the minimal constraints of a specific single need.

documentation/meta/media_properties/etls.md (outdated review thread, resolved)
Member

@zackkrida zackkrida left a comment


@AetherUnbound this is so awesome! The structure of the document itself is lovely, with the graphic linking off to the descriptive sections. ➕ to Sara's suggestions but I'll approve now to streamline things.

end

A --> DL(Dead link filtering):::ETL
DL --> CL[Client]
Member


    DL --> CL[Client]

Obviously from a computing and data flow standpoint this name makes sense, but we don't refer to the API as a "client" anywhere else, really. Is there a way we could expand this (maybe a parenthetical including "Django API") so that others can "connect the dots" between this graphic and our stack as understood by developers?

Collaborator Author


Ah, see I meant for "Client" here to be someone using the API (which could be the Gutenberg integration, a WP plugin, etc). I'll try to clarify that!

Collaborator Author


I actually don't have the Django API on here anywhere, and I think I should! I'll expand on it a bit more 😄

@AetherUnbound
Collaborator Author

> Especially with the ingestion server removal project, we're pulling things more in line with treating Airflow as the coordinator of movement in our data infrastructure. Using the EC2 workers as workers is standard practice (I've learned this recently), but something like the ingestion server never was! It was doing Airflow's job, and way more tediously. Moving towards a more Airflow-y way of coordinating our Catalog -> API DB -> ES ETL was not an explicit intention (we instead wanted to simplify data refresh to make it easier to work on, without ever explicitly saying that the reason it would be simpler is that it would be more in line with how Airflow should be used to coordinate these things!), but it is a significant and meaningful realignment and readjustment of how the ETL pipelines are thought about. That's excellent, because it means we are making prescriptive changes towards a vision of "how things should be for Openverse", by saying "Airflow is the place for this".

It's interesting to hear you say this. I feel like I've been trying to communicate this point and push for exactly this way of thinking about Airflow for a while 🙂 It was part of the reason I wanted to see if we could use the ECSRunTaskOperator or something similar for the actual reindexing step rather than keep the Falcon server/indexer workers that we had. Airflow is built for this kind of coordination and centralized management, and it's definitely a vision I've been working to communicate.
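For illustration, a minimal, hypothetical sketch of what that kind of coordination could look like, assuming the Amazon provider's EcsRunTaskOperator (the DAG id, cluster, task definition, and container names are placeholders, not our real configuration):

```python
# Hypothetical sketch: Airflow directly coordinating a containerised reindexing
# step. All resource names below are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="example_reindex",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    reindex = EcsRunTaskOperator(
        task_id="run_reindex_worker",
        cluster="example-data-refresh-cluster",
        task_definition="example-reindexer",
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "reindexer", "command": ["python", "reindex.py"]}
            ]
        },
    )
```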

What I'm hearing, though, is that it might have been useful to make that intention explicit in some form of policy or "vision" document. That might have made it easier to communicate "this is what we want to use Airflow for", rather than me saying it over and over on the occasions where an opportunity to leverage it cropped up organically. That's helpful feedback, and hopefully that can also be part of the things we make explicit:

  • The catalog database is the data warehouse where we store as much information as we can/is reasonable.
  • Airflow is the orchestrator for our data management and automation workflows, particularly those that interact with multiple systems or aspects of our stack.

@zackkrida
Member

@AetherUnbound the changes look great 👍

@sarayourfriend
Collaborator

sarayourfriend commented Jun 19, 2024

> Airflow is the orchestrator for our data management and automation workflows, particularly those that interact with multiple systems or aspects of our stack.

I think the specific thing that didn't connect for me is that Airflow isn't even a data engineering tool, per se. It's just a task scheduler and you can do anything with it. The data engineering happens with other tools, like Postgres or Python scripts. It didn't click for me until I read about data engineering tools generally and realised that Airflow doesn't have much to do with it aside from being a task coordinator.

With respect to the ingestion server removal, I did not understand the difference. I still don't think ECS is a good option (FARGATE's pricing structure is actively hostile to resource cost/utilisation optimisations and wildly expensive), but AWS Batch with spot instances would be a tremendous change and I'd support going in that direction.

Yeah, for whatever reason it never clicked that Airflow is just a task coordinator 😕. Sorry! I haven't been that involved in data things until the ingestion server removal project, so I'm not sure I encountered that idea before, and when reviewing that IP I did not understand the idea of removing the API from the indexer workers. We can still do that if we want, and swap out the ASG for AWS Batch, even without spot instances, and it would be an improvement and a move towards a more Airflow-y model for sure. I think I was also specifically hung up on ECS FARGATE, which as I said above I think is a particularly bad solution, but it's possible to structure the containerised workload such that the execution environment doesn't matter, so swapping ECS for Kubernetes or AWS Batch becomes more or less trivial. I apologise for not understanding that; something just didn't click until I learned more about data engineering as a whole and Airflow in particular over the past two weeks.
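As a rough illustration of that last point, an execution-environment-agnostic entrypoint could be as simple as a script that takes everything it needs from environment variables, so the same image can be launched by ECS, AWS Batch, or Kubernetes without code changes (the variable names here are made up for the example):

```python
# Illustrative only: a container entrypoint that is indifferent to where it runs.
import os


def main() -> None:
    es_host = os.environ["ELASTICSEARCH_HOST"]
    index_name = os.environ["TARGET_INDEX"]
    start_id, end_id = int(os.environ["START_ID"]), int(os.environ["END_ID"])
    print(f"Reindexing {index_name} records {start_id}-{end_id} against {es_host}")
    # ...actual reindexing work would go here...


if __name__ == "__main__":
    main()
```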

A synchronous conversation about these things would probably have helped a lot! I'm realising there are a lot more important concerns tied up between data engineering and infrastructure than I knew before, which is why I am spending a large amount of time outside of work learning about these things more, as well.

Co-authored-by: sarayourfriend <[email protected]>
Co-authored-by: zack <[email protected]>
@openverse-bot
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days¹, this PR was ready for review 9 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)².

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Collaborator

@sarayourfriend sarayourfriend left a comment


LGTM!

Contributor

@stacimc stacimc left a comment


This looks great and is so nice to have! The only thing that struck me as potentially confusing is placing the temp table within the ingestion server visually, when this is really a temp table in the API database that later gets promoted. Not sure how best to reflect that via the diagram but maybe we could add a section explaining what the temp table is, and have it be clickable like the other elements? Not blocking.

This is fantastic @AetherUnbound :)

@AetherUnbound
Collaborator Author

> The only thing that struck me as potentially confusing is placing the temp table within the ingestion server visually, when this is really a temp table in the API database that later gets promoted. Not sure how best to reflect that via the diagram but maybe we could add a section explaining what the temp table is, and have it be clickable like the other elements?

This was definitely a tough thing to convey 😅. We already had enough connections to the API database that I wanted to try to have it be separate. I'll add a note for it at the bottom of the diagram, thanks!

@sarayourfriend
Collaborator

Not sure where or how to document this, but I just remembered the provider occurrence tallying we aggregate on a weekly basis into Redis:

def count_provider_occurrences(results: list[dict], index: str) -> None:

It doesn't fit into any current ETL needs (like dead link detection does), but it strikes me that it could be useful to have a place where we've described the data we're generating overall, regardless of whether it's currently part of, or intended to be part of, an entire ETL pipeline. Provider occurrence rate might be a useful relevancy metric, for example, and it'd be a shame if we forgot about it because we aren't referring to it regularly (that I know of).
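For illustration only, a tally like that might look roughly like the following; the function name, key scheme, and week bucketing are invented for this sketch and are not the actual implementation behind count_provider_occurrences:

```python
# Illustrative only: weekly provider-occurrence tallying in Redis.
from datetime import datetime, timezone


def tally_provider_occurrences(results: list[dict], index: str, redis) -> None:
    week = datetime.now(timezone.utc).strftime("%Y-%W")  # e.g. "2024-24"
    for result in results:
        provider = result.get("provider", "unknown")
        # One hash per index and week, incremented per provider seen in the results.
        redis.hincrby(f"provider_occurrences:{index}:{week}", provider, 1)
```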

I'm just mentioning it in this PR because of its tangential relevance to the subject, not because I think it should be included in this documentation. I needed to write it down somewhere and this was the easiest and most sensible place in the immediate term.

@AetherUnbound AetherUnbound merged commit c0b6646 into main Jun 22, 2024
40 checks passed
@AetherUnbound AetherUnbound deleted the docs/etl-diagram branch June 22, 2024 01:05
Successfully merging this pull request may close these issues:

Document current & desired ETL steps and data flow