Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a separate service to generate audio waveforms #731

Closed
1 task
sarayourfriend opened this issue Feb 14, 2022 · 9 comments
Closed
1 task

Use a separate service to generate audio waveforms #731

sarayourfriend opened this issue Feb 14, 2022 · 9 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: api Related to the Django API 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments

@sarayourfriend
Copy link
Collaborator

Problem

Currently the audio waveforms are created upon request in the API. This has two effects:

  1. The API must either cache the waveforms in it's own table or else recreate them upon every request
  2. The API (which is open to the world) has an extra binary installed, audiowaveform which could present an unnecessary vulnerability

Description

From @AetherUnbound in a private chat:

I would like to see if it would be possible to generate them during ingestion. If that proves feasible, it has the added benefit of removing the audiowaveform dependency from the API. We can start populating the waveforms now in the catalog, and then swap over to the column in the API once we’ve backfilled everything.

Alternatives

Continue just creating the waveforms in the API and accept the two issues in the description.

Implementation

  • 🙋 I would be interested in implementing this feature.
@sarayourfriend sarayourfriend added 🟩 priority: low Low priority and doesn't need to be rushed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🐍 tech: python Involves Python 🔧 tech: airflow Involves Apache Airflow labels Feb 14, 2022
@dhruvkb
Copy link
Member

dhruvkb commented Feb 15, 2022

I imagine it'll be a very interesting exercise to write a thin API wrapper (using something fast and close to the metal like Go) over the BBC audiowaveform library. Given a URL it can return (and possibly also cache) the waveform. This allows us to make it work similar to the imageproxy thumbnail service in the API without tightly coupling it to the API or the ingestion server.

@sarayourfriend
Copy link
Collaborator Author

Oh interesting idea Dhruv. Do you mean basically writing a wrapper around audiowaveform that memoizes the calls to it? I could even see this being some kind of general purpose CLI utility that creates a unix socket to make the requests against or something. Though if you used something with a fast boot time (idk if Go qualifies for that, I can't find any research online of how fast Go binaries boot vs Rust binaries; but based on what I'm reading online Go is fast but mostly fast at compilation and Rust will have faster execution speeds, at the cost of compilation of course) then you could just call the binary directly each time and have it establish a configured connection with Redis or the like. A long lived-daemon might shave some time there though.

However, I have to say that might be over-complicating it when you could just memoize the calls to the audiowaveform binary in Python against a long-lived Redis cache? py-memoize for example can accommodate a Redis or memcached backend.

Then again, all of that might be over-complicating it when storing it in the database could be the simplest solution and provide sufficient performance anyway.

My vote would probably be to go the simple database route, measure the performance, and if we see some noticeable peaks in the 95P+ range then look into improving it.

That being said, there are probably other parts of our stack, especially in the API, that could use that same kind of analysis and I'm eager for us to get some monitoring in place that will allow us to do that.

@obulat
Copy link
Contributor

obulat commented Mar 18, 2022

Can we close this issue now that #1551 has been merged (and deployed 🚀), or is this a more broad issue, @sarayourfriend ?

@obulat obulat removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 18, 2022
@sarayourfriend
Copy link
Collaborator Author

I think this is still an issue that needs to be directly addressed. Relying on manually running a django command to "warm the cache" of waveforms, as it were, is not a sustainable (or desirable) solution in the long term.

@zackkrida
Copy link
Member

zackkrida commented May 16, 2022

As discussed recently, I'm leaning towards the following:

I think we will want to codify a pattern for things like waveforms, thumbnails, etc.—anything that isn’t a necessary piece of data to provide in the API, but that we’d like to display in the front-end, should be generated dynamically on read, and cached aggressively. If it’s not ‘data’ it shouldn’t be in the catalog, but a reference to it could be served by the API (for example thumbnail_url , waveform_url, etc.).

Our dataset is so large that I don’t think running computations against media during ingestion is going to work.

So basically, I don't personally think we should warm the cache at all. To revisit the original problems:

  1. The API must either cache the waveforms in it's own table or else recreate them upon every request

With my discussed approach, we would remove any waveform data from the DB and instead treat them like we do image thumbnails, where the API response includes a reference to the waveform data

  1. The API (which is open to the world) has an extra binary installed, audiowaveform which could present an unnecessary vulnerability

This waveform generator would become a standalone microservice.

I'm open to closing this issue, I don't think it explicitly relates to the catalog anymore.

@sarayourfriend
Copy link
Collaborator Author

Let's move the issue to either openverse or openverse-api, wherever we thing it'd make the most sense to record the need for an entirely new service.

@AetherUnbound AetherUnbound transferred this issue from WordPress/openverse-catalog May 20, 2022
@AetherUnbound
Copy link
Collaborator

Moved the issue to the API, since that's where the thumbnail service currently resides.

@AetherUnbound AetherUnbound changed the title Generate waveform for audio files upon ingestion Use a separate service to generate audio waveforms Jun 24, 2022
@krysal krysal self-assigned this Jun 24, 2022
@zackkrida
Copy link
Member

@krysal I'm going to move this out of the todo column, I don't think it's a realistic goal for the next two weeks and their might be some infra considerations to deal with first.

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
dhruvkb pushed a commit that referenced this issue Apr 14, 2023
@AetherUnbound AetherUnbound added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels May 15, 2023
@sarayourfriend
Copy link
Collaborator Author

As part of the #2843 discussion, we decided not to rely on microservices for these things, instead to use async Python to ensure waveform generation doesn't block workers.

There's still some argument to be made that waveform generation is CPU intensive, in a way that could interrupt the overall stability of a single task, but that is not anything we've seen happen thus far.

I'm closing this issue as won't do for now. If we see waveforms become a performance issue, we can explore ways of improving it, whether via a separate service or some other approach. In the meantime, however, it isn't worth considering this as work we need to think about.

@sarayourfriend sarayourfriend closed this as not planned Won't fix, can't repro, duplicate, stale Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: api Related to the Django API 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python
Projects
Archived in project
Development

No branches or pull requests

6 participants