Use a separate service to generate audio waveforms #731

sarayourfriend · 2022-02-14T22:47:52Z

Problem

Currently the audio waveforms are created upon request in the API. This has two effects:

The API must either cache the waveforms in its own table or else recreate them upon every request
The API (which is open to the world) has an extra binary installed, audiowaveform, which could present an unnecessary vulnerability

Description

From @AetherUnbound in a private chat:

I would like to see if it would be possible to generate them during ingestion. If that proves feasible, it has the added benefit of removing the audiowaveform dependency from the API. We can start populating the waveforms now in the catalog, and then swap over to the column in the API once we’ve backfilled everything.

Alternatives

Continue just creating the waveforms in the API and accept the two issues in the description.

Implementation

🙋 I would be interested in implementing this feature.

dhruvkb · 2022-02-15T11:13:47Z

I imagine it'll be a very interesting exercise to write a thin API wrapper (using something fast and close to the metal like Go) over the BBC audiowaveform library. Given a URL it can return (and possibly also cache) the waveform. This allows us to make it work similar to the imageproxy thumbnail service in the API without tightly coupling it to the API or the ingestion server.
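
A minimal sketch of the thin wrapper idea, written in Python rather than Go so that it matches the rest of the API code. It assumes a hypothetical `generate` callable that does the real work (for example by shelling out to audiowaveform); the `make_waveform_app` name, the `url` query parameter, and the JSON shape are all illustrative, not an existing service.

```python
import json
from typing import Callable, Iterable, List
from urllib.parse import parse_qs


def make_waveform_app(generate: Callable[[str], List[float]]):
    """Build a WSGI app that returns waveform points for `?url=<audio url>`.

    `generate` is a hypothetical callable supplied by the caller; it is the
    only piece that needs to know about audiowaveform.
    """

    def respond(start_response, status: str, payload: dict) -> Iterable[bytes]:
        body = json.dumps(payload).encode("utf-8")
        headers = [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]

    def app(environ, start_response) -> Iterable[bytes]:
        # parse_qs percent-decodes values and maps each key to a list.
        params = parse_qs(environ.get("QUERY_STRING", ""))
        urls = params.get("url")
        if not urls:
            return respond(
                start_response,
                "400 Bad Request",
                {"error": "missing 'url' query parameter"},
            )
        audio_url = urls[0]
        return respond(
            start_response,
            "200 OK",
            {"url": audio_url, "points": generate(audio_url)},
        )

    return app
```

For local experimentation the app could be served with the standard library's `wsgiref.simple_server.make_server`. Caching is deliberately left out here so the wrapper stays decoupled from any particular cache backend.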

sarayourfriend · 2022-02-15T11:53:30Z

Oh, interesting idea, Dhruv. Do you mean basically writing a wrapper around audiowaveform that memoizes the calls to it? I could even see this being some kind of general-purpose CLI utility that creates a Unix socket to make the requests against or something. Though if you used something with a fast boot time (idk if Go qualifies for that, I can't find any research online on how fast Go binaries boot vs Rust binaries; but based on what I'm reading, Go is fast but mostly fast at compilation and Rust will have faster execution speeds, at the cost of compilation of course) then you could just call the binary directly each time and have it establish a configured connection with Redis or the like. A long-lived daemon might shave some time there though.

However, I have to say that might be over-complicating it when you could just memoize the calls to the audiowaveform binary in Python against a long-lived Redis cache? py-memoize for example can accommodate a Redis or memcached backend.
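
A minimal sketch of memoizing waveform generation against a key-value cache, under stated assumptions: the cache only needs `get` and `set` (the subset a Redis client would provide), `InMemoryCache` is a stand-in so the example runs without Redis, and `generate` is a hypothetical callable wrapping the audiowaveform binary. This is hand-rolled for illustration and does not use py-memoize's API.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional


class InMemoryCache:
    """Stand-in for a Redis-like client; only get/set are used below."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> bool:
        self._data[key] = value
        return True


def memoized_waveform(
    audio_url: str,
    cache,
    generate: Callable[[str], List[float]],
) -> List[float]:
    """Return cached waveform points, generating them only on a cache miss."""
    # Hash the URL so the key has a fixed length and no awkward characters.
    key = "waveform:" + hashlib.sha256(audio_url.encode("utf-8")).hexdigest()

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    points = generate(audio_url)
    cache.set(key, json.dumps(points).encode("utf-8"))
    return points
```

With a long-lived cache, `generate` runs once per audio URL; any expiry policy would be configured on the cache itself and is not shown here.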

Then again, all of that might be over-complicating it when storing it in the database could be the simplest solution and provide sufficient performance anyway.

My vote would probably be to go the simple database route, measure the performance, and if we see some noticeable peaks in the p95+ range, then look into improving it.

That being said, there are probably other parts of our stack, especially in the API, that could use that same kind of analysis and I'm eager for us to get some monitoring in place that will allow us to do that.

obulat · 2022-03-18T05:11:33Z

Can we close this issue now that #1551 has been merged (and deployed 🚀), or is this a broader issue, @sarayourfriend?

sarayourfriend · 2022-03-18T13:10:36Z

I think this is still an issue that needs to be directly addressed. Relying on manually running a Django command to "warm the cache" of waveforms, as it were, is not a sustainable (or desirable) solution in the long term.

zackkrida · 2022-05-16T20:40:34Z

As discussed recently, I'm leaning towards the following:

I think we will want to codify a pattern for things like waveforms, thumbnails, etc. Anything that isn’t a necessary piece of data to provide in the API, but that we’d like to display in the front-end, should be generated dynamically on read and cached aggressively. If it’s not ‘data’ it shouldn’t be in the catalog, but a reference to it could be served by the API (for example thumbnail_url, waveform_url, etc.).

Our dataset is so large that I don’t think running computations against media during ingestion is going to work.

So basically, I don't personally think we should warm the cache at all. To revisit the original problems:

The API must either cache the waveforms in its own table or else recreate them upon every request

With the approach I've described, we would remove any waveform data from the DB and instead treat waveforms like we do image thumbnails, where the API response includes a reference to the waveform data.

The API (which is open to the world) has an extra binary installed, audiowaveform, which could present an unnecessary vulnerability

This waveform generator would become a standalone microservice.
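
A rough sketch of what "a reference instead of the data" could look like in a response payload. The `serialize_audio` function, the record fields, and the URL paths are illustrative assumptions, not the API's actual serializer or routes.

```python
from typing import Any, Dict


def serialize_audio(record: Dict[str, Any], api_base: str) -> Dict[str, Any]:
    """Build a response body that links to derived assets instead of embedding them.

    `record` is a hypothetical catalog row; any stored waveform points in it
    are intentionally ignored so that the response only carries references.
    """
    base = api_base.rstrip("/")
    identifier = record["identifier"]
    return {
        "id": identifier,
        "title": record.get("title"),
        "url": record["url"],
        # Hypothetical paths: both assets would be generated on read and cached.
        "thumbnail_url": f"{base}/v1/audio/{identifier}/thumb/",
        "waveform_url": f"{base}/v1/audio/{identifier}/waveform/",
    }
```

The client then fetches `waveform_url` only when it needs to draw the waveform, which keeps the derived data out of both the catalog and the main response.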

I'm open to closing this issue, I don't think it explicitly relates to the catalog anymore.

sarayourfriend · 2022-05-19T06:29:53Z

Let's move the issue to either openverse or openverse-api, wherever we think it'd make the most sense to record the need for an entirely new service.

AetherUnbound · 2022-05-24T00:28:29Z

Moved the issue to the API, since that's where the thumbnail service currently resides.

zackkrida · 2022-08-09T21:30:33Z

@krysal I'm going to move this out of the todo column. I don't think it's a realistic goal for the next two weeks, and there might be some infra considerations to deal with first.

sarayourfriend · 2023-11-22T23:04:14Z

As part of the #2843 discussion, we decided not to rely on microservices for these things, and instead to use async Python to ensure waveform generation doesn't block workers.
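
A minimal sketch of running an external waveform command without blocking the event loop, using `asyncio.create_subprocess_exec`. The `run_waveform_command` name is hypothetical, the actual audiowaveform command line is left to the caller, and the command is assumed to print JSON to stdout; this is not necessarily how the API implements it.

```python
import asyncio
import json
import sys
from typing import Any, List


async def run_waveform_command(command: List[str]) -> Any:
    """Run `command` as a subprocess and parse its stdout as JSON.

    While the child process runs, the event loop stays free to serve other
    requests; only this coroutine waits on the result.
    """
    process = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    if process.returncode != 0:
        message = stderr.decode("utf-8", errors="replace")
        raise RuntimeError(f"waveform command failed: {message}")
    return json.loads(stdout)


if __name__ == "__main__":
    # Demo command: the Python interpreter stands in for the real binary.
    demo = [sys.executable, "-c", "print('{\"data\": [1, 2, 3]}')"]
    print(asyncio.run(run_waveform_command(demo)))
```

This moves the CPU-heavy work into a separate process, but that process still competes for CPU on the same machine, so it avoids blocking the worker without removing the load.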

There's still some argument to be made that waveform generation is CPU intensive, in a way that could interrupt the overall stability of a single task, but that is not anything we've seen happen thus far.

I'm closing this issue as won't do for now. If we see waveforms become a performance issue, we can explore ways of improving it, whether via a separate service or some other approach. In the meantime, however, it isn't worth considering this as work we need to think about.

AetherUnbound mentioned this issue Feb 22, 2022

Audio waveform cache-warming Django command WordPress/openverse-api#529

Closed

1 task

obulat removed the 🚦 status: awaiting triage label Mar 18, 2022

AetherUnbound mentioned this issue Apr 17, 2023

Explore calculating bitrate during ingestion #1589

Open

1 task

AetherUnbound transferred this issue from WordPress/openverse-catalog May 20, 2022

krysal mentioned this issue May 24, 2022

Switch audio media status from 'beta' to 'supported' WordPress/openverse-frontend#1315

Closed

9 tasks

AetherUnbound changed the title ~~Generate waveform for audio files upon ingestion~~ Use a separate service to generate audio waveforms Jun 24, 2022

krysal self-assigned this Jun 24, 2022

obulat added the stack: backend label Feb 22, 2023

obulat transferred this issue from WordPress/openverse-api Feb 22, 2023

dhruvkb pushed a commit that referenced this issue Apr 14, 2023

Bump Airflow version to 2.3.4 (#731)

a329be2

AetherUnbound added the 🧱 stack: api label and removed the 🧱 stack: backend label May 15, 2023

zackkrida unassigned krysal Jun 20, 2023

sarayourfriend closed this as not planned Nov 22, 2023
