Working end-to-end transcription integrations. #1774


Merged · 10 commits · Apr 22, 2025

Conversation

boazsender (Contributor):

Fixes from webhook debugging work. Follow this test plan to verify:

Local testing guide

The following instructions will let you test the scraping half of the implementation. The webhook half of the implementation needs to be deployed to the development project for an end-to-end test.

To perform this test you will need:

  • An AssemblyAI API key in your local environment (Ask Nathan Sanders or Matt Victor for this)
  • IAM permissions for the maple dev project in the Google Cloud Console (Ask Matt King for this)
  • A logged in and working firebase-admin session
  • A local copy of github.com/codeforboston/maple with installed deps

Instructions

  1. Set up an ngrok tunnel for the local copy of the transcription webhook
    1. Get an account at https://dashboard.ngrok.com
    2. Connect your ngrok account: ngrok config add-authtoken your-auth-token
    3. Start the local function emulator with yarn run dev:functions
    4. Open an ngrok tunnel: ngrok http http://localhost:5001
  2. Find a hearing that has a video within the last 8 days. In this example we’ll use hearing-5091.
  3. Modify functions/src/events/scrapeEvents.ts to create a local testing state (a sketch of these edits appears after these instructions)
    1. Hardcode const EventId = 5091 into the beginning of getEvent
    2. Change the webhook endpoint in the assembly api call to the ngrok endpoint: https://put-your-ngrok-tunnel-id-here.ngrok-free.app/demo-dtp/us-central1/scrapeHearings-0
  4. Run the firebase admin console on the dev project with yarn firebase-admin -e dev console
    1. From inside the admin repl, run the following commands with the id of your hearing (using 5091 in this example):
      1. const resp = await db.collection("events").doc("hearing-5091").get()
      2. const data = resp.data()
      3. JSON.stringify(data)
    2. Copy the output of JSON.stringify(data) without the surrounding quotes (so you have a JS object) and save it somewhere for the next step
    3. Exit the repl with .exit
  5. Start your local firebase emulator with yarn dev:functions
    1. While that emulator is running, open a second shell
      1. Connect to the local firebase admin with yarn firebase-admin -e local console
      2. Load the data into your local instance with await db.collection('events').doc('hearing-5091').set(data) where data is the js object on your clipboard. It should look something like await db.collection('events').doc('hearing-5091').set({"startsAt":{"_seconds":1741881600,"_nanoseconds":0},"id":"hearing-5091","type":"hearing","content":{"EventDate":"2025-03-13T12:00:00","HearingAgendas":[{"DocumentsInAgenda":[],"Topic":"Annual Hearing on a Potential Modification of the Health Care Cost Growth Benchmark","StartTime":"2025-03-13T12:00:00","EndTime":"2025-03-13T15:00:00"}],"RescheduledHearing":null,"StartTime":"2025-03-13T12:00:00","EventId":5091,"HearingHost":{"Details":"http://malegislature.gov/api/GeneralCourts/194/Committees/J24","CommitteeCode":"J24","GeneralCourtNumber":194},"Name":"Joint Committee on Health Care Financing","Description":"Joint Committee on Health Care Financing & Health Policy Commission Public Hearing on a Potential Modification of the CY 2026 Health Care Cost Growth Benchmark","Location":{"AddressLine2":null,"AddressLine1":"24 Beacon Street","State":"MA","ZipCode":"02133","City":"Boston","LocationName":"Gardner Auditorium"},"Status":"Completed"},"fetchedAt":{"_seconds":1741967707,"_nanoseconds":728000000}})
      3. Exit the repl with .exit
    2. In the local emulator UI, edit hearing-5091 (http://localhost:3010/firestore/data/events/hearing-5091)
      1. Delete startsAt and fetchedAt, and replace them with timestamp-type fields of the same property names, valued at the current time.
    3. Back in the yarn dev:functions emulator, run scrapeHearings()
    4. Check the local hearing-5091 document with
      1. const resp = await db.collection("events").doc("hearing-5091").get()
      2. resp.data()
    5. Look for the following props on the hearing-5091 doc:
      1. videoURL: location of the video hosted by the MA Legislature
      2. videoFetchedAt: timestamp of when the video url was fetched
      3. videoAssemblyId: id of the transcript in the Assembly SaaS API
  6. Wait for the assembly job to finish (5-10 minutes usually)
  7. Check the transcriptions collection and look for a new transcript for hearing-5091:
    1. https://console.cloud.google.com/firestore/databases/-default-/data/panel/transcriptions/17c91397-c023-4f28-a621-4cef45c70749?authuser=1&project=digital-testimony-dev
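
For reference, here is a sketch of the step 3 modifications to functions/src/events/scrapeEvents.ts. The getEvent function and the assembly.transcripts.submit call are from this PR; the surrounding structure is paraphrased and may not match the actual file exactly:

// functions/src/events/scrapeEvents.ts - local testing state only, do not commit
async function getEvent(/* ... */) {
  const EventId = 5091 // step 3.1: hardcode the test hearing

  // ... existing scraping logic ...

  // step 3.2: point the AssemblyAI webhook at your ngrok tunnel
  const transcript = await assembly.transcripts.submit({
    audio_url: maybeVideoURL,
    webhook_url:
      "https://put-your-ngrok-tunnel-id-here.ngrok-free.app/demo-dtp/us-central1/scrapeHearings-0"
  })

  // ...
}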


Mephistic (Collaborator) left a comment:

There's still an open question around what transcription data we should actually store - that doesn't necessarily have to be part of this PR, but we should get at least a tentative answer/plan on that before we run any backfill

@@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
const hearing = Hearing.check(eventData)
Mephistic (Collaborator):

Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, eventData would be undefined - does Hearing.check blow up (and prevent us from returning the non-transcription related event data)?

boazsender (Contributor, Author):

I'm not sure. I think I was drawing inspiration from this prior line:

const content = HearingContent.check(await api.getHearing(EventId))

How would you want to see this done differently?

I can also try some things out if nothing comes to mind.

Mephistic (Collaborator):

I think the chief value of that HearingContent.check there is about guarding against changes to the external Mass Legislature API (i.e. if the data they return deviates too much from what we expect, don't risk polluting our database - instead, throw an error to force a dev to investigate).

For our purposes here:

If eventInDb.exists() is false (or equivalently AFAIK, if eventData is undefined), it should mean that we are scraping the present hearing for the first time. This means that we have definitely not scraped the hearing's video yet and it's valid to try scraping videos for it. The simplest, most robust solution is to just add this case to the logic of when to scrape video:

Currently, we scrape video only if:

  • there is a pre-existing hearing with this EventId in our DB that does not have a videoUrl set
  • AND
  • the hearing in question is within our scraping time window

We should instead scrape video only if:

  • EITHER (we have not scraped this hearing before) OR (there is a pre-existing hearing in our DB with this EventId that does not have a videoUrl set)
  • AND
  • the hearing in question is within our scraping time window
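
In code, that revised condition might look something like this (a hedged sketch; the names come from this discussion and the real shapes may differ):

const notScrapedBefore = !eventInDb.exists
const existsWithoutVideo = eventInDb.exists && !eventData?.videoURL

// startsAt: the hearing's start time, taken from the freshly scraped content
const shouldScrapeVideo =
  (notScrapedBefore || existsWithoutVideo) && withinCutoff(startsAt)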

boazsender (Contributor, Author):

Updating this logic. Leaving the .check in there since that was there prior, and I think is still needed for the first run on a hearing. Do you agree?

boazsender (Contributor, Author):

wait nevermind, I conflated two different .checks. Will leave the first and remove the second.

@@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
maybeVideoURL = firstVideoSource.src

transcript = await assembly.transcripts.submit({
Mephistic (Collaborator):

Now that we've worked out the kinks here, could we refactor getEvent a bit?

I feel like there are at least two functions here that could be broken out to make this more readable:

  • shouldScrapeVideo (handles fetching any previous event + checking for existingVideo + the date cutoff logic, returns a boolean yes/no)
  • getHearingVideoUrl (does the JSDOM wrangling, returns a string | undefined with the videoUrl)
  • And maybe submitTranscription (takes a videoUrl, sends the AssemblyAI request + saves the webhookAuth in the DB, returns transcriptId)

Which would leave the new code as something like:

if (shouldScrapeVideo(EventId)) {
  const maybeVideoUrl = getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
      const transcriptId = await submitTranscription(maybeVideoUrl)
      // add video/transcription data to Event
  }
}

boazsender (Contributor, Author):

Sure thing, I can take a pass at this.

})

await db
Mephistic (Collaborator):

Ideally, I think we shouldn't actually save the hearing event to Firestore in getEvent - the abstract class already handles that for us. The webhookAuth is in a subcollection (which wouldn't be caught by the abstract class) and is the exception to that rule - this should be fine since Firestore lets you write to subcollections of documents that don't exist yet.

boazsender (Contributor, Author):

What would you like me to do here differently?

Mephistic (Collaborator):

When we scrape video, we can simply conditionally add the relevant video fields to the return object of getEvent (those fields being videoURL, videoFetchedAt, and videoAssemblyId). So long as they're present on the returned object, the code in the abstract class will handle the work of saving the events to the DB. See the line writer.set(db.doc(`/events/${event.id}`), event, { merge: true }) in run() at the top of this file.

As it stands, I think this would save the event twice - first in getEvent() and then again in run().
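
A rough sketch of the suggested shape (hedged; videoURL/videoFetchedAt/videoAssemblyId are the fields named in the test plan, the rest is illustrative):

// inside getEvent(), after scraping
const event = { /* ...the usual scraped hearing fields... */ } as Hearing

if (maybeVideoUrl && transcript) {
  event.videoURL = maybeVideoUrl
  event.videoFetchedAt = Timestamp.now() // or new Date(); see the Timestamp discussion below
  event.videoAssemblyId = transcript.id
}

return event // run() persists it via writer.set(..., { merge: true })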

boazsender (Contributor, Author):

Ok I think I understand now, let me know if I got it right in my next commit

if (tokenInDbData) {
return hashedToken === tokenInDbData.videoAssemblyWebhookToken
}
return false
}
)
console.log("authenticatedEventsInDb", authenticatedEventsInDb)

if (authenticatedEventsInDb) {
try {
await db
Mephistic (Collaborator):

We should start thinking about what parts of the response we actually want to save here - there are a lot of fields in the example response you posted in Slack, but the only ones that look potentially relevant are:

  • id
  • text
  • utterances
  • words
  • audio_url (though we already have it on the hearing event, it would still be helpful here)
  • audio_duration (not sure what we'd need it for though, but it's conceivably useful)
  • confidence

I'm most interested in text/utterances/words - since those will take up >99% of the data size (and words likely takes up >90%). We should chat with Matt V and design about the desired functionality, but I don't think we need all three. I goofed around with some of the example data: it looks like utterances gives us mostly sensible, mostly accurate divisions of the content, and text is good to have as a fallback.

IMO words is likely unnecessary - I believe it would only be useful if we either:

  • Find ourselves dissatisfied with the breakpoints/speaker split inherent to utterances and want to devise our own
  • Want to visually identify specific words that Assembly flagged as low confidence

Want to get @mvictor55's input on the desired functionality here before dropping the axe though (and @mertbagt's take on what we actually need for the front-end). If we do cut words, we should also probably cut the words array in utterances.

boazsender (Contributor, Author):

Yes, would love to know what you think @mvictor55 and @mertbagt. I can proceed accordingly based on what you three align on.

@@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => {

authenticatedEventsInDb.forEach(async d => {
Mephistic (Collaborator):

Just double-checking - is it possible that the firebase function will exit before the async function in the forEach completes?

If so, it may be worth switching to a transaction for the writes here - something like

const batch = db.batch()

batch.set(
  db.collection("transcriptions").doc(transcript.id),
  { _timestamp: new Date(), ...transcript }
)

authenticatedEventsInDb.forEach(doc => {
  batch.update(doc.ref, {["x-maple-webhook"]: null})
})

await batch.commit()

boazsender (Contributor, Author):

Smart, will do.

return false
}
if (!eventData.videoFetchedAt) {
return withinCutoff(new Date(eventData.StartTime))
Mephistic (Collaborator):

Why did this change from Hearing.check(eventData).startsAt.toDate()? I don't think eventData has a StartTime field.

boazsender (Contributor, Author):

I think I misunderstood an earlier review comment. I'll change this in the next commit.

if (req.body.status === "completed") {
const transcript = await assembly.transcripts.get(req.body.transcript_id)
if (transcript && transcript.webhook_auth) {
const maybeEventInDb = await db
.collection("events")
.where("videoAssemblyId", "==", transcript.id)
.get()

if (maybeEventInDb.docs.length) {
const authenticatedEventsInDb = maybeEventInDb.docs.filter(
Mephistic (Collaborator):

Just double-checking here - does this successfully filter out requests where the tokens don't match?

Now that I think about it, I would expect the async filter function to return a Promise that resolves to a boolean instead of a boolean - and since the Promise itself would always be truthy, this check would always return true and never filter anything

boazsender (Contributor, Author):

I believe so based on return hashedToken === tokenInDbData.videoAssemblyWebhookToken

Mephistic (Collaborator):

As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833:

const test = [1, 2, 3, 4]

// sync filter - result is [2, 4]
test.filter(e => e % 2 == 0)

// async filter - result is [1, 2, 3, 4]
test.filter(async e => e % 2 == 0)

text,
timestamp: new Date(),
audio_url,
words
Mephistic (Collaborator):

As per our discussion the other day, we don't want words in the main transcriptions document - just in the subcollection.

boazsender (Contributor, Author):

Oy of course. Fixing in the next commit.

transcriptionInDb.set({
id,
text,
timestamp: new Date(),
Mephistic (Collaborator):

Should this be _timestamp to match the timestamp below? If they're actually different timestamps, could we name them accordingly?

Mephistic (Collaborator):

Also, come to think of it, should this be a Timestamp instead of a Date?
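
For reference, a minimal sketch of the distinction (assuming the firebase-admin SDK; Firestore converts a Date to a Timestamp on write anyway, but writing a Timestamp keeps the stored type explicit):

import { Timestamp } from "firebase-admin/firestore"

const createdAt = Timestamp.now()               // Firestore-native timestamp
const fromDate = Timestamp.fromDate(new Date()) // equivalent conversion from a Date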

boazsender (Contributor, Author):

Sure thing, updating in next commit.

words
})

transcriptionInDb
Mephistic (Collaborator):

I think these .set() calls return Promises, so we should either await them all or write them in a transaction batch.

I don't have a strong opinion on which - firestore shouldn't need the root document to exist to create the subcollections, and worst case if one of the writes fails, we'll need to re-run the webhook anyway and that will overwrite any partially saved values.
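
A hedged sketch of the await option, mirroring the document shapes in this diff (the second write's payload is paraphrased):

await transcriptionInDb.set({ id, text, timestamp: new Date(), audio_url, words })

await transcriptionInDb
  .collection("timestamps")
  .doc("utterances")
  .set({ utterances })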

boazsender (Contributor, Author):

adding awaits in next commit


transcriptionInDb
.collection("timestamps")
.doc("utterances")
Mephistic (Collaborator):

Hmm... Based on our discussion the other day, I was expecting two separate sub-collections (utterances and words) where each doc represents one utterance/word. I'm not locked in on that approach necessarily - just wanted to clarify.

I am onboard with storing all of the utterances as one doc for now given how we'll be using it (we can always re-structure without re-scraping later if we find we need the flexibility).

I do have two concerns here, especially re: words:

  • Indexing - I'd rather not index anything with the words since we're not planning to query on them. I don't think Firestore gives us the ability to index only a subset of the documents in a collection.
    • Given our planned initial query pattern (e.g. grab the only utterances doc for a given transcription), I'm fine with just ensuring we don't index anything on utterances/words for launch (which may require adding single-field exemptions) - we can always go back and re-index if we change our query patterns. If that's not in this PR, it should come before we backfill existing hearings.
  • Document Size - there's a stated max size of 1 MB for a Firestore document, and I'm worried stuffing all of the words into one document could hit that limit.
    • For reference, the JSON of the hearing at https://prodarchivevideo.blob.core.windows.net/video/2024/Hearings/Joint/June/26_1.mp4 is ~1.1MB with both words and utterances.words included, and ~110KB without either words array included for a ~45 minute video. Reasonably assuming that splits out to 500KB for words, that could become an issue for hearings over ~90 minutes.
    • Have we tested this with a much longer (3/4+ hour) hearing? It's possible Firestore compacts this by default and this is a false alarm, but I'd rather know now so we can adjust how we store words.
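
For reference, the single-field exemption mentioned above could look something like this in firestore.indexes.json (a hedged sketch; the collection group and field path are illustrative):

{
  "indexes": [],
  "fieldOverrides": [
    {
      "collectionGroup": "utterances",
      "fieldPath": "start",
      "indexes": []
    }
  ]
}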

boazsender (Contributor, Author):

Happy to make a doc for each word and utterance. Is there a more clever/efficient/Firestore-y way that you would suggest, other than iterating over the arrays I already have and making a doc for each object in that loop?

Mephistic (Collaborator):

Not really AFAIK - one caveat would be that we should probably use the BulkWriter (since some of these videos will have dozens of utterances and tens of thousands of words). I don't think we should need sequential ids on these subcollection docs since the startTimestamp should provide a natural ordering, but I'll leave that to your best judgment.

If making words work proves problematic in any way, give me a shout in Slack and we can try another solution (whether trying a different data structure, just dumping the words json into storage and dealing with it later, or ignoring words and just re-scraping if we find we need them).
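
A minimal BulkWriter sketch under those assumptions (one doc per utterance, autogenerated ids; names follow this thread and may differ in the actual code):

const writer = db.bulkWriter()

utterances.forEach(utterance => {
  const ref = db
    .collection("transcriptions")
    .doc(transcript.id)
    .collection("utterances")
    .doc() // autogenerated id; order by `start` when querying
  writer.set(ref, utterance)
})

await writer.close() // flushes and awaits every pending write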

boazsender (Contributor, Author):

I switched to one doc per utterance and word. I'm not sure if there is a more efficient way to do all the db transactions together, so I left everything else as is.


}

const batch = db.batch()
batch.set(db.collection("transcriptions").doc(transcript.id), {
Mephistic (Collaborator):

Given that we've already saved the transcription once at this point, can this just be a batch.update to update the _timestamp? (And come to think of it, now that we renamed the other timestamp to createdAt, should this be updatedAt?)

boazsender (Contributor, Author):

RE the async filter, I haven't tested this with a mismatched token, but I see your point. Seems like Firestore doesn't have a findUnique API. Any suggestions for this? I'm worried about working off of maybeEventInDb[0], but maybe I'm overthinking it? It's not like we would end up with two docs with the same videoAssemblyId.

boazsender (Contributor, Author):

RE the timestamp, I think we can actually just delete that batch. I don't think we need both a createdAt and a _timestamp. My bad for leaving it in.

Mephistic (Collaborator):

Yeah, I wouldn't expect the Firestore API itself to help us with the async filter issue.

It looks like the easiest way to do that is to first map the events to something like a Promise<Event|null> that resolves to the event when it's valid and resolves to null when the event is invalid, then filter out the nulls afterwards.

Then you can do something like:

async function getValidEvent(event: Event): Promise<Event|null> {
  // whatever async/await logic
}

const authenticatedEvents = (await Promise.all(events.map(getValidEvent))).filter(e => e)

I also wouldn't turn down something more iterative like:

const authenticatedEvents = []

docs.forEach(async doc => {
  const otherDoc = await getOtherDoc(whatever)
  
  if (isValid(doc, otherDoc)) {
    authenticatedEvents.push(doc)
  }
})

boazsender (Contributor, Author):

I think the iterative approach is more readable. I'll do that.

audio_url
})

if (utterances) {
Mephistic (Collaborator):

Given the expected scale of utterances (i.e. several dozen per document), I believe this is fine.

Firestore does generally caution against using sequential ids because it can lead to hotspotting - but given that we'll be disabling indexes for both utterances and words subcollections and our only initial query plan is to "fetchAll" the utterances subcollection, I don't think it makes much of a difference here.

If this does end up causing an issue for utterances, we can always switch to an autogenerated documentId and just use ordering on start in queries to maintain sequential order.
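
e.g. a hedged sketch of that query:

const snap = await db
  .collection("transcriptions")
  .doc(transcriptId)
  .collection("utterances")
  .orderBy("start")
  .get()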

Mephistic (Collaborator):

Come to think of it, this would have a potential problem if we ever re-transcribe the same document (e.g. with different settings) - the utterance divisions and start times could change and would only partially overwrite any existing data.

We don't have any plans for that right now - I just want to note a caveat that we need to delete any existing transcriptions/utterances/words in the DB before re-transcribing (should that ever prove necessary).

boazsender (Contributor, Author):

I'm just going to go ahead and preemptively change this. If I just remove ${utterance.start} from the doc path, that should hand over the ID generation to Firestore, correct?

Mephistic (Collaborator):

You can test in the firebase-admin repl to be sure, but I think you might also need the more explicit method for a subcollection autogenerated id: e.g.

 db.collection("transcriptions").doc(`${transcript.id}`).collection("utterances").doc()

boazsender (Contributor, Author):

That makes sense. Reading this line of my last commit back and realizing it wouldn't work.

await writer.close()
}

if (words) {
Mephistic (Collaborator):

Given that there are tens of thousands of words per document and we're using a sequential doc id, I'm worried this could be problematically slow for a webhook.

e.g.:

  • Firestore default is 500 writes / second for new collections or those with sequential indexes.
  • ~50,000 words in an average hearing / 500 = ~100 seconds to write all the word docs to firestore.
  • The AssemblyAI webhook response timeout is 10 seconds, and they retry up to 10 times before failing permanently.
  • Therefore, we could potentially see a ton of repeated firestore writes before failing anyway.

Switching to auto-generated doc ids might help to avoid the sequential index issue (there's no documented writes/second cap on those), and we could also partition multiple words into one doc to cut down on doc count (e.g. 1000 words per doc as an in-between for 1 word per doc and all words in one doc).
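
A hedged sketch of that partitioning idea (1000 words per doc, autogenerated ids; all names illustrative):

const CHUNK_SIZE = 1000
const writer = db.bulkWriter()

for (let i = 0; i < words.length; i += CHUNK_SIZE) {
  const ref = db
    .collection("transcriptions")
    .doc(transcript.id)
    .collection("words")
    .doc() // autogenerated id avoids the sequential-index write cap
  writer.set(ref, { offset: i, words: words.slice(i, i + CHUNK_SIZE) })
}

await writer.close()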

Mephistic (Collaborator):

That said, this is getting more complicated than it's worth for a "nice-to-have" feature. How does the following sound for a rollout plan?

  • Don't save words to the DB at all in this PR - only save the transcription metadata and the utterances
  • Merge + Deploy this and have the flow run against a small subset of hearings on the Dev environment (IMO the default behavior is safe since it should only scrape recent hearings and there are only a few of those at any given time).
  • While we get a few hearings to run through the flow, I can tweak my hacky test front-end for transcriptions to support the new utterances subcollection format.
  • We'll have Matt V + a few others look at the transcribed hearings and confirm that we only need utterances
    • I suspect the visual UI will help get a better feel for the accuracy of the diarization than we've previously gotten by skimming example data.

At this point, we'll either have confirmation that we're fine with just utterances, or will have a better sense for how to organize words to get some use out of it.

  • If we're fine with just utterances, we can just run the backfill to transcribe all past hearings
  • If we find utterances not good enough for V1, we should have a quick chat
    • V1 isn't supposed to make promises around transcription accuracy, so anything complicated is likely out of scope here, but it's possible we'll be fine with just the steps I mentioned in the above comments (to effectively hack together pseudo-utterances out of the words data).

This approach should let us unblock this PR and get a better sense for how well our current approach plays with real data. Thoughts?

boazsender (Contributor, Author):

Sounds very sensible to me, I was starting to think the same thing. Pulling words out in the next commit.

const authenticatedEventIds = [] as string[]
const hashedToken = sha256(String(req.headers["x-maple-webhook"]))

maybeEventsInDb.docs.forEach(async doc => {
Mephistic (Collaborator):

Sorry, my bad here based on the last comment - given that the forEach function is async, I don't think these calls are guaranteed to complete before moving to the authenticatedEventIds.length === 1 check.

The map-predicate approach is likely the better option because it awaits on all the async checks before proceeding - though a less async iterative approach like the following might still work:

const authenticatedEventIds = []
for (const docIndex in maybeEventsInDb.docs) {
  const doc = maybeEventsInDb.docs[docIndex]
  const tokenDocInDb = await db.whatever()
 
  if (isAuthed(doc, tokenDocInDb)) {
    authenticatedEventIds.push(doc.id)
  }
}

// No extra async, so authenticatedEventIds is now guaranteed to have all ids

boazsender (Contributor, Author):

for...in is more like what I already wrote, so going with that.

// Delete the hashed webhook auth token from our db now that
// we're done.
authenticatedEventIds.forEach(async docId => {
await db.doc(docId).set({ ["x-maple-webhook"]: null })
Mephistic (Collaborator):

Come to think of it, are we actually saving this x-maple-webhook key to Firestore such that this would null it? I thought we stored the token in the private sub-collection when we first submit the transcription request (and we're now a lot more restrictive in the fields we actually save to the main transcription document).

boazsender (Contributor, Author):

oops, good catch, let me fix that.


// Delete the hashed webhook auth token from our db now that
// we're done.
authenticatedEventIds.forEach(async docId => {
Mephistic (Collaborator):

nit: I think this async forEach could run into the same problem as the other one that was just fixed - there's no guarantee the async function will complete before the request ends.

Mephistic (Collaborator) left a comment:

LGTM!

I think there might still be some useful improvements to be made around the error handling in the webhook (e.g. different response codes if auth fails), but I don't think it's worth blocking this PR.

Mephistic merged commit b28e44d into codeforboston:main on Apr 22, 2025.
7 of 8 checks passed