Working end-to-end transcription integrations. #1774

Conversation
There's still an open question around what transcription data we should actually store - that doesn't necessarily have to be part of this PR, but we should get at least a tentative answer/plan on that before we run any backfill
functions/src/events/scrapeEvents.ts
Outdated
```
@@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
const hearing = Hearing.check(eventData)
```
Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, `eventData` would be undefined - does `Hearing.check` blow up (and prevent us from returning the non-transcription related event data)?
I'm not sure. I think I was drawing inspiration from this prior line:

maple/functions/src/events/scrapeEvents.ts
Line 142 in a18436a

```
const content = HearingContent.check(await api.getHearing(EventId))
```

How would you want to see this done differently? I can also try some things out if nothing comes to mind.
I think the chief value of that `HearingContent.check` there is about guarding against changes to the external Mass Legislature API (i.e. if the data they return deviates too much from what we expect, don't risk polluting our database - instead, throw an error to force a dev to investigate).

For our purposes here: if `eventInDb.exists()` is false (or equivalently AFAIK, if `eventData` is undefined), it should mean that we are scraping the present hearing for the first time. This means that we have definitely not scraped the hearing's video yet and it's valid to try scraping videos for it. The simplest, most robust solution is to just add this case to the logic of when to scrape video.

Currently, we scrape video only if:

- there is a pre-existing hearing with this EventId in our DB that does not have a `videoUrl` set
- AND the hearing in question is within our scraping time window

We should instead scrape video only if:

- EITHER we have not scraped this hearing before, OR there is a pre-existing hearing in our DB with this EventId that does not have a `videoUrl` set
- AND the hearing in question is within our scraping time window
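The revised condition can be sketched as a pure predicate (the name `shouldScrapeVideo` comes from later in this thread; the simplified event shape and parameter list here are assumptions for illustration, not the real types in `scrapeEvents.ts`):

```typescript
// Sketch of the proposed video-scrape gate. `ExistingHearing` is a
// simplified stand-in for the stored event document - an assumption.
type ExistingHearing = { videoUrl?: string } | undefined

function shouldScrapeVideo(
  existing: ExistingHearing, // undefined => never scraped before
  startsAt: Date,
  withinCutoff: (d: Date) => boolean
): boolean {
  // EITHER first scrape OR a pre-existing hearing without a videoUrl...
  const needsVideo = existing === undefined || !existing.videoUrl
  // ...AND the hearing is within the scraping time window.
  return needsVideo && withinCutoff(startsAt)
}
```

This keeps the "first scrape" case and the "missing videoUrl" case in one place, so the caller only needs a single boolean check.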
Updating this logic. Leaving the `.check` in there since that was there prior, and I think it is still needed for the first run on a hearing. Do you agree?
wait nevermind, I conflated two different `.check`s. Will leave the first and remove the second.
functions/src/events/scrapeEvents.ts
Outdated
```
@@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
maybeVideoURL = firstVideoSource.src

transcript = await assembly.transcripts.submit({
```
Now that we've worked out the kinks here, could we refactor `getEvent` a bit? I feel like there's at least two functions here that could be broken out to make this more readable:

- `shouldScrapeVideo` (handles fetching any previous event + checking for existingVideo + the date cutoff logic, returns a boolean yes/no)
- `getHearingVideoUrl` (does the JSDOM wrangling, returns a `string | undefined` with the `videoUrl`)
- And maybe `submitTranscription` (takes a `videoUrl`, sends the AssemblyAI request + saves the webhookAuth in the DB, returns `transcriptId`)

Which would leave the new code as something like:

```ts
if (shouldScrapeVideo(EventId)) {
  const maybeVideoUrl = getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
    const transcriptId = await submitTranscription(maybeVideoUrl)
    // add video/transcription data to Event
  }
}
```
Sure thing, I can take a pass at this.
functions/src/events/scrapeEvents.ts
Outdated
```
})

await db
```
Ideally, I think we shouldn't actually save the hearing event to Firestore in `getEvent` - the abstract class already handles that for us. The `webhookAuth` is in a subcollection (which wouldn't be caught by the abstract class) and is the exception to that rule - this should be fine since Firestore lets you write to subcollections of documents that don't exist yet.
What would you like me to do here differently?
When we scrape video, we can simply conditionally add the relevant video fields to the return object of `getEvent` (those fields being `videoURL`, `videoFetchedAt`, and `videoAssemblyId`). So long as they're present on the returned object, the code in the abstract class will handle the work of saving the events to the DB. See the line `` writer.set(db.doc(`/events/${event.id}`), event, { merge: true }) `` in `run()` at the top of this file.

As it stands, I think this would save the event twice - first in `getEvent()` and then again in `run()`.
Ok I think I understand now, let me know if I got it right in my next commit
```
if (tokenInDbData) {
  return hashedToken === tokenInDbData.videoAssemblyWebhookToken
}
return false
}
)
console.log("authenticatedEventsInDb", authenticatedEventsInDb)

if (authenticatedEventsInDb) {
  try {
    await db
```
We should start thinking about what parts of the response we actually want to save here - there are a lot of fields in the example response you posted in Slack, but the only ones that look potentially relevant are:

- `id`
- `text`
- `utterances`
- `words`
- `audio_url` (though we already have it on the hearing event, it would still be helpful here)
- `audio_duration` (not sure what we'd need it for though, but it's conceivably useful)
- `confidence`

I'm most interested in `text`/`utterances`/`words` - since those will take up >99% of the data size (and `words` likely takes up >90%). We should chat with Matt V and design about the desired functionality, but I don't think we need all three. I goofed around with some of the example data: it looks like `utterances` gives us mostly sensible, mostly accurate divisions of the content, and `text` is good to have as a fallback.

IMO `words` is likely unnecessary - I believe it would only be useful if we either:

- Find ourselves dissatisfied with the breakpoints/speaker split inherent to `utterances` and want to devise our own
- Want to visually identify specific words that Assembly flagged as low confidence

Want to get @mvictor55 's input on the desired functionality here before dropping the axe though (and @mertbagt 's take on what we actually need for the front-end). If we do cut `words`, we should also probably cut the `words` array in `utterances`.
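Trimming the payload to the fields listed above could look something like the sketch below (the field names come from this thread; the loose input type is an assumption standing in for the full AssemblyAI response, not its real shape):

```typescript
// Pick only the fields we plan to persist from a transcript response.
// `TranscriptResponse` is a deliberately loose stand-in type.
type TranscriptResponse = Record<string, unknown> & {
  id: string
  text?: string
  utterances?: unknown[]
  audio_url?: string
}

function pickTranscriptFields(t: TranscriptResponse) {
  // Destructuring keeps the fields we want and drops bulky ones
  // like `words`, plus anything unrecognized.
  const { id, text, utterances, audio_url } = t
  return { id, text, utterances, audio_url }
}
```

Whatever subset is chosen, doing the pick explicitly (rather than spreading the whole response) keeps accidental large fields out of the document.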
Yes, would love to know what you think @mvictor55 and @mertbagt. I can proceed accordingly based on what you three align on.
```
@@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => {

authenticatedEventsInDb.forEach(async d => {
```
Just double-checking - is it possible that the firebase function will exit before the async function in the `forEach` completes?

If so, it may be worth switching to a transaction for the writes here - something like:

```ts
const batch = db.batch()
batch.set(
  db.collection("transcriptions").doc(transcript.id),
  { _timestamp: new Date(), ...transcript }
)
authenticatedEventsInDb.forEach(doc => {
  batch.update(doc.ref, { ["x-maple-webhook"]: null })
})
await batch.commit()
```
Smart, will do.
functions/src/events/scrapeEvents.ts
Outdated
```
return false
}
if (!eventData.videoFetchedAt) {
  return withinCutoff(new Date(eventData.StartTime))
```
Why did this change from `Hearing.check(eventData).startsAt.toDate()`? I don't think `eventData` has a `StartTime` field.
I think I misunderstood an earlier review comment. I'll change this in the next commit.
```
if (req.body.status === "completed") {
  const transcript = await assembly.transcripts.get(req.body.transcript_id)
  if (transcript && transcript.webhook_auth) {
    const maybeEventInDb = await db
      .collection("events")
      .where("videoAssemblyId", "==", transcript.id)
      .get()

    if (maybeEventInDb.docs.length) {
      const authenticatedEventsInDb = maybeEventInDb.docs.filter(
```
Just double-checking here - does this successfully filter out requests where the tokens don't match?

Now that I think about it, I would expect the `async` filter function to return a Promise that resolves to a boolean instead of a boolean - and since the Promise itself would always be truthy, this check would always return true and never filter anything.
I believe so, based on `return hashedToken === tokenInDbData.videoAssemblyWebhookToken`
As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833:

```ts
const test = [1, 2, 3, 4]
// sync filter - result is [2, 4]
test.filter(e => e % 2 == 0)
// async filter - result is [1, 2, 3, 4]
test.filter(async e => e % 2 == 0)
```
```
text,
timestamp: new Date(),
audio_url,
words
```
As per our discussion the other day, we don't want `words` in the main `transcriptions` document - just in the subcollection.
Oy of course. Fixing in the next commit.
```
transcriptionInDb.set({
  id,
  text,
  timestamp: new Date(),
```
Should this be `_timestamp` to match the timestamp below? If they're actually different timestamps, could we name them accordingly?
Also, come to think of it, should this be a `Timestamp` instead of a `Date`?
Sure thing, updating in next commit.
```
words
})

transcriptionInDb
```
I think these `.set()` calls return Promises, so we should either await them all or write them in a transaction batch.

I don't have a strong opinion on which - Firestore shouldn't need the root document to exist to create the subcollections, and worst case, if one of the writes fails, we'll need to re-run the webhook anyway and that will overwrite any partially saved values.
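One way to await them all is to collect the pending writes and `Promise.all` them - a minimal sketch, where the simulated `writes` callbacks stand in for the real `transcriptionInDb.set(...)` calls (an assumption for illustration):

```typescript
// Collect every pending write, then block until all have settled,
// so the function cannot return while writes are still in flight.
async function saveAll(writes: Array<() => Promise<void>>): Promise<number> {
  await Promise.all(writes.map(w => w()))
  return writes.length
}
```

The alternative the comment mentions - a batched write - has the added property of being atomic, which `Promise.all` over independent `.set()` calls is not.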
adding awaits in next commit
```
transcriptionInDb
  .collection("timestamps")
  .doc("utterances")
```
Hmm... Based on our discussion the other day, I was expecting two separate sub-collections (`utterances` and `words`) where each doc represents one utterance/word. I'm not locked in on that approach necessarily - just wanted to clarify.

I am onboard with storing all of the `utterances` as one doc for now given how we'll be using it (we can always re-structure without re-scraping later if we find we need the flexibility).

I do have two concerns here, especially re: `words`:

- Indexing - I'd rather not index anything with the `words` since we're not planning to query on them. I don't think Firestore gives us the ability to index only a subset of the documents in a collection.
  - Given our planned initial query pattern (e.g. grab the only utterances doc for a given transcription), I'm fine with just ensuring we don't index anything on `utterances`/`words` for launch (which may require adding single-field exemptions) - we can always go back and re-index if we change our query patterns. If that's not in this PR, it should come before we backfill existing hearings.
- Document Size - there's a stated max size of 1 MB for a Firestore document, and I'm worried stuffing all of the `words` into one document could hit that limit.
  - For reference, the JSON of the hearing at https://prodarchivevideo.blob.core.windows.net/video/2024/Hearings/Joint/June/26_1.mp4 is ~1.1MB with both `words` and `utterances.words` included, and ~110KB without either `words` array included, for a ~45 minute video. Reasonably assuming that splits out to ~500KB for `words`, that could become an issue for hearings over ~90 minutes.
  - Have we tested this with a much longer (3/4+ hour) hearing? It's possible Firestore compacts this by default and this is a false alarm, but I'd rather know now so we can adjust how we store words.
Happy to make a doc for each word and utterance. Is there a more clever/efficient/Firestore-y way that you would suggest, other than iterating over the arrays I already have and making a doc for each object in that loop?
Not really AFAIK - one caveat would be that we should probably use the BulkWriter (since some of these videos will have dozens of utterances and tens of thousands of words). I don't think we should need sequential ids on these subcollection docs since the `startTimestamp` should provide a natural ordering, but I'll leave that to your best judgment.

If making `words` work proves problematic in any way, give me a shout in Slack and we can try another solution (whether trying a different data structure, just dumping the `words` json into storage and dealing with it later, or ignoring words and just re-scraping if we find we need them).
I switched to one doc per utterance and word. I'm not sure if there is a more efficient way to do all the db transactions together, so I left everything else as is.
```
if (req.body.status === "completed") {
  const transcript = await assembly.transcripts.get(req.body.transcript_id)
  if (transcript && transcript.webhook_auth) {
    const maybeEventInDb = await db
      .collection("events")
      .where("videoAssemblyId", "==", transcript.id)
      .get()

    if (maybeEventInDb.docs.length) {
      const authenticatedEventsInDb = maybeEventInDb.docs.filter(
```
```
}

const batch = db.batch()
batch.set(db.collection("transcriptions").doc(transcript.id), {
```
Given that we've already saved the `transcription` once at this point, can this just be a `batch.update` to update the `_timestamp`? (And come to think of it, now that we renamed the other timestamp to `createdAt`, should this be `updatedAt`?)
RE the async filter, I haven't tested this with a mismatching webhook, but I see your point. Seems like Firestore doesn't have a findUnique api. Any suggestions for this? I'm worried about working off of `maybeEventInDb[0]`, but maybe I'm overthinking it? It's not like we would end up with two docs with the same `videoAssemblyId`.
RE the timestamp, I think we can actually just delete that batch. I don't think we need a `createdAt` and a timestamp. My bad for leaving it in.
Yeah, I wouldn't expect the Firestore API itself to help us with the async filter issue.

It looks like the easiest way to do that is to first `map` the events to something like a `Promise<Event|null>` that resolves to the event when it's valid and resolves to `null` when the event is invalid, then filter out the nulls afterwards. Then you can do something like:

```ts
async function getValidEvent(event: Event): Promise<Event|null> {
  // whatever async/await logic
}
const authenticatedEvents = (await Promise.all(events.map(getValidEvent))).filter(e => e)
```

I also wouldn't turn down something more iterative like:

```ts
const authenticatedEvents = []
docs.forEach(async doc => {
  const otherDoc = await getOtherDoc(whatever)
  if (isValid(doc, otherDoc)) {
    authenticatedEvents.push(doc)
  }
})
```
I think the iterative approach is more readable. I'll do that.
```
audio_url
})

if (utterances) {
```
Given the expected scale of utterances (i.e. several dozen per document), I believe this is fine.

Firestore does generally caution against using sequential ids because it can lead to hotspotting - but given that we'll be disabling indexes for both the `utterances` and `words` subcollections and our only initial query plan is to "fetchAll" the `utterances` subcollection, I don't think it makes much of a difference here.

If this does end up causing an issue for `utterances`, we can always switch to an autogenerated documentId and just use ordering on `start` in queries to maintain sequential order.
Come to think of it, this would have a potential problem if we ever re-transcribe the same document (e.g. with different settings) - the utterance divisions and `start` times could change and would only partially overwrite any existing data.

We don't have any plans for that right now - I just want to note a caveat that we need to delete any existing `transcriptions`/`utterances`/`words` in the DB before re-transcribing (should that ever prove necessary).
I'm just going to go ahead and preemptively change this as a start. If I just remove `${utterance.start}` from the doc path, that should hand over the ID generation to Firestore, correct?
You can test in the firebase-admin repl to be sure, but I think you might also need the more explicit method for a subcollection autogenerated id, e.g.:

```ts
db.collection("transcriptions").doc(`${transcript.id}`).collection("utterances").doc()
```
That makes sense. Reading this line of my last commit back and realizing it wouldn't work.
```
await writer.close()
}

if (words) {
```
Given there are tens of thousands of words per document and we're using a sequential doc id, I'm worried this could potentially be problematically slow for a webhook. e.g.:

- Firestore's default is 500 writes/second for new collections or those with sequential indexes.
- ~50,000 words in an average hearing / 500 = ~100 seconds to write all the word docs to Firestore.
- The AssemblyAI webhook response timeout is 10 seconds, and they retry up to 10 times before failing permanently.
- Therefore, we could potentially see a ton of repeated Firestore writes before failing anyway.

Switching to auto-generated doc ids might help to avoid the sequential index issue (there's no documented writes/second cap on those), and we could also partition multiple words into one doc to cut down on doc count (e.g. 1000 words per doc as an in-between for 1 word per doc and all words in one doc).
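The partitioning idea can be sketched with a small helper (the `chunkWords` name and simplified `Word` shape are hypothetical; the 1000-per-doc figure is the one floated above):

```typescript
// Split a transcript's word list into fixed-size groups so each
// Firestore doc stays well under the 1 MB limit and the total doc
// count drops by ~3 orders of magnitude versus one doc per word.
type Word = { text: string; start: number; end: number }

function chunkWords(words: Word[], perDoc = 1000): Word[][] {
  const chunks: Word[][] = []
  for (let i = 0; i < words.length; i += perDoc) {
    chunks.push(words.slice(i, i + perDoc))
  }
  return chunks
}
```

Each chunk would then be written as a single doc (keyed by an auto id, with the first word's `start` available as an ordering field).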
That said, this is getting more complicated than it's worth for a "nice-to-have" feature. How does the following sound for a rollout plan?

- Don't save `words` to the DB at all in this PR - only save the `transcription` metadata and the `utterances`.
- Merge + deploy this and have the flow run against a small subset of hearings on the Dev environment (IMO the default behavior is safe since it should only scrape recent hearings and there are only a few of those at any given time).
- While we get a few hearings to run through the flow, I can tweak my hacky test front-end for transcriptions to support the new utterances subcollection format.
- We'll have Matt V + a few others look at the transcribed hearings and confirm that we only need utterances - I suspect the visual UI will help get a better feel for the accuracy of the diarization than we've previously gotten by skimming example data.

At this point, we'll either have confirmation that we're fine with just `utterances`, or will have a better sense for how to organize `words` to get some use out of it.

- If we're fine with just `utterances`, we can just run the backfill to transcribe all past hearings.
- If we find `utterances` not good enough for V1, we should have a quick chat.
  - V1 isn't supposed to make promises around transcription accuracy, so anything complicated is likely out of scope here, but it's possible we'll be fine with just the steps I mentioned in the above comments (to effectively hack together pseudo-utterances out of the `words` data).

This approach should let us unblock this PR and get a better sense for how well our current approach plays with real data. Thoughts?
Sounds very sensible to me, I was starting to think the same thing. Pulling words out in the next commit.
```
const authenticatedEventIds = [] as string[]
const hashedToken = sha256(String(req.headers["x-maple-webhook"]))

maybeEventsInDb.docs.forEach(async doc => {
```
Sorry, my bad here based on the last comment - given that the forEach function is async, I don't think these calls are guaranteed to complete before moving to the `authenticatedEventIds.length === 1` check.

The map-predicate approach is likely the better option because it awaits on all the async checks before proceeding - though a less async iterative approach like the following might still work:

```ts
const authenticatedEventIds = []
for (const docIndex in maybeEventsInDoc.docs) {
  const doc = maybeEventsInDoc.docs[docIndex]
  const tokenDocInDb = await db.whatever()
  if (isAuthed(doc, tokenDocInDb)) {
    authenticatedEventIds.push(doc.id)
  }
}
// No extra async, so authenticatedEventIds is now guaranteed to have all ids
```
`for...in` is more like what I already wrote, so going with that.
```
// Delete the hashed webhook auth token from our db now that
// we're done.
authenticatedEventIds.forEach(async docId => {
  await db.doc(docId).set({ ["x-maple-webhook"]: null })
```
Come to think of it, are we actually saving this `x-maple-webhook` key to Firestore such that this would null it? I thought we stored the token in the `private` sub-collection when we first submit the transcription request (and we're now a lot more restrictive in the fields we actually save to the main `transcription` document).
oops, good catch, let me fix that.
```
// Delete the hashed webhook auth token from our db now that
// we're done.
authenticatedEventIds.forEach(async docId => {
```
nit: I think this async `forEach` could run into the same problem as the other one that was just fixed - there's no guarantee the async function will complete before the request ends.
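The same fix applies here: replace the fire-and-forget `forEach` with an awaited loop. A minimal sketch, where the `clear` callback stands in for the real `db.doc(docId).set({ ["x-maple-webhook"]: null })` call (the function name is hypothetical):

```typescript
// Await each token deletion before letting the request handler return,
// so the function runtime cannot tear down mid-write.
async function clearWebhookTokens(
  docIds: string[],
  clear: (docId: string) => Promise<void>
): Promise<void> {
  for (const docId of docIds) {
    await clear(docId) // completes before the loop advances
  }
}
```

Any of the earlier alternatives (batched write, `Promise.all` over mapped promises) would also guarantee completion; the key point is that something awaits the writes before the response is sent.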
LGTM!
I think there might still be some useful improvements to be made around the error handling in the webhook (e.g. different response codes if auth fails), but I don't think it's worth blocking this PR.
Fixes from webhook debugging work. Follow this test plan to verify:
Local testing guide
The following instructions will let you test the scraping half of the implementation. The webhook half of the implementation needs to be on the development project for an end-to-end test.
In order to perform this test you will need
Instructions

1. `ngrok config add-authtoken your-auth-token`
2. `yarn run dev:functions`
3. `ngrok http http://localhost:5001`
4. This test works against the hearing `hearing-5091`. Edit `functions/src/events/scrapeEvents.ts` to create a local testing state:
   - Hardcode `const EventId = 5091` into the beginning of `getEvent`
   - Use `https://put-your-ngrok-tunnel-id-here.ngrok-free.app/demo-dtp/us-central1/scrapeHearings-0` as the webhook URL
5. `yarn firebase-admin -e dev console`
6. `const resp = await db.collection("events").doc("hearing-5091").get()`
7. `const data = resp.data()`
8. `JSON.stringify(data)`
9. Copy the output of `JSON.stringify(data)` without the surrounding quotes (so you have a JS object) and save it somewhere for the next step
10. `yarn dev:functions`
11. `yarn firebase-admin -e local console`
12. `await db.collection('events').doc('hearing-5091').set(data)`, where data is the js object on your clipboard. It should look something like `await db.collection('events').doc('hearing-5091').set({"startsAt":{"_seconds":1741881600,"_nanoseconds":0},"id":"hearing-5091","type":"hearing","content":{"EventDate":"2025-03-13T12:00:00","HearingAgendas":[{"DocumentsInAgenda":[],"Topic":"Annual Hearing on a Potential Modification of the Health Care Cost Growth Benchmark","StartTime":"2025-03-13T12:00:00","EndTime":"2025-03-13T15:00:00"}],"RescheduledHearing":null,"StartTime":"2025-03-13T12:00:00","EventId":5091,"HearingHost":{"Details":"http://malegislature.gov/api/GeneralCourts/194/Committees/J24","CommitteeCode":"J24","GeneralCourtNumber":194},"Name":"Joint Committee on Health Care Financing","Description":"Joint Committee on Health Care Financing & Health Policy Commission Public Hearing on a Potential Modification of the CY 2026 Health Care Cost Growth Benchmark","Location":{"AddressLine2":null,"AddressLine1":"24 Beacon Street","State":"MA","ZipCode":"02133","City":"Boston","LocationName":"Gardner Auditorium"},"Status":"Completed"},"fetchedAt":{"_seconds":1741967707,"_nanoseconds":728000000}})`
13. Delete `startsAt` and `fetchedAt`, and replace them with timestamp type fields of the same property names and values of right now.
14. `yarn dev:functions` and let the emulator run `scrapeHearings()`
15. Verify the result: `const resp = await db.collection("events").doc("hearing-5091").get()` then `resp.data()`