@TyHil TyHil commented Sep 26, 2025

Scrape academic calendars from Box. The links come from https://www.utdallas.edu/academics/calendar/, and each PDF is fetched through a special Box download URL built from its file ID.

Parse the PDFs with Gemini. Pull the first page of text out of each PDF and send it with a prompt describing the schema. This works better than sending the actual file: it's been mistake-free since I switched from sending the PDF file to sending the PDF text. It costs about $0.13 to parse all the academic calendars, but I'm hashing the prompts and storing the results in cloud storage, so nothing should ever be re-parsed unless the PDF changes. Eight new environment variables are required: four for Gemini and four for the Nebula API, which is used for cloud storage caching.
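For reviewers, here is a minimal sketch of the prompt-hash caching idea described above. It is not the code in this PR: `ResultCache`, `memoryCache`, `promptKey`, and `parseWithCache` are hypothetical names, the in-memory map stands in for cloud storage, the fake parse function stands in for the Gemini call, and SHA-256 is an assumed choice of hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ResultCache is a hypothetical stand-in for the cloud storage cache.
type ResultCache interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

// memoryCache is an in-memory ResultCache used only for illustration.
type memoryCache struct {
	entries map[string]string
}

func newMemoryCache() *memoryCache {
	return &memoryCache{entries: make(map[string]string)}
}

func (c *memoryCache) Get(key string) (string, bool) {
	value, ok := c.entries[key]
	return value, ok
}

func (c *memoryCache) Put(key, value string) {
	c.entries[key] = value
}

// promptKey returns the hex-encoded SHA-256 digest of the full prompt.
// The prompt contains the extracted PDF text, so the key changes
// whenever the PDF text (or the schema description) changes.
func promptKey(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

// parseWithCache calls parse only when no result is cached for the prompt.
func parseWithCache(cache ResultCache, prompt string, parse func(string) (string, error)) (string, error) {
	key := promptKey(prompt)
	if cached, ok := cache.Get(key); ok {
		return cached, nil
	}
	result, err := parse(prompt)
	if err != nil {
		// Failed parses are not cached, so they are retried next run.
		return "", err
	}
	cache.Put(key, result)
	return result, nil
}

func main() {
	cache := newMemoryCache()
	calls := 0
	// fakeParse stands in for the model call and counts invocations.
	fakeParse := func(prompt string) (string, error) {
		calls++
		return `{"placeholder": true}`, nil
	}

	prompt := "schema description\n\nfirst page of calendar text"
	for i := 0; i < 2; i++ {
		if _, err := parseWithCache(cache, prompt, fakeParse); err != nil {
			fmt.Println("parse failed:", err)
			return
		}
	}
	fmt.Println("model calls:", calls) // prints "model calls: 1"
}
```

Keying on the whole prompt rather than on the file ID means a changed PDF or a changed schema description both invalidate the cached result, while an unchanged calendar never triggers a second model call.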

The uploader is very similar to the other uploaders.

There are a lot of go.mod changes that may not be entirely necessary. They were made automatically, so I could work on reverting them.

I haven't yet set this up to run automatically, but I can after merge. I'm unsure whether we'd want it to run daily or weekly.

@TyHil TyHil marked this pull request as ready for review September 26, 2025 02:07
@mikehquan19 mikehquan19 self-requested a review September 26, 2025 16:39
@TyHil TyHil changed the title from "Academic calendar scraper" to "Academic calendar scraper, parser, and uploader" Oct 3, 2025