💬 Fixes pbcore json double quote bug #2896

Merged
merged 5 commits into from
Jan 28, 2025

Conversation

mrharpo
Contributor

@mrharpo mrharpo commented Dec 18, 2024

Fixes double quote escaping bug in pbcore_xml_to_json.xsl

Closes #2697

@mrharpo mrharpo requested review from afred and foglabs December 18, 2024 23:04
@mrharpo mrharpo self-assigned this Dec 18, 2024
@mrharpo mrharpo added the bug 🐛 and production 🎭 (relating to the production deployment) labels Dec 18, 2024
@afred afred requested a review from foo4thought December 19, 2024 14:15
@mrharpo
Contributor Author

mrharpo commented Dec 20, 2024

I keep improving this script and it keeps revealing new bugs! 🐞 🪰 🪱

Tester

Requires xsltproc and jq

curl https://americanarchive.org/catalog/cpb-aacip-211-46d25n09.pbcore | xsltproc ~/gbh/aapb/AAPB2/lib/pbcore_xml_to_json.xsl - | jq

Overall, I'm getting about 1% error rate with the latest revision, which is much better than the 10-20% errors before!
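An error rate like the one above could be measured by checking whether each transformed record parses as JSON. The following is a minimal sketch, not the script used in this PR: the function name `json_error_rate` and the sample strings are hypothetical, and it assumes the `xsltproc` output for each record has already been collected as a string.

```python
import json


def json_error_rate(outputs):
    """Return the fraction of strings in `outputs` that fail to parse as JSON.

    `outputs` is assumed to hold one transformed record per string
    (hypothetical input; how they are collected is up to the caller).
    """
    outputs = list(outputs)
    if not outputs:
        return 0.0
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)


# Illustrative samples only, not real AAPB records.
samples = [
    '{"description": "He said \\"hello\\""}',  # quotes escaped: parses
    '{"description": "He said "hello""}',      # quotes unescaped: fails
]
print(json_error_rate(samples))
```

Counting parse failures only shows that the output is syntactically valid JSON; it does not confirm that the values match the source XML.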

Working

Double quotes

cpb-aacip-211-46d25n09 now properly escapes quotes in the <pbcoreDescription descriptionType="Episode"> field

Newlines

cpb-aacip-254-18dfn5gb was breaking with the original xsl, but is now properly escaping double quotes and newlines in <pbcoreDescription>

Backslashes

cpb-aacip-211-49g4fnmn now properly escapes lone backslash characters

  • Side note: Why does this record have a U+FFFD (Unicode replacement character: �) in the description?
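The three cases above (double quotes, newlines, backslashes) depend on the order in which the substitutions run. The sketch below illustrates that ordering in Python; it is an assumption about the general technique, not a copy of the templates in pbcore_xml_to_json.xsl, and the name `escape_json_string` is hypothetical. It handles only the characters discussed here plus carriage return and tab, not every control character JSON requires escaping.

```python
import json


def escape_json_string(value):
    """Escape a raw string for use inside a JSON string literal.

    The backslash is replaced first; otherwise the backslashes added
    by the later substitutions would themselves be doubled.
    """
    value = value.replace("\\", "\\\\")  # lone backslash -> \\
    value = value.replace('"', '\\"')    # double quote   -> \"
    value = value.replace("\n", "\\n")   # line feed      -> \n
    value = value.replace("\r", "\\r")   # carriage return -> \r
    value = value.replace("\t", "\\t")   # tab            -> \t
    return value


# Illustrative text, not taken from a real record.
raw = 'A "quoted" title\nwith a lone \\ backslash'
encoded = '"' + escape_json_string(raw) + '"'

# Round trip: parsing the escaped literal should give back the raw text.
assert json.loads(encoded) == raw
```

Running the backslash substitution last instead would turn an already escaped `\"` into `\\"`, which ends the string early and breaks the JSON.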

Not working

HTML

cpb-aacip-211-46d25p4w has HTML that is not being escaped (breaking on an unescaped double quote in <a href="...")

@foo4thought
Collaborator

foo4thought commented Jan 7, 2025

Nice stylesheet, much cleaner than mine (adapted for FMP), but too simple where it globally "escapes" the backslash (for JSON) by doubling it. XML that expresses something as simple as a line feed with a hideous hex code like '\0x0A' instead of &#10; must be dealt with explicitly. I commented earlier with the exact sequence of substitutions I use for the XML-to-JSON conversion behind the FileMaker database "AAPB_Enhancements," then deleted the comment because I wanted to mess with it. Will be more cogent tomorrow.

@foo4thought
Collaborator

https://americanarchive.org/catalog/cpb-aacip-211-46d25p4w.pbcore is not valid XML, so transforming it as XML is a pretty tall order. I cannot find a way to do it with XSL alone, without preprocessing it.
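One way to spot records like this before running the stylesheet is a well-formedness check. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and both sample strings are hypothetical, and the unclosed `<br>` is only an illustration, since the thread does not say what exactly makes this record invalid.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: HTML kept as escaped text parses fine, while
# raw HTML with an unclosed tag makes the document ill-formed.
good = '<pbcoreDescription>See &lt;a href="x"&gt;link&lt;/a&gt;</pbcoreDescription>'
bad = '<pbcoreDescription>Line one<br>Line two</pbcoreDescription>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

Records that fail this check would need preprocessing (or a fix at the source) before an XSL transform can be applied to them.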

@foo4thought
Collaborator

LOL all the crap I've added to my XSL to handle weird hex codes and idiotic backslashes is better dealt with using your code! I'm going to adopt your sequence of escaping and expect it to work just fine.

@foo4thought foo4thought closed this Jan 7, 2025
@mrharpo mrharpo reopened this Jan 9, 2025
@mrharpo mrharpo marked this pull request as ready for review January 9, 2025 18:36
@ekemeyer
Contributor

Kevin confirms this looks good to him

@ekemeyer ekemeyer merged commit 9a40882 into master Jan 28, 2025
6 checks passed
@ekemeyer ekemeyer deleted the pbcore-xml-to-json branch January 28, 2025 18:51
Labels
bug 🐛 production 🎭 Relating to the production deployment
Development

Successfully merging this pull request may close these issues.

AAPB JSON API breaks when xml values contain double-quotes
3 participants