Extract USLM to database tables #31

Open
doi-jschlagel opened this issue Sep 25, 2023 · 4 comments

@doi-jschlagel

Hello! I am working to extract data from several collections at https://www.govinfo.gov/app/collection/comps/w into database tables. I want to store the key content for each law in one table, as follows:

section identifier
section – subsection identifier
section – subsection – paragraph identifier
section – subsection – paragraph – subparagraph identifier
section – subsection – paragraph – subparagraph – clause identifier

section num
section – subsection num
section – subsection – paragraph num
section – subsection – paragraph – subparagraph num
section – subsection – paragraph – subparagraph – clause num

etc etc.

I am using OpenRefine to extract the data to CSV and then load it into the database. OpenRefine does an OK job, but it (a) does not seem able to capture all of the content and (b) does not deal well with large files.

I wonder if I am reinventing the wheel here, and if the USLM team has any ideas for best practice extracting content from the xml.

thank you, and thanks for the good work!

-Joel

@llaplant
Member

Hi Joel,

Files are also available from the GovInfo Bulk Data Repository at https://www.govinfo.gov/bulkdata/COMPS.

Also passing your questions to others on the USLM team.

Thanks,
Lisa

@llaplant llaplant self-assigned this Sep 25, 2023
@bradleechang

bradleechang commented Sep 26, 2023

Hi Joel,

Thank you for your question.

Because XML has a fundamentally different architecture from relational databases (hierarchical versus tabular), mapping between the two is never easy. Relational database companies have tackled this problem through XML-specific databases (e.g., "Oracle XML DB").

Regarding your specific question of how to extract the content of a given XML element (e.g., a given subparagraph): it is unclear to us whether OpenRefine will meet your needs. You might try the parseXml() function on the whole document, then select() the portion you need, but it may not perform as well as you need.

A 'best practice' solution would be to adopt XML tools that can use XQuery (https://en.wikipedia.org/wiki/XQuery) to extract information from the XML content set. Tools designed for the XML data model are better suited to reliably extracting data from XML files of any size than tools that were not designed for XML.
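
For readers who want something concrete, here is a minimal sketch of the hierarchy-to-rows mapping discussed above. It is not XQuery and not an official USLM tool; it swaps in Python's standard-library `xml.etree.ElementTree.iterparse` as a stand-in for an XML-aware tool. It assumes that each level element (section, subsection, paragraph, subparagraph, clause) carries an `identifier` attribute and a direct `<num>` child with a `value` attribute; check those assumptions against the USLM schema documentation. The sample document, its namespace, the table name `uslm_level`, and the helper names are all hypothetical.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Level names taken from the list in the question; adjust to the schema.
LEVELS = {"section", "subsection", "paragraph", "subparagraph", "clause"}


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_rows(source):
    """Return one dict per level element, in document order."""
    rows = []
    stack = []  # (row, depth) for each level element that is still open
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            depth += 1
            if name in LEVELS:
                row = {"level": name, "identifier": elem.get("identifier"),
                       "num": None, "path": None, "text": None}
                rows.append(row)
                stack.append((row, depth))
            continue
        # From here on, event == "end".
        if name == "num" and stack and depth == stack[-1][1] + 1:
            # A <num> that is a direct child of the innermost open level.
            stack[-1][0]["num"] = elem.get("value") or (elem.text or "").strip()
        elif name in LEVELS:
            row = stack[-1][0]
            # Path of nums from the outermost open level down to this one.
            row["path"] = "/".join(r["num"] or "" for r, _ in stack)
            # Whitespace-normalized text of the element and its descendants.
            row["text"] = " ".join("".join(elem.itertext()).split())
            stack.pop()
            if not stack:
                elem.clear()  # free the finished outermost subtree
        depth -= 1
    return rows


def load_rows(conn, rows):
    """Write the rows to a single table (hypothetical name and columns)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uslm_level "
        "(level TEXT, identifier TEXT, num TEXT, path TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO uslm_level "
        "VALUES (:level, :identifier, :num, :path, :text)",
        rows,
    )
    conn.commit()


# Hypothetical sample; the namespace and content are made up for illustration.
SAMPLE = b"""<doc xmlns="urn:example:not-the-real-uslm-namespace">
  <section identifier="/s1">
    <num value="1">SEC. 1.</num> <heading>Short title</heading>
    <subsection identifier="/s1/a">
      <num value="a">(a)</num> <content>First.</content>
      <paragraph identifier="/s1/a/1">
        <num value="1">(1)</num> <content>Nested.</content>
      </paragraph>
    </subsection>
  </section>
</doc>"""

rows = extract_rows(io.BytesIO(SAMPLE))
conn = sqlite3.connect(":memory:")
load_rows(conn, rows)
```

Because the parse is event-driven and each outermost level element is cleared once it closes, memory use should stay closer to the size of one section than of the whole file, which may help with the large-file problem mentioned above. Matching on local names sidesteps the namespace question, at the cost of also picking up levels nested inside quoted or amended text.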

@doi-jschlagel
Author

doi-jschlagel commented Sep 26, 2023 via email

@doi-jschlagel
Author

USLM team: we built a good-but-not-perfect parser for USLM that creates database tables from text fields in the USLM schema. The USLM schema documentation and the consistency of the XML made this possible. Happy to share it back to the community.
[Image: screenshot taken 2023-11-16 at 11:12:29 AM]
