Extract USLM to database tables #31
Hi Joel, Files are also available from the GovInfo Bulk Data Repository at https://www.govinfo.gov/bulkdata/COMPS. I am also passing your questions to others on the USLM team. Thanks,
Hi Joel,
Thank you for your question.
Because XML has a fundamentally different architecture from relational databases (hierarchical versus tabular), mapping between the two is never easy. Relational database companies have tackled this problem through XML-specific databases (e.g. "Oracle XML DB").
Regarding your specific question of how to extract the content of a given XML element (e.g. a given subparagraph): it is unclear to us whether OpenRefine will meet your needs. You might try the parseXml() function on the whole document, then select() the portion you need, but it may not perform as well as you need.
A 'best practice' solution would be to adopt XML tools that can use XQuery to extract information from the XML content set (https://en.wikipedia.org/wiki/XQuery). Tools designed for the XML data model are better suited to reliably extracting data from XML files of any size than tools that were not designed for XML.
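For readers following along, here is a minimal sketch of pulling out one element (a subparagraph, say) by its identifier. It uses Python's standard-library ElementTree as a stand-in and is not XQuery itself. The sample document, its namespace URI, and the identifier values are made up for illustration; the sketch only assumes that each level carries an `identifier` attribute, as the question describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical USLM-like fragment. The root element, namespace URI, and
# identifier values are invented for this example.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/ex/s1">
    <num value="1">SEC. 1.</num>
    <subsection identifier="/ex/s1/a">
      <num value="a">(a)</num>
      <paragraph identifier="/ex/s1/a/1">
        <num value="1">(1)</num>
        <subparagraph identifier="/ex/s1/a/1/A">
          <num value="A">(A)</num>
          <content>First <ref>cross-reference</ref> text.</content>
        </subparagraph>
      </paragraph>
    </subsection>
  </section>
</doc>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_by_identifier(root, level, identifier):
    """Return the first element named `level` with the given identifier, or None."""
    for elem in root.iter():
        if local_name(elem.tag) == level and elem.get("identifier") == identifier:
            return elem
    return None


def text_of(elem):
    """Concatenate all nested text and collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


root = ET.fromstring(SAMPLE)
hit = find_by_identifier(root, "subparagraph", "/ex/s1/a/1/A")
print(text_of(hit))
```

Matching on the local tag name keeps the sketch independent of whichever namespace a given file declares. A dedicated XQuery or XPath engine would express the same lookup more compactly.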
Thank you all so much for getting back to me!
We are going to press on then, knowing that we are at least not duplicating effort. Oracle XML DB is the next stop for sure after this proof of concept.
Our ultimate goal is a searchable database of the full text of laws relevant to our mission (Army Corps of Engineers Civil Works) that we can link to our appropriations and expenditures.
If anyone is interested in this type of work, please feel free to forward my contact info.
Thank you again for the awesome work at GPO!
-joel
In reply to bradleechang's message of Monday, September 25, 2023 at 8:19 PM.
Hello! I am working to extract data from several collections at https://www.govinfo.gov/app/collection/comps/w into database tables. I want to store the key content for each law in one table, as follows:
section identifier
section – subsection identifier
section – subsection – paragraph identifier
section – subsection – paragraph – subparagraph identifier
section – subsection – paragraph – subparagraph – clause identifier
section num
section – subsection num
section – subsection – paragraph num
section – subsection – paragraph – subparagraph num
section – subsection – paragraph – subparagraph – clause num
etc.
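The column layout above could be produced roughly as follows. This is a sketch under assumptions, not a tested pipeline: it assumes each level element has an `identifier` attribute and a `num` child with a `value` attribute, and the sample document, namespace URI, and column names (`section_identifier`, `section_num`, and so on) are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hierarchy levels taken from the list above, outermost first.
LEVELS = ("section", "subsection", "paragraph", "subparagraph", "clause")
# Hypothetical column names: one identifier and one num column per level.
FIELDS = ["level"] + [f"{lvl}_{kind}" for lvl in LEVELS for kind in ("identifier", "num")]

# Made-up sample; real files will differ in root element and namespace.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/ex/s1">
    <num value="1">SEC. 1.</num>
    <subsection identifier="/ex/s1/a">
      <num value="a">(a)</num>
      <paragraph identifier="/ex/s1/a/1">
        <num value="1">(1)</num>
      </paragraph>
      <paragraph identifier="/ex/s1/a/2">
        <num value="2">(2)</num>
      </paragraph>
    </subsection>
  </section>
</doc>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, context=None, rows=None):
    """Walk the tree and emit one row per level element, carrying ancestor values."""
    context = dict(context or {})  # copy so siblings do not see each other's values
    rows = [] if rows is None else rows
    name = local_name(elem.tag)
    if name in LEVELS:
        num = next((c for c in elem if local_name(c.tag) == "num"), None)
        context[f"{name}_identifier"] = elem.get("identifier")
        context[f"{name}_num"] = num.get("value") if num is not None else None
        rows.append({"level": name, **context})
    for child in elem:
        flatten(child, context, rows)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; columns a row lacks are left empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = flatten(ET.fromstring(SAMPLE))
csv_text = to_csv(rows)
print(csv_text)
```

Each row repeats the identifiers and nums of its ancestors, so a paragraph row also carries its section and subsection values. That keeps everything in a single table at the cost of some redundancy.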
I am using OpenRefine to extract the data to CSV and then load it into the database. OpenRefine does an OK job, but it (a) does not seem able to get all of the content and (b) does not deal well with large files.
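On the large-file point, one option outside OpenRefine is a streaming parse, which handles one section at a time instead of loading the whole document. The sketch below uses `iterparse` from Python's standard library. The sample data and namespace URI are made up, and it assumes sections are not nested inside other sections.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample standing in for a large file.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/ex/s1">
    <num value="1">SEC. 1.</num>
    <content>Text one.</content>
  </section>
  <section identifier="/ex/s2">
    <num value="2">SEC. 2.</num>
    <content>Text two.</content>
  </section>
</doc>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_sections(source):
    """Yield (identifier, text) per section from a path or binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "section":
            text = " ".join("".join(elem.itertext()).split())
            yield elem.get("identifier"), text
            # Release the section's children and text once it has been handled.
            # The emptied element itself stays attached to the root.
            elem.clear()


for ident, text in iter_sections(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(ident, text)
```

Passing a file path instead of the `BytesIO` object works the same way. Each yielded pair could be written straight to CSV or inserted into the database before the next section is read.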
I wonder if I am reinventing the wheel here, and whether the USLM team has any ideas on best practices for extracting content from the XML.
Thank you, and thanks for the good work!
-Joel