Split HMDB All Metabolites XML

Mr. George L. Malone

21st of April, 2021

Overview

This repository documents the scripts and requirements for splitting and storing the HMDB All Metabolites XML. The operations are written in the Ruby programming language and rely on certain hardcoded data, such as the initial XML declaration and the document opening and closing tags, including the namespace declaration. The XML document containing all metabolite data is approximately 4.1 GB, which was too large for Nokogiri to handle. Ruby was used instead to take advantage of the File.foreach method, which is easy to use and reads the file one line at a time, in order to maximise speed and minimise RAM usage.
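As a minimal sketch of what the hardcoded data might look like, the snippet below wraps the lines of a single metabolite element in a declaration and root tags so that the result is a standalone document. The constant names, the `wrap_metabolite` helper, and the exact declaration and namespace strings are assumptions for illustration; they should be checked against the first lines of the real file and against the actual script.

```ruby
# Assumed values: verify them against the head of the downloaded HMDB file.
XML_DECLARATION = %(<?xml version="1.0" encoding="UTF-8"?>\n)
HMDB_OPEN = %(<hmdb xmlns="http://www.hmdb.ca">\n)
HMDB_CLOSE = "</hmdb>\n"

# Hypothetical helper: wrap the lines of one <metabolite> element so that
# they form a well-formed document with its own declaration and root tag.
def wrap_metabolite(lines)
  XML_DECLARATION + HMDB_OPEN + lines.join + HMDB_CLOSE
end
```

Hardcoding these strings avoids having to parse the head of the large file, at the cost of needing an update if the source file's declaration or namespace ever changes.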

Operations

After initial setup and variable declarations, the file is opened and the rows are iterated over. If the row is the XML declaration or the hmdb opening or closing tag, it is skipped. Otherwise, the row text is pushed to the output object. If the ID of the current metabolite is nil, the current row text is checked for the ID inside the primary accession tag, and the ID is assigned if found^[n1]. If the row is the metabolite closing tag, the current data are written out using the current ID. If the current ID is still nil at that point, the nil-ID count is used with the IDNA_ prefix, and the nil-ID count is incremented. The output data hash is then reset, which makes me a bit concerned about scoping, but it appears to be working properly.

[n1]: I realise it's not generally a good idea to parse XML using regex, but this is a line-by-line/within-text operation, rather than a global parse/search.
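The steps above can be sketched roughly as follows. This is not the repository's script: the method name, the patterns, the use of an array rather than a hash for the output data, the output file naming, and the nil-ID count starting at zero are all assumptions made for illustration. For brevity it writes the bare metabolite element without the hardcoded declaration and root tags.

```ruby
require "fileutils"

# Assumed patterns; the tag names are guesses at the file layout.
SKIP_PATTERN = %r{\A\s*(?:<\?xml|<hmdb[\s>]|</hmdb>)}
ID_PATTERN = %r{<accession>([^<]+)</accession>}
METABOLITE_CLOSE = "</metabolite>"

# Hypothetical method: stream the file line by line and write one file
# per metabolite. Returns the list of IDs that were written.
def split_metabolites(input_path, output_dir)
  FileUtils.mkdir_p(output_dir)
  lines = []
  current_id = nil
  nil_id_count = 0
  written = []

  File.foreach(input_path) do |line|
    # Skip the declaration and the root opening and closing tags.
    next if line.match?(SKIP_PATTERN)

    lines << line
    # Only the first accession seen is kept, so later ones are ignored.
    current_id ||= line[ID_PATTERN, 1]
    next unless line.include?(METABOLITE_CLOSE)

    # Fall back to a numbered placeholder when no ID was found.
    if current_id.nil?
      current_id = "IDNA_#{nil_id_count}"
      nil_id_count += 1
    end

    File.write(File.join(output_dir, "#{current_id}.xml"), lines.join)
    written << current_id

    # Reset state for the next metabolite.
    lines = []
    current_id = nil
  end

  written
end
```

On the scoping concern: in this sketch the variables are assigned before the block, so reassigning them inside the block rebinds the same outer variables rather than creating block-local ones. Only one metabolite's lines are held in memory at a time.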

Conversion to Julia

Recently, the operations were transcribed into the Julia programming language as a bit of a thought exercise. The resulting data are identical, but the Julia version completes the operations much faster.

