RiskLib 0.1 Parsing

Overview:

The OpenQuake engines need to support a variety of input and output formats. At a later stage in the project it will likely make sense to develop a rigorous, documented formal exchange format. At this point, it's most important for us to:

Not duplicate work
Get things working end-to-end
Support as diverse a group of real-world users as possible, as early as possible

With this in mind, I suggest that, rather than undertaking formal development of the data format specification, we simply treat it as an area of common development. However, this means it has to be truly COMMON: one set of Python modules that everyone collaborates on. I expect to see a tremendous amount of discussion, either on Skype and IRC, or on a mailing list (if folks would like to take the time to develop a well-reasoned rationale for their approach). From a technical standpoint, let's make sure we're using the appropriate underlying Python classes for each type of input file:

If it's a data file (i.e., a format we need to support for both input and output), use the Python "codecs" module, and implement IncrementalEncoder and IncrementalDecoder.
If it's a configuration file, first check whether a --flagfile would do the job before reaching for properties/ini config files.
When you're writing your parsing library, make sure you can round-trip the data (decode a file, then encode the result back to a file, and end up with equivalent files).
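
To make the codecs suggestion and the round-trip check concrete, here is a minimal sketch. The record layout (one "site_id,value" line per record) and the names RecordEncoder, RecordDecoder and decode_in_chunks are made up for illustration and are not an agreed RiskLib format. Note also that this bends the usual codecs contract a little: the decoder hands back Python tuples rather than text.

```python
import codecs


class RecordEncoder(codecs.IncrementalEncoder):
    """Encode (site_id, value) tuples as ASCII 'site_id,value' lines (hypothetical format)."""

    def encode(self, records, final=False):
        # repr() of a float round-trips exactly through float().
        lines = ["%s,%r\n" % (site_id, value) for site_id, value in records]
        return "".join(lines).encode("ascii")


class RecordDecoder(codecs.IncrementalDecoder):
    """Decode byte chunks into (site_id, value) tuples, buffering partial lines."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.pending = b""

    def decode(self, data, final=False):
        self.pending += data
        # Everything before the last newline is complete; keep the rest for later.
        *complete, self.pending = self.pending.split(b"\n")
        if final and self.pending:
            complete.append(self.pending)
            self.pending = b""
        records = []
        for line in complete:
            if not line:
                continue
            site_id, value = line.decode("ascii").split(",")
            records.append((site_id, float(value)))
        return records

    def reset(self):
        self.pending = b""


def decode_in_chunks(data, chunk_size):
    """Feed data to a fresh decoder in small chunks, as a buffered reader would."""
    decoder = RecordDecoder()
    records = []
    for start in range(0, len(data), chunk_size):
        records.extend(decoder.decode(data[start:start + chunk_size]))
    records.extend(decoder.decode(b"", final=True))
    return records


original = [("site-1", 0.25), ("site-2", 1.5)]
encoded = RecordEncoder().encode(original, final=True)
decoded = decode_in_chunks(encoded, chunk_size=5)
re_encoded = RecordEncoder().encode(decoded, final=True)
```

The round-trip requirement then boils down to checking that decoded equals original and that re_encoded equals encoded, which is a cheap test to keep next to every parser we add.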

Note also that the Python zlib_codec supports on-the-fly decompression, which is a useful optimization for large binary datasets (decompressing is almost always faster than the extra disk I/O needed to read the uncompressed data).
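
As a rough sketch of what on-the-fly decompression could look like, the snippet below pulls compressed bytes from any file-like object and yields decompressed pieces as they become available. The helper name iter_decompressed and the chunk size are assumptions for illustration; the only library pieces used are codecs.getincrementaldecoder and codecs.encode with the "zlib_codec" name.

```python
import codecs
import io


def iter_decompressed(stream, chunk_size=64 * 1024):
    """Yield decompressed byte chunks from a zlib-compressed binary stream."""
    decoder = codecs.getincrementaldecoder("zlib_codec")()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decoder.decode(chunk)
        if data:
            yield data
    # Flush whatever the decompressor is still holding.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Stand-in for a large dataset on disk; a real caller would pass open(path, "rb").
payload = b"0.1,0.2,0.3\n" * 1000
compressed = codecs.encode(payload, "zlib_codec")
restored = b"".join(iter_decompressed(io.BytesIO(compressed), chunk_size=128))
```

Because the generator never holds more than one chunk of compressed input at a time, the same pattern should work for files much larger than memory.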

Some research and references:

REQUIREMENTS:

Fast serialization and deserialization (how fast is still to be quantified)
Buffered deserialization
Straightforward ETL / simple translations
Schema and schema validation (nice-to-have)
Human-readable (nice-to-have)
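
One possible way to start putting numbers on the "fast" requirement is a small timing harness like the one below. The payload shape, the repeat count and the helper name time_round_trip are assumptions, and json and pickle are just two stand-ins from the standard library; the other candidates listed under OPTIONS would plug in the same way.

```python
import json
import pickle
import timeit


def time_round_trip(dumps, loads, payload, repeats=20):
    """Return the average seconds per serialize + deserialize round trip."""
    total = timeit.timeit(lambda: loads(dumps(payload)), number=repeats)
    return total / repeats


# Hypothetical payload: a list of small per-site records.
payload = [{"site": i, "loss": i * 0.5} for i in range(1000)]

results = {
    "json": time_round_trip(json.dumps, json.loads, payload),
    "pickle": time_round_trip(pickle.dumps, pickle.loads, payload),
}

for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print("%-8s %.6f s per round trip" % (name, seconds))
```

The absolute numbers depend heavily on the machine and on the payload, so this is only useful for comparing formats against each other on realistic data.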

OPTIONS:

XML
Binary Data, using:
- marshal / cPickle
- Protobuf
- Thrift (http://en.wikipedia.org/wiki/Thrift_(protocol) http://incubator.apache.org/thrift/)
- Properties File (plist binary format)
- BSON
- Redis aof (append-only file)
ASCII
- Properties / ini file
- CSV (with built-in Python modules)
- YAML
- JSON (a YAML subset)
