- Fixed-Length (
fixedlength2
) Schema in Depthenvelope
Basicsenvelope
s Hierarchyfile_declaration
forfixedlength2
- Sample 1:
file_declaration
for Repeated Single-Rowenvelope
- Sample 2:
file_declaration
for Repeated Multi Fixed-Number-of-Rowsenvelope
- Sample 3:
file_declaration
for Repeated Variable Lengthenvelope
Bounded byheader
/footer
- Sample 4:
file_declaration
for Nested Hierarchicalenvelope
s with Different Types - Fixed-Length IDR Structure
- Migration from
'fixed-length'
Schemas
Note: With the introduction of
fixedlength2
schema, the oldfixed-length
schema is now considered deprecated. It will continue to work, and its documentation remains but it will not be directly linked from the homepage README.md. There is a migration section near the end of this page to illustrate the simple schema migration steps fromfixed-length
tofixedlength2
.
Fixed-length (sometimes also called fixed-width) schema has 3 parts: parser_settings
, file_declaration
,
and transform_declarations
. We've covered parser_settings
in Getting Started;
we've covered transform_declarations
in depth in All About Transforms. Before we go
into file_declaration
, we need to talk about the concept of envelope
.
An envelope
is a basic data unit ingested in from the input, processed and transformed by omniparser for
a fixed length input file. Here are a few examples:
-
Single line
envelope
(from this sample):2019/01/31T12:34:56-0800 10.5 30.2 N 33 37.7749 122.4194 2020/07/31T01:23:45-0500 39 95 SE 8 32.7767 96.7970
Here each line in the input is an
envelope
, individually ingested, processed and transformed. -
Multi-line
envelope
(from this sample):H001TW 0689311345 030 DEL ... H0020689311345 BM 040670500067120 H0020689311345 CN 100000103732 H0020689311345 OA 2400 Highway 155 South H0020689311345 OC Locust Grove H0020689311345 ON THE HOME DEPOT 6705 H0020689311345 OO W841206858 H0020689311345 OP 30248 H0020689311345 OST GA H0020689311345 OT US H0020689311345 PO 7029083C H001TW 0689311348 030 DEL ... H0020689311348 BM 040677700277714 H0020689311348 CN H0020689311348 OA 2500 Highway 155 South H0020689311348 OC Locust Grove H0020689311348 ON THE HOME DEPOT 6777 H0020689311348 OO W938003272 H0020689311348 OP 30248 H0020689311348 OST GA H0020689311348 OT US H0020689311348 PO 6816137C
Each
envelope
consists of 11 consecutive lines, and these 11 consecutive lines are ingested, processed and transformed as a unit.Single line
envelope
is obviously a special case of multi-lineenvelope
. -
Variable length
envelope
encapsulated by aheader
and afooter
(from this sample):A010 20191105 ... ... A999 V010 V ... ... V999 V010 V ... ... V999 ... ... Z001 Z999
Here we have 3 different
envelope
s:- first
envelope
is encapsulated byA010
andA999
. This envelope has one occurrence. - second
envelope
is encapsulated byV010
andV999
. This envelope has multiple occurrences. - last
envelope
is encapsulated byZ001
andZ999
. This envelope has one occurrence.
Note if the
footer
is the same as theheader
, suchenvelope
will consist of only one line. In such case,footer
can be omitted for brevity of schema writing. - first
In most use cases, a fixed-length input consists of a single-line envelope
that repeats itself.
Sometimes, we encounter input that consists of multi-line (but fixed number of lines) envelope
that repeats itself. Occasionally, envelope
with header
/footer
. But in some highly customized
scenarios, we do see some nested and mixed envelope
situation:
<envelope1> {0,}
<envelope2> {1,2}
<envelope3> {0,1}
where in the pseudo example above, envelope1
is followed by envelope2
1 or 2 times, then
followed by envelope3
0 or 1 time, then repeats itself (envelope1
). The following is adapted and
simplified from the nested sample:
HDR...
GRH...
NWR...
SPU...
SPT...
SPT...
NWR...
SPU...
SPU...
SPT...
GRT...
TRL...
While this is a flattened text file with fixed length data encoded within, it, however, represents a hierarchical and nested data structure:
HDR
GRH
NWR
SPU
SPT
GRT
TRL
- There is 1 and only 1
HDR
in the input. HDR
is paired withTRL
.- Underneath
HDR
, there could be 0 or many repeats ofGRH
. - Each
GRH
is paired with aGRT
. - Underneath
GRT
, there could be 0 or many repeats ofNWR
. - There is not closing pair, so to speak, for each of
NWR
. - Underneath
NWR
, there could be 0 or many repeats ofSPU
. - Underneath
SPU
, there could be 0 or many repeats ofSPT
.
If this nested structure embedded in a flat txt file reminds you of EDI, you are not alone :). In the wild business world, there are often such blend of things created: half EDI (with the concepts of hierarchy, loops, min/max occurrences, etc) half fixed-length with all the data pieces are in fixed locations without use of any delimiters.
"file_declaration": {
"envelopes": [ <= required
{
"name": <envelope name>, <= optional
"rows": <integer value>, <= optional
"header": <regex>, <= optional
"footer": <regex>, <= optional
"type": <envelope|envelope_group>, <= optional
"is_target": <true|false>, <= optional
"min": <integer value>, <= optional
"max": <integer value>, <= optional
"columns": [ <= optional
{
"name": "<column name>", <= optional
"start_pos": <integer>, <= required
"length": <integer>, <= required
"line_index": "<integer>", <= optional
"line_pattern": "<line regexp>" <= optional
},
<more columns>
],
"child_envelopes": [ <= optional
<more envelopes>
]
}
]
}
-
name
: the name of theenvelope
is used inxpath
query intransform_declarations
. It is optional in simple use cases where theenvelope
name isn't needed in any of the transform's xpath query, such as in this example. -
rows
: the fixed number of the rows in this envelope. Ifrows
specified,header
/footer
must not be used; If none ofrows
,header
, orfooter
is specified, theenvelope
is defaulted to being a rows basedenvelope
withrows
equals 1. -
header
/footer
: the regex pattern matches the starting and ending lines of a variable-lengthenvelope
.rows
must not be used whenheader
/footer
are used. Iffooter
is omitted, it defaults to be the value ofheader
, which means anenvelope
with onlyheader
specified will only match a single line for the envelope. -
type
: specifies whether anenvelope
is a soldenvelope
or anenvelope_group
which serves as a container for childenvelope
s. Iftype
isenvelope_group
,rows
/header
/footer
must not be used, as they are only relevant to solid/concreteenvelope
. Alsoenvelope_group
cannot co-exists withcolumns
for the same reason. -
is_target
: specifies if anenvelope
is a streaming target.is_target
concept here is the same as in EDI. Only oneenvelope
(group or not) can haveis_target: true
. For ease and brevity of schema writing, if nois_target: true
is specified, then the firstenvelope
will be auto-marked asis_target: true
. -
min
: specifies the minimum number of consecutive occurrences of theenvelope
in the data stream.min
can be specified onenvelope_group
. If omitted,min
defaults to 0. -
max
: specifies the maximum number of consecutive occurrences of theenvelope
in the data stream.max
can be specified onenvelope_group
. If omitted,max
defaults to -1, which means unlimited. -
columns
: specifies a collection of columns of theenvelope
. -
column.start_pos
: specifies the starting character position (1-based) for the column's data. -
column.length
: specifies the length of the column's data. -
column.line_index
: used in multi-lineenvelope
(rows
based orheader
/footer
based) where the index indicates which line this column's data will be extracted from. 1-based. -
column.line_pattern
: used in multi-lineenvelope
(rows
based orheader
/footer
based) where the pattern identifies which line this column's data will be extracted from. -
child_envelopes
: specifies, recursively, any hierarchical and nested child envelope structure.
Full sample input is here. Full sample schema is here.
"file_declaration": {
"envelopes" : [
{
"columns": [
{ "name": "DATE", "start_pos": 1, "length": 24 },
{ "name": "HIGH_TEMP_C", "start_pos": 26, "length": 4 },
{ "name": "LOW_TEMP_F", "start_pos": 31, "length": 4 },
{ "name": "WIND_DIR", "start_pos": 36, "length": 2 },
{ "name": "WIND_SPEED_KMH", "start_pos": 9, "length": 2 },
{ "name": "LAT", "start_pos": 42, "length": 8 },
{ "name": "LONG", "start_pos": 51, "length": 8 }
]
}
]
}
Note, many of the envelope
fields are omitted to use default values for simplicity and brevity:
rows
/header
/footer
omitted and defaulted torows:1
.type
omitted and defaulted totype:envelope
.is_target
omitted and defaulted tois_target:true
.min
omitted and defaulted tomin:0
.max
omitted and defaulted tomax:-1
.column.line_index
omitted since eachenvelope
contains only 1 line.column.line_pattern
omitted since eachenvelope
contains only 1 line.
Full sample input is here. Full sample schema is here.
"file_declaration": {
"envelopes" : [
{
"rows": 11,
"columns": [
{ "name": "tracking_number_h001", "start_pos": 464, "length": 30, "line_pattern": "^H001" },
{ "name": "destination_country", "start_pos": 607, "length": 2, "line_pattern": "^H001" },
{ "name": "guaranteed_delivery_date", "start_pos": 376, "length": 8, "line_pattern": "^H001" },
{ "name": "event_date", "start_pos": 142, "length": 8, "line_pattern": "^H001" },
{ "name": "event_time", "start_pos": 150, "length": 8, "line_pattern": "^H001" },
{ "name": "event_timezone", "start_pos": 158, "length": 4, "line_pattern": "^H001" },
{ "name": "event_city", "start_pos": 564, "length": 30, "line_pattern": "^H001" },
{ "name": "event_state", "start_pos": 594, "length": 2, "line_pattern": "^H001" },
{ "name": "scan_facility_zip", "start_pos": 596, "length": 11, "line_pattern": "^H001" },
{ "name": "tracking_number_h002_cn", "start_pos": 28, "length": 50, "line_pattern": "^H002.{19}CN" }
]
}
]
}
Full sample input is here. Full sample schema is here.
"file_declaration": {
"envelopes" : [
{
"name": "HEADER", "min": 1, "max": 1, "header": "^A010", "footer": "^A999",
"columns": [ { "name": "carrier", "start_pos": 6, "length": 6, "line_pattern": "^A060" } ]
},
{
"name": "REC", "header": "^V010", "footer": "^V999", "is_target": true,
"columns": [
{ "name": "tracking_number", "start_pos": 6, "length": 15, "line_pattern": "^V020" },
{ "name": "delivery_date", "start_pos": 6, "length": 8, "line_pattern": "^V045" },
{ "name": "observation_type", "start_pos": 6, "length": 1, "line_pattern": "^V060" },
{ "name": "reason_for_observation", "start_pos": 6, "length": 2, "line_pattern": "^V070" },
{ "name": "date_observation", "start_pos": 6, "length": 8, "line_pattern": "^V080" },
{ "name": "time_observation", "start_pos": 6, "length": 6, "line_pattern": "^V081" },
{ "name": "weight_in_grams", "start_pos": 6, "length": 6, "line_pattern": "^V110" },
{ "name": "postal_code_addressee", "start_pos": 6, "length": 6, "line_pattern": "^V160" },
{ "name": "city_name_addressee", "start_pos": 6, "length": 24, "line_pattern": "^V180" },
{ "name": "country_code_addressee", "start_pos": 6, "length": 2, "line_pattern": "^V200" }
]
},
{ "name": "FOOTER", "min": 1, "max": 1, "header": "^Z001", "footer": "^Z999" }
]
}
Note the use of is_target:true
on envelope
named REC
.
Full sample input is here. Full sample schema is here.
"file_declaration": {
"envelopes": [
{
"name": "HDR", "header": "^HDR", "min": 1, "max": 1,
"child_envelopes": [
{
"name": "GRH", "header": "^GRH",
"child_envelopes": [
{
"name": "NWR", "header": "^NWR", "is_target": true,
"columns": [ { "name": "title", "start_pos": 20, "length": 60 } ],
"child_envelopes": [
{
"name": "SPU", "header": "^SPU",
"columns": [ { "name": "publisher_name", "start_pos": 31, "length": 45 } ],
"child_envelopes": [
{
"name": "SPT", "header": "^SPT",
"columns": [ { "name": "territory_id", "start_pos": 51, "length": 4 } ]
}
]
},
{
"name": "SWR", "header": "^SWR",
"columns": [
{ "name": "writer_id", "start_pos": 20, "length": 9 },
{ "name": "last_name", "start_pos": 29, "length": 45 }
],
"child_envelopes": [
{
"name": "SWT", "header": "^SWT",
"columns": [ { "name": "territory_id", "start_pos": 45, "length": 4 } ]
}
]
}
]
},
{ "name": "GRT", "header": "^GRT" }
]
},
{ "name": "TRL", "header": "^TRL" }
]
}
]
}
The schema is lengthy but fairly straightforward:
- The input/schema has a singleton/global header
envelope
namedHDR
, it is paired with a trailerenvelope
TRL
. Therefore bothenvelope
s specifymin
/max
to 1. - Inside the global singleton header/trailer, we have unlimited repeats of an
envelope
namedGRH
, which is also paired with a trailerGRT
. - Under
GRH
, we have unlimited repeats ofNWR
which is our streaming and transform target. - Each
NWR
can have two "loopy" children,SPU
andSWR
, each of which has unlimited repeats of childenvelope
,SPT
andSWT
, respectively. - Each envelope is a single line, thus no
line_index
orline_pattern
is used.
See here for more details.
If one looks at the documentation for the old fixed-length
schema
here, you notice fixed-length
and fixedlength2
schemas are really
similar! It certainly makes migration incredibly trivial:
-
For single-row
envelope
based schema, migrating fromfixed-length
schema tofixedlength2
schema requires only 1 change:"parser_settings": { "version": "omni.2.1", "file_format_type": "fixed-length" ===> change it to "fixedlength2", yes the "-" is gone. }
-
For multi-row
envelope
based schema, migrating fromfixed-length
schema tofixedlength2
schema requires 2 changes:- Change 1: the same
parser_settings.file_format_type
change mentioned above. - Change 2:
"file_declaration": { "envelopes" : [ { "by_rows": 11, ===> change it to "rows". "columns": [ ... ... ] } ] },
- Change 1: the same
-
For
header
/footer
envelope
based schema, migrating fromfixed-length
schema tofixedlength2
schema requires 3 changes:- Change 1: the same
parser_settings.file_format_type
change mentioned above. - Change 2:
Any mentioning of
"file_declaration": { "envelopes" : [ { "by_header_footer": { "header": "^A010", "footer": "^A999" }, ... } ] },
"by_header_footer": { "header": "^A010", "footer": "^A999" }
changes to"header": "^A010", "footer": "^A999"
. Basically remove theby_header_footer
field, and moveheader
andfooter
one level up:"file_declaration": { "envelopes" : [ { "header": "^A010", "footer": "^A999", ... } ] },
- Change 3:
Any mentioning of
"file_declaration": { "envelopes" : [ { "not_target": true, ... } { ... } ] },
"not_target": true
, remove that line. Anyenvelope
that doesn't have"not_target": true
(there should be one and only one of suchenvelope
), add a line"is_target": true
to it:"file_declaration": { "envelopes" : [ { ... } { "is_target": true, ... } ] },
- Change 1: the same
That's all the migration changes needed!