Adjust
kou committed Dec 24, 2024
1 parent 44c48bf commit ad49371
Showing 1 changed file with 62 additions and 74 deletions.
136 changes: 62 additions & 74 deletions docs/source/format/StatisticsSchema.rst
@@ -34,8 +34,8 @@ be read as Apache Arrow data may have statistics. For example, the
Apache Parquet C++ implementation can read an Apache Parquet file as
Apache Arrow data and the Apache Parquet file may have statistics.

-We standardize the representation of statistics as an Apache Arrow array
-for ease of exchange.
+We standardize the representation of statistics as an Apache Arrow
+array for ease of exchange.

Use case
--------
@@ -86,7 +86,7 @@ Here is the outline of the schema for statistics::
struct<
column: int32,
statistics: map<
-key: dictionary<values=utf8, indices=int32>,
+key: dictionary<values: utf8, indices: int32>,
items: dense_union<...all needed types...>
>
>
@@ -124,12 +124,13 @@ Here is the details of the ``map`` of the ``statistics``:
- Nullable
- Notes
* - key
-- ``dictionary<values=utf8, indices=int32>``
+- ``dictionary<values: utf8, indices: int32>``
- ``false``
-- The string key is the name of the statistic. Dictionary-encoding is used for
-efficiency as the same statistic may be repeated for different columns.
-Different keys are assigned for exact and
-approximate statistic values. Each statistic has their own description below.
+- The string key is the name of the
+statistic. Dictionary-encoding is used for efficiency as the
+same statistic may be repeated for different columns.
+Different keys are assigned for exact and approximate statistic
+values. Each statistic has their own description below.
* - items
- ``dense_union``
- ``false``
@@ -139,31 +140,31 @@ Here is the details of the ``map`` of the ``statistics``:
``int64`` distinct count statistic and a ``float64`` average
byte width statistic. See the description of each statistic below.

-Dense union has name for each field but we don't standardize field names for the dense union because we
-can access to proper field by type code not name. So we can use
-any valid name for fields.
+Dense union has name for each field but we don't standardize
+field names for the dense union because we can access to proper
+field by type code not name. So we can use any valid name for
+fields.

.. _statistics-schema-key:

Standard statistics
-------------------

-Each statistic kind has a name that appears as a key in the statistics map
-for each column or entire table. ``dictionary<values=utf8, indices=int32>``
-is used to encode the key for space-efficiency.
+Each statistic kind has a name that appears as a key in the statistics
+map for each column or entire table. ``dictionary<values: utf8,
+indices: int32>`` is used to encode the key for space-efficiency.

We assign different names for variations of the same statistic instead
of using flags. For example, we assign different statistic names for
-exact and approximate values of the "distinct_count" statistic.
+exact and approximate values of the "distinct count" statistic.

The colon symbol ``:`` is to be used as a namespace separator like
:ref:`format_metadata`. It can be used multiple times in a key.

-The ``ARROW`` prefix is a reserved namespace for pre-defined
-statistic names in current and future versions of this specification.
-User-defined statistics must not use it.
-For example, you can use your product name as namespace
-such as ``MY_PRODUCT:my_statistics:exact``.
+The ``ARROW`` prefix is a reserved namespace for pre-defined statistic
+names in current and future versions of this specification.
+User-defined statistics must not use it. For example, you can use your
+product name as namespace such as ``MY_PRODUCT:my_statistics:exact``.

Here are pre-defined statistics keys:

@@ -223,13 +224,12 @@ Here are pre-defined statistics keys:
- The number of rows in the target table, record batch or
array. (approximate)

-If you find a statistic that might be useful to multiple
-systems, please propose it on the `Apache Arrow development
-mailing-list <https://arrow.apache.org/community/>`__.
+If you find a statistic that might be useful to multiple systems,
+please propose it on the `Apache Arrow development mailing-list
+<https://arrow.apache.org/community/>`__.

-Interoperability improves when producers and consumers of
-statistics follow a previously agreed upon statistic
-specification.
+Interoperability improves when producers and consumers of statistics
+follow a previously agreed upon statistic specification.

.. _statistics-schema-examples:

@@ -304,10 +304,7 @@ Statistics schema::
struct<
column: int32,
statistics: map<
-key: dictionary<
-indices: int32,
-dictionary: utf8
->,
+key: dictionary<values: utf8, indices: int32>,
items: dense_union<0: int64>
>
>
@@ -327,6 +324,13 @@ Statistics array::
]
statistics:
key:
+values: [
+"ARROW:row_count:exact",
+"ARROW:null_count:exact",
+"ARROW:distinct_count:exact",
+"ARROW:max_value:exact",
+"ARROW:min_value:exact",
+],
indices: [
0, # "ARROW:row_count:exact"
1, # "ARROW:null_count:exact"
@@ -338,13 +342,6 @@ Statistics array::
3, # "ARROW:max_value:exact"
4, # "ARROW:min_value:exact"
]
-dictionary: [
-"ARROW:row_count:exact",
-"ARROW:null_count:exact",
-"ARROW:distinct_count:exact",
-"ARROW:max_value:exact",
-"ARROW:min_value:exact",
-],
items:
children:
0: [ # int64
@@ -478,10 +475,7 @@ Statistics schema::
struct<
column: int32,
statistics: map<
-key: dictionary<
-indices: int32,
-dictionary: utf8
->,
+key: dictionary<values: utf8, indices: int32>,
items: dense_union<
# For the number of rows, the number of nulls and so on.
0: int64,
@@ -511,6 +505,15 @@ Statistics array::
]
statistics:
key:
+values: [
+"ARROW:row_count:exact",
+"ARROW:null_count:exact",
+"ARROW:distinct_count:exact",
+"ARROW:max_value:approximate",
+"ARROW:min_value:approximate",
+"ARROW:max_value:exact",
+"ARROW:min_value:exact",
+]
indices: [
0, # "ARROW:row_count:exact"
1, # "ARROW:null_count:exact"
@@ -527,15 +530,6 @@ Statistics array::
1, # "ARROW:null_count:exact"
2, # "ARROW:distinct_count:exact"
]
-dictionary: [
-"ARROW:row_count:exact",
-"ARROW:null_count:exact",
-"ARROW:distinct_count:exact",
-"ARROW:max_value:approximate",
-"ARROW:min_value:approximate",
-"ARROW:max_value:exact",
-"ARROW:min_value:exact",
-],
items:
children:
0: [ # int64
@@ -639,10 +633,7 @@ Statistics schema::
struct<
column: int32,
statistics: map<
-key: dictionary<
-indices: int32,
-dictionary: utf8
->,
+key: dictionary<values: utf8, indices: int32>,
items: dense_union<0: int64>
>
>
@@ -658,20 +649,20 @@ Statistics array::
]
statistics:
key:
+values: [
+"ARROW:row_count:exact",
+"ARROW:null_count:exact",
+"ARROW:distinct_count:exact",
+"ARROW:max_value:exact",
+"ARROW:min_value:exact",
+]
indices: [
0, # "ARROW:row_count:exact"
1, # "ARROW:null_count:exact"
2, # "ARROW:distinct_count:exact"
3, # "ARROW:max_value:exact"
4, # "ARROW:min_value:exact"
]
-dictionary: [
-"ARROW:row_count:exact",
-"ARROW:null_count:exact",
-"ARROW:distinct_count:exact",
-"ARROW:max_value:exact",
-"ARROW:min_value:exact",
-],
items:
children:
0: [ # int64
@@ -783,10 +774,7 @@ Statistics schema::
struct<
column: int32,
statistics: map<
-key: dictionary<
-indices: int32,
-dictionary: utf8
->,
+key: dictionary<values: utf8, indices: int32>,
items: dense_union<
# For the number of rows, the number of nulls and so on.
0: int64,
@@ -814,6 +802,15 @@ Statistics array::
]
statistics:
key:
+values: [
+"ARROW:row_count:exact",
+"ARROW:null_count:exact",
+"ARROW:distinct_count:exact",
+"ARROW:max_value:approximate",
+"ARROW:min_value:approximate",
+"ARROW:max_value:exact",
+"ARROW:min_value:exact",
+]
indices: [
0, # "ARROW:row_count:exact"
1, # "ARROW:null_count:exact"
@@ -828,15 +825,6 @@ Statistics array::
3, # "ARROW:max_value:approximate"
4, # "ARROW:min_value:approximate"
]
-dictionary: [
-"ARROW:row_count:exact",
-"ARROW:null_count:exact",
-"ARROW:distinct_count:exact",
-"ARROW:max_value:approximate",
-"ARROW:min_value:approximate",
-"ARROW:max_value:exact",
-"ARROW:min_value:exact",
-],
items:
children:
0: [ # int64
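For readers unfamiliar with the ``values``/``indices`` layout that this commit adopts for the dictionary-encoded key, the following is a minimal sketch in plain Python of how such keys could be expanded and checked against the reserved ``ARROW`` namespace. It does not use the Arrow libraries, and the helper names (``decode_keys``, ``split_key``, ``is_predefined``) are hypothetical and not part of any Arrow API. Treating "reserved" as "first ``:``-separated segment equals ``ARROW``" is an assumption made for illustration.

```python
# Reserved namespace, as stated in the specification text in the diff.
RESERVED_NAMESPACE = "ARROW"


def decode_keys(values, indices):
    """Expand dictionary-encoded keys: each index selects one entry of values."""
    return [values[i] for i in indices]


def split_key(key):
    """Split a statistic key on the ':' namespace separator."""
    return key.split(":")


def is_predefined(key):
    """Assumption: a key is pre-defined if its first segment is the reserved namespace."""
    return split_key(key)[0] == RESERVED_NAMESPACE


# Data copied from one of the "Statistics array" examples in the diff.
values = [
    "ARROW:row_count:exact",
    "ARROW:null_count:exact",
    "ARROW:distinct_count:exact",
    "ARROW:max_value:exact",
    "ARROW:min_value:exact",
]
indices = [0, 1, 2, 3, 4]

keys = decode_keys(values, indices)
user_key = "MY_PRODUCT:my_statistics:exact"
```

Here ``keys`` holds the five ``ARROW:...`` strings in index order, and ``is_predefined(user_key)`` is false because ``MY_PRODUCT`` is a user namespace.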