Python: Compute parquet stats #7831
Conversation
Hey @maxdebayser thanks for raising this PR. I had something different in mind with the issue. The main problem here is that we pull a lot of data through Python, which brings in a lot of issues (the binary conversion that's probably expensive, and limited scalability due to the GIL). I just did a quick stab (and should have done that sooner), and found the following:

➜ Desktop python3
Python 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
... "Brittle stars", "Centipede"]})
>>> metadata_collector = []
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(
... table, '/tmp/table',
... metadata_collector=metadata_collector)
>>> metadata_collector
[<pyarrow._parquet.FileMetaData object at 0x11f955850>
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 2
num_rows: 6
num_row_groups: 1
format_version: 1.0
serialized_size: 0]
>>> metadata_collector[0].row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
num_columns: 2
num_rows: 6
total_byte_size: 256
>>> metadata_collector[0].row_group(0).to_dict()
{
'num_columns': 2,
'num_rows': 6,
'total_byte_size': 256,
'columns': [{
'file_offset': 119,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'INT64',
'num_values': 6,
'path_in_schema': 'n_legs',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 2,
'max': 100,
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'INT64'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 4,
'data_page_offset': 46,
'total_compressed_size': 115,
'total_uncompressed_size': 117
}, {
'file_offset': 359,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'BYTE_ARRAY',
'num_values': 6,
'path_in_schema': 'animal',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 'Brittle stars',
'max': 'Parrot',
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'BYTE_ARRAY'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 215,
'data_page_offset': 302,
'total_compressed_size': 144,
'total_uncompressed_size': 139
}]
}

I think it is much better to retrieve the min-max from there. This is done by PyArrow and is probably much faster than when we do it in Python. I really would like to stay away from processing data as much as possible. I think the complexity here is:
It looks like the metadata is also available on the plain writer, where we don't have to write a dataset:

➜ Desktop python3
Python 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>>
>>> writer = pq.ParquetWriter('/tmp/vo.parquet', table.schema)
>>> writer.write_table(table)
>>> writer.close()
>>> writer.writer.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x1375b9080>
num_columns: 2
num_rows: 6
total_byte_size: 256
>>> writer.writer.metadata.row_group(0).to_dict()
{
'num_columns': 2,
'num_rows': 6,
'total_byte_size': 256,
'columns': [{
'file_offset': 119,
'file_path': '',
'physical_type': 'INT64',
'num_values': 6,
'path_in_schema': 'n_legs',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 2,
'max': 100,
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'INT64'
},
'compression': 'SNAPPY',
'encodings': ('RLE_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 4,
'data_page_offset': 46,
'total_compressed_size': 115,
'total_uncompressed_size': 117
}, {
'file_offset': 359,
'file_path': '',
'physical_type': 'BYTE_ARRAY',
'num_values': 6,
'path_in_schema': 'animal',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 'Brittle stars',
'max': 'Parrot',
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'BYTE_ARRAY'
},
'compression': 'SNAPPY',
'encodings': ('RLE_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 215,
'data_page_offset': 302,
'total_compressed_size': 144,
'total_uncompressed_size': 139
}]
}
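Either way (a metadata_collector with write_to_dataset, or a plain ParquetWriter), the per-file min/max could be derived by folding over the row-group statistics. A minimal sketch, assuming only the pyarrow APIs shown in the transcripts above; the helper name is hypothetical:

import pyarrow as pa
import pyarrow.parquet as pq

def file_min_max(file_metadata: pq.FileMetaData) -> dict:
    # Hypothetical helper: fold per-row-group column statistics into per-file min/max.
    mins: dict = {}
    maxs: dict = {}
    for rg in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(rg)
        for col in range(row_group.num_columns):
            column = row_group.column(col)
            stats = column.statistics
            if stats is None or not stats.has_min_max:
                continue
            name = column.path_in_schema
            mins[name] = stats.min if name not in mins else min(mins[name], stats.min)
            maxs[name] = stats.max if name not in maxs else max(maxs[name], stats.max)
    return {"min": mins, "max": maxs}

table = pa.table({"n_legs": [2, 2, 4, 4, 5, 100]})
metadata_collector: list = []
pq.write_to_dataset(table, "/tmp/table_stats", metadata_collector=metadata_collector)
print(file_min_max(metadata_collector[0]))  # {'min': {'n_legs': 2}, 'max': {'n_legs': 100}}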
@Fokko, I understand your concern; I think it's because we have different use cases in mind. If I understand correctly, you want to write a pyarrow.Table to a partitioned dataset with write_dataset. In that case computing min/max on the whole Table is not what you need, because you actually need the min/max for the columns of the individual files. (Just pointing out that with the metadata collector you get the stats for the row chunks, so you'll still have to compute the stats for the file from those.)

I'm coming from a different use case. I would like to write from Ray using something like https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html#ray.data.Dataset.write_parquet . In this case there is no global pyarrow.Table that represents the dataset; pyarrow Tables are the blocks of the dataset that each individual Ray task sees, for example.

I think we have to see if there is a way to have a single API for both use cases or if we'll need different API calls for each. In the second case it would be better to share part of the implementation to ensure that the behavior is consistent, but that could perhaps lead to bad performance.

Regarding efficiency, the pyarrow.compute.min function is implemented in C++, so I think the performance is probably not a huge concern here. But I can try to compare both approaches with a large enough data set to measure it.
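For the comparison mentioned above, a rough micro-benchmark sketch (hedged: the table, sizes, and paths are made up for illustration) that contrasts computing min/max with pyarrow.compute against reading it back from the collected footer metadata:

import time
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pa.table({"n_legs": pa.array(range(10_000_000), type=pa.int64())})

# Approach 1: compute min/max on the in-memory table with the C++ kernel.
start = time.perf_counter()
print("pyarrow.compute:", pc.min_max(table["n_legs"]), time.perf_counter() - start)

# Approach 2: the write has to happen anyway, so read the min/max back from the
# footer metadata that write_to_dataset collects (fold over all row groups for real files).
metadata_collector: list = []
start = time.perf_counter()
pq.write_to_dataset(table, "/tmp/bench_stats", metadata_collector=metadata_collector)
stats = metadata_collector[0].row_group(0).column(0).statistics
print("parquet metadata:", stats.min, stats.max, time.perf_counter() - start)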
Thanks for pointing that out, I completely missed the fact that you can get this data from the ParquetWriter after calling close().
This commit makes sure to test the metadata computation both using `pyarrow.parquet.ParquetWriter` and `pyarrow.parquet.write_to_dataset`.
@Fokko, I've rewritten the implementation to use the pyarrow metadata_collector. I've added test cases to make sure that it's compatible with both `pyarrow.parquet.ParquetWriter` and `pyarrow.parquet.write_to_dataset`.
@maxdebayser can you rebase against latest master?
@Fokko done!
I left some comments to generalize it, so we can also re-use it outside of PyArrow.
For some reason my big comment was collapsed. Since Arrow is column-oriented, I'm sure that it will follow the order of the write schema. The reason I try to avoid using too many internal details from PyArrow is that we support multiple PyArrow versions. We already have the write schema, so we can easily filter out the primitive types:

class PyArrowStatisticsCollector(PreOrderSchemaVisitor[List[StatisticsCollector]]):
    _field_id = 0
    _schema: Schema
    _properties: Properties

    def __init__(self, schema: Schema, properties: Dict[str, str]):
        self._schema = schema
        self._properties = properties

    def schema(self, schema: Schema, struct_result: Callable[[], List[StatisticsCollector]]) -> List[StatisticsCollector]:
        return struct_result()

    def struct(self, struct: StructType, field_results: List[Callable[[], List[StatisticsCollector]]]) -> List[StatisticsCollector]:
        return list(chain(*[result() for result in field_results]))

    def field(self, field: NestedField, field_result: Callable[[], List[StatisticsCollector]]) -> List[StatisticsCollector]:
        self._field_id = field.field_id
        result = field_result()
        return result

    def list(self, list_type: ListType, element_result: Callable[[], List[StatisticsCollector]]) -> List[StatisticsCollector]:
        self._field_id = list_type.element_id
        return element_result()

    def map(self, map_type: MapType, key_result: Callable[[], List[StatisticsCollector]], value_result: Callable[[], List[StatisticsCollector]]) -> List[StatisticsCollector]:
        self._field_id = map_type.key_id
        k = key_result()
        self._field_id = map_type.value_id
        v = value_result()
        return k + v

    def primitive(self, primitive: PrimitiveType) -> List[StatisticsCollector]:
        return [StatisticsCollector(
            field_id=self._field_id,
            iceberg_type=primitive,
            mode=MetricsMode.TRUNC
            # schema
            # self._properties.get(f"write.metadata.metrics.column.{schema.find_column_name(self._field_id)}")
        )]

This way we get a nice list of columns that we need to collect statistics for. We have:

@dataclass(frozen=True)
class StatisticsCollector:
    field_id: int
    iceberg_type: PrimitiveType
    mode: MetricsMode

Where we can use the MetricsMode to decide what to collect for each column. I did a quick test, and it seems to work:

def test_complex_schema(table_schema_nested: Schema):
    tbl = pa.Table.from_pydict(
        {
            "foo": ["a", "b"],
            "bar": [1, 2],
            "baz": [False, True],
            "qux": [["a", "b"], ["c", "d"]],
            "quux": [[("a", (("aa", 1), ("ab", 2)))], [("b", (("ba", 3), ("bb", 4)))]],
            "location": [
                [(52.377956, 4.897070), (4.897070, -122.431297)],
                [(43.618881, -116.215019), (41.881832, -87.623177)],
            ],
            "person": [("Fokko", 33), ("Max", 42)],  # Possible data quality issue
        },
        schema=schema_to_pyarrow(table_schema_nested),
    )

    stats_columns = pre_order_visit(table_schema_nested, PyArrowStatisticsCollector(table_schema_nested, {}))

    visited_paths = []

    def file_visitor(written_file: Any) -> None:
        visited_paths.append(written_file)

    with TemporaryDirectory() as tmpdir:
        pq.write_to_dataset(tbl, tmpdir, file_visitor=file_visitor)

    assert visited_paths[0].metadata.num_columns == len(stats_columns)
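One way to wire the commented-out property lookup above to the mode field is to parse the Iceberg metrics-mode strings ("none", "counts", "truncate(16)", "full"). A hedged sketch; the helper name and return shape are hypothetical, not necessarily what the PR ends up doing:

import re
from enum import Enum
from typing import Optional, Tuple

class MetricsMode(Enum):
    NONE = "none"
    COUNTS = "counts"
    TRUNC = "truncate"
    FULL = "full"

def match_metrics_mode(value: str) -> Tuple[MetricsMode, Optional[int]]:
    # Parse a metrics-mode property value such as 'none', 'counts', 'truncate(16)' or 'full'.
    sanitized = value.strip().lower()
    if sanitized == "none":
        return MetricsMode.NONE, None
    if sanitized == "counts":
        return MetricsMode.COUNTS, None
    if sanitized == "full":
        return MetricsMode.FULL, None
    truncate = re.fullmatch(r"truncate\((\d+)\)", sanitized)
    if truncate:
        return MetricsMode.TRUNC, int(truncate.group(1))
    raise ValueError(f"Unsupported metrics mode: {value}")

# The column-level override would come from the table properties, e.g.:
# match_metrics_mode(properties.get("write.metadata.metrics.column.n_legs", "truncate(16)"))
assert match_metrics_mode("truncate(16)") == (MetricsMode.TRUNC, 16)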
return list(chain(*[result() for result in field_results]))
def field(self, field: NestedField, field_result: Callable[[], List[StatisticsCollector]]) -> List[StatisticsCollector]:
    self._field_id = field.field_id
I see your point, and I'm happy to add those fields. Currently they are not there, so I would suggest doing that in a separate PR to avoid going on a tangent here. Created: #8273
stats_columns = pre_order_visit(schema, PyArrowStatisticsCollector(schema, table_metadata.properties))
if parquet_metadata.num_columns != len(stats_columns): |
I think we should leave this one in. For now I think they are always the same, but when that is not the case we should be notified.
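A hedged sketch of how that guard could look; the exact error message is illustrative:

# If the Parquet footer reports a different number of leaf columns than the
# schema visitor collected, fail loudly rather than silently mis-attribute stats.
if parquet_metadata.num_columns != len(stats_columns):
    raise ValueError(
        f"Number of columns in the Parquet metadata ({parquet_metadata.num_columns}) "
        f"does not match the number of collected statistics columns ({len(stats_columns)})"
    )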
python/tests/io/test_pyarrow.py
import os
import tempfile
from typing import Any, List, Optional
import uuid
from datetime import ( |
We use this for PyArrow, which has the datetime classes in its public API.
python/tests/io/test_pyarrow.py
assert datafile.value_counts[1] == 4
assert datafile.value_counts[2] == 4
assert datafile.value_counts[5] == 10  # 3 lists with 3 items and a None value
assert datafile.value_counts[6] == 5 |
With Spark:
CREATE TABLE nyc.test_map_maps2 AS
SELECT map_from_arrays(array(1.0, 3.0), array('2', '4')) as map, array('a', 'b', 'c') as arr
Schema:
{
"type": "struct",
"schema-id": 0,
"fields": [{
"id": 1,
"name": "map",
"required": false,
"type": {
"type": "map",
"key-id": 3,
"key": "decimal(2, 1)",
"value-id": 4,
"value": "string",
"value-required": false
}
}, {
"id": 2,
"name": "arr",
"required": false,
"type": {
"type": "list",
"element-id": 5,
"element": "string",
"element-required": false
}
}]
}
We don't get any stats in Spark:
{
"status": 1,
"snapshot_id": {
"long": 4895801649705337905
},
"data_file": {
"file_path": "s3://warehouse/nyc/test_map_maps/data/00000-95-750d8f3e-8d49-44ec-b37e-9e101e003a5d-00001.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 1,
"file_size_in_bytes": 1438,
"block_size_in_bytes": 67108864,
"column_sizes": {
"array": [{
"key": 3,
"value": 57
}, {
"key": 4,
"value": 58
}, {
"key": 5,
"value": 61
}]
},
"value_counts": {
"array": [{
"key": 3,
"value": 2
}, {
"key": 4,
"value": 2
}, {
"key": 5,
"value": 3
}]
},
"null_value_counts": {
"array": [{
"key": 3,
"value": 0
}, {
"key": 4,
"value": 0
}, {
"key": 5,
"value": 0
}]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": []
},
"upper_bounds": {
"array": []
},
"key_metadata": null,
"split_offsets": {
"array": [4]
},
"sort_order_id": {
"int": 0
}
}
}
The stats are only computed for the primitive types (see PyArrowStatisticsCollector).
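To make that concrete, here is a hedged sketch against the map/array schema above, assuming pyiceberg-style type classes and that the PyArrowStatisticsCollector from the earlier review comment is in scope (constructor details may differ slightly between versions): only the primitive leaves, field IDs 3, 4 and 5, end up in the list of columns to collect metrics for.

from pyiceberg.schema import Schema, pre_order_visit
from pyiceberg.types import DecimalType, ListType, MapType, NestedField, StringType

# Same shape as the Spark table above: map<decimal(2,1), string> and array<string>.
spark_like_schema = Schema(
    NestedField(1, "map", MapType(key_id=3, key_type=DecimalType(2, 1), value_id=4, value_type=StringType(), value_required=False), required=False),
    NestedField(2, "arr", ListType(element_id=5, element_type=StringType(), element_required=False), required=False),
    schema_id=0,
)

stats_columns = pre_order_visit(spark_like_schema, PyArrowStatisticsCollector(spark_like_schema, {}))

# Only the primitive leaves are visited: map key (3), map value (4), list element (5);
# the map and list fields themselves (ids 1 and 2) get no entry.
assert [c.field_id for c in stats_columns] == [3, 4, 5]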
python/tests/io/test_pyarrow.py
assert len(datafile.null_value_counts) == 5
assert datafile.null_value_counts[1] == 1
assert datafile.null_value_counts[2] == 0
assert datafile.null_value_counts[5] == 1 |
🤔 Type:
{"type": "list", "element-id": 5, "element": "long", "element-required": False}
And the data:
_list = [[1, 2, 3], [4, 5, 6], None, [7, 8, 9]]
I count a single null
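A quick, hedged way to check where that single null lands in the Parquet statistics is to write the same data and inspect the leaf column chunk (the printed values depend on the Arrow version and on how the None list is encoded):

import pyarrow as pa
import pyarrow.parquet as pq

# Three populated lists and one None list, as in the test data.
table = pa.table({"list": pa.array([[1, 2, 3], [4, 5, 6], None, [7, 8, 9]], type=pa.list_(pa.int64()))})
pq.write_table(table, "/tmp/list_nulls.parquet")

leaf = pq.ParquetFile("/tmp/list_nulls.parquet").metadata.row_group(0).column(0)
print(leaf.path_in_schema)         # the single leaf column of the list type
print(leaf.statistics.null_count)  # how the None list is counted at the leaf level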
python/tests/io/test_pyarrow.py
# #arrow does not include this in the statistics
# assert len(datafile.nan_value_counts) == 3
# assert datafile.nan_value_counts[1] == 0 |
I was just wondering: the evaluator does not take those into consideration for any type that is not float or double, so is there any reason to still add them?
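If we ever do want nan_value_counts for float and double columns, they would have to be computed from the table itself, since Arrow's Parquet statistics don't carry them. A hedged sketch with a hypothetical helper:

import pyarrow as pa
import pyarrow.compute as pc

def nan_counts(table: pa.Table) -> dict:
    # Hypothetical helper: count NaNs per floating-point column; other types are skipped.
    counts = {}
    for name in table.column_names:
        column = table[name]
        if pa.types.is_floating(column.type):
            counts[name] = pc.sum(pc.cast(pc.is_nan(column), pa.int64())).as_py() or 0
    return counts

print(nan_counts(pa.table({"x": [1.0, float("nan"), 3.0], "y": ["a", "b", "c"]})))  # {'x': 1}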
This commit adds a visitor to compute a mapping from parquet column path to iceberg field ID.
@maxdebayser Thanks for working on this, and for your patience; it took quite a while and we also went a bit back and forth. Let's merge this in. As a next step, we can revisit #8012 and add a lot of integration tests to make sure that we don't have any correctness issues. I think there are still some open ends on the statistics, but they are not limited to Python. Please refer to #8598.
Thanks @rdblue for the review, that was very helpful.
@Fokko
This commit partly addresses issue #7256. Unfortunately the pyarrow library is not as flexible as we would like. When passing write_statistics=True to pyarrow.parquet.write_table, the statistics are written out for each row group in the file instead of being computed globally.

In the issue a "metadata_collector" was mentioned, which I assume is the parameter of the pyarrow.parquet.write_metadata function. The pyarrow.parquet.write_table function has no such parameter.

The function in this PR intentionally works at the level of individual parquet files instead of the dataset, to support scenarios such as writing from Ray where each file of the dataset is written by a different task.
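Putting the thread together, here is a hedged sketch of the per-file flow described above (write_file_with_stats is a hypothetical name, not the PR's actual API): each task writes one Parquet file with ParquetWriter and derives the file's statistics from the footer metadata rather than from the data itself.

import pyarrow as pa
import pyarrow.parquet as pq

def write_file_with_stats(table: pa.Table, path: str) -> pq.FileMetaData:
    # Hypothetical helper: write a single Parquet file and return its footer metadata.
    writer = pq.ParquetWriter(path, table.schema)
    writer.write_table(table)
    writer.close()
    # As shown earlier in the thread, the metadata is available on the writer after close().
    return writer.writer.metadata

metadata = write_file_with_stats(pa.table({"n_legs": [2, 2, 4, 4, 5, 100]}), "/tmp/one_task.parquet")
stats = metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)  # 2 100 0 (fold over all row groups for larger files)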