From 944665978243a4100d6d9666e39abf98e99f3a93 Mon Sep 17 00:00:00 2001
From: Murphy <96611012+murphyatwork@users.noreply.github.com>
Date: Fri, 24 Jan 2025 14:59:39 +0800
Subject: [PATCH] [BugFix] fix misuse of statistic_max_full_collect_data_size
 (#55381)

Signed-off-by: Murphy <mofei@starrocks.com>
---
 docs/en/administration/management/FE_configuration.md         | 2 +-
 docs/en/using_starrocks/Cost_based_optimizer.md               | 4 ++--
 docs/zh/administration/management/FE_configuration.md         | 2 +-
 docs/zh/using_starrocks/Cost_based_optimizer.md               | 4 ++--
 .../com/starrocks/statistic/StatisticsCollectJobFactory.java  | 3 ++-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/docs/en/administration/management/FE_configuration.md b/docs/en/administration/management/FE_configuration.md
index b45647ca529a1..f690ad4b0389f 100644
--- a/docs/en/administration/management/FE_configuration.md
+++ b/docs/en/administration/management/FE_configuration.md
@@ -1758,7 +1758,7 @@ ADMIN SET FRONTEND CONFIG ("key" = "value");
 - Type: Long
 - Unit: bytes
 - Is mutable: Yes
-- Description: The size of the largest partition for the automatic collection of statistics. If a partition exceeds this value, then sampled collection is performed instead of full.
+- Description: The data size threshold for the automatic collection of statistics. If the total size exceeds this value, then sampled collection is performed instead of full.
 - Introduced in: -
 
 ##### statistic_collect_max_row_count_per_query
diff --git a/docs/en/using_starrocks/Cost_based_optimizer.md b/docs/en/using_starrocks/Cost_based_optimizer.md
index 18bcad7e6774c..510f4df793457 100644
--- a/docs/en/using_starrocks/Cost_based_optimizer.md
+++ b/docs/en/using_starrocks/Cost_based_optimizer.md
@@ -126,7 +126,7 @@ In addition, StarRocks allows you to configure collection policies based on tabl
 
   - When the collection interval is met and the statistics health is higher than the threshold for automatic sampled collection (`statistic_auto_collect_sample_threshold`) and lower than the automatic collection threshold (`statistic_auto_collect_ratio`), full collection is triggered.
 
-  - When the size of the largest partition to collect data (`statistic_max_full_collect_data_size`) is greater than 100 GB, sampled collection is triggered.
+  - When the size of partitions to collect data (`statistic_max_full_collect_data_size`) is greater than 100 GB, sampled collection is triggered.
 
   - Only statistics of partitions whose update time is later than the time of the previous collection task are collected. Statistics of partitions with no data change are not collected.
 
@@ -151,7 +151,7 @@ The following table describes the default settings. If you need to modify them,
 | statistic_auto_collect_large_table_interval | LONG    | 43200        | The interval for automatically collecting full statistics of large tables. Unit: seconds. Default value: 43200 (12 hours).                               |
 | statistic_auto_collect_ratio          | FLOAT    | 0.8               | The threshold for determining  whether the statistics for automatic collection are healthy. If statistics health is below this threshold, automatic collection is triggered. |
 | statistic_auto_collect_sample_threshold  | DOUBLE | 0.3   | The statistics health threshold for triggering automatic sampled collection. If the health value of statistics is lower than this threshold, automatic sampled collection is triggered. |
-| statistic_max_full_collect_data_size | LONG      | 107374182400      | The size of the largest partition for automatic collection to collect data. Unit: Byte. Default value: 107374182400 (100 GB). If a partition exceeds this value, full collection is discarded and sampled collection is performed instead. |
+| statistic_max_full_collect_data_size | LONG      | 107374182400      | The data size of the partitions for automatic collection to collect data. Unit: Byte. Default value: 107374182400 (100 GB). If the data size exceeds this value, full collection is discarded and sampled collection is performed instead. |
 | statistic_full_collect_buffer | LONG | 20971520 | The maximum buffer size taken by automatic collection tasks. Unit: Byte. Default value: 20971520 (20 MB). |
 | statistic_collect_max_row_count_per_query | INT  | 5000000000        | The maximum number of rows to query for a single analyze task. An analyze task will be split into multiple queries if this value is exceeded. |
 | statistic_collect_too_many_version_sleep | LONG | 600000 | The sleep time of automatic collection tasks if the table on which the collection task runs has too many data versions. Unit: ms. Default value: 600000 (10 minutes).  |
diff --git a/docs/zh/administration/management/FE_configuration.md b/docs/zh/administration/management/FE_configuration.md
index 50d5ef8a42e74..37fc41f98456e 100644
--- a/docs/zh/administration/management/FE_configuration.md
+++ b/docs/zh/administration/management/FE_configuration.md
@@ -1750,7 +1750,7 @@ ADMIN SET FRONTEND CONFIG ("key" = "value");
 - 类型：Long
 - 单位：bytes
 - 是否动态：是
-- 描述：自动统计信息采集的最大分区大小。如果超过该值，则放弃全量采集，转为对该表进行抽样采集。
+- 描述：自动统计信息采集的单次任务最大数据量。如果超过该值，则放弃全量采集，转为对该表进行抽样采集。
 - 引入版本：-
 
 ##### statistic_collect_max_row_count_per_query
diff --git a/docs/zh/using_starrocks/Cost_based_optimizer.md b/docs/zh/using_starrocks/Cost_based_optimizer.md
index 5f1b0de5c66d7..f0ab84d19ef1b 100644
--- a/docs/zh/using_starrocks/Cost_based_optimizer.md
+++ b/docs/zh/using_starrocks/Cost_based_optimizer.md
@@ -126,7 +126,7 @@ StarRocks 提供灵活的信息采集方式，您可以根据业务场景选择
 
   - 满足采集间隔的条件下，健康度高于抽样采集阈值，低于采集阈值时，触发全量采集，通过 `statistic_auto_collect_ratio` 配置。
 
-  - 当采集的最大分区大小大于 100G 时，触发抽样采集，通过 `statistic_max_full_collect_data_size` 配置。
+  - 当采集的分区数据总量大于 100G 时，触发抽样采集，通过 `statistic_max_full_collect_data_size` 配置。
 
   - 采集任务只会对分区更新时间晚于上次采集任务时间的分区进行采集，未发生修改的分区不进行采集。
 
@@ -152,7 +152,7 @@ StarRocks 提供灵活的信息采集方式，您可以根据业务场景选择
 | statistic_auto_collect_large_table_interval | LONG    | 43200        | 自动全量采集任务的大表采集间隔，默认 12 小时，单位：秒。                               |
 | statistic_auto_collect_ratio                | DOUBLE  | 0.8          | 触发自动统计信息收集的健康度阈值。如果统计信息的健康度小于该阈值，则触发自动采集。           |
 | statistic_auto_collect_sample_threshold     | DOUBLE  | 0.3          | 触发自动统计信息抽样收集的健康度阈值。如果统计信息的健康度小于该阈值，则触发自动抽样采集。       |
-| statistic_max_full_collect_data_size        | LONG    | 107374182400 | 自动统计信息采集的最大分区大小，默认 100 GB。单位：Byte。如果超过该值，则放弃全量采集，转为对该表进行抽样采集。 |
+| statistic_max_full_collect_data_size        | LONG    | 107374182400 | 自动统计信息采集的单次任务最大数据量，默认 100 GB。单位：Byte。如果超过该值，则放弃全量采集，转为对该表进行抽样采集。 |
 | statistic_full_collect_buffer               | LONG    | 20971520     | 自动全量采集任务写入的缓存大小，单位：Byte。默认值：20971520（20 MB）。                              |
 | statistic_collect_max_row_count_per_query   | LONG    | 5000000000   | 统计信息采集单次最多查询的数据行数。统计信息任务会按照该配置自动拆分为多次任务执行。 |
 | statistic_collect_too_many_version_sleep    | LONG    | 600000       | 当统计信息表的写入版本过多时 (Too many tablet 异常)，自动采集任务的休眠时间。单位：毫秒。默认值：600000（10 分钟）。 |
diff --git a/fe/fe-core/src/main/java/com/starrocks/statistic/StatisticsCollectJobFactory.java b/fe/fe-core/src/main/java/com/starrocks/statistic/StatisticsCollectJobFactory.java
index 741b1211309f0..4be5babd47c85 100644
--- a/fe/fe-core/src/main/java/com/starrocks/statistic/StatisticsCollectJobFactory.java
+++ b/fe/fe-core/src/main/java/com/starrocks/statistic/StatisticsCollectJobFactory.java
@@ -582,7 +582,8 @@ private static void createFullStatsJob(List<StatisticsCollectJob> allTableJobMap
             }
         }
 
-        if (partitionList.stream().anyMatch(p -> p.getDataSize() > Config.statistic_max_full_collect_data_size)) {
+        long totalDataSize = partitionList.stream().mapToLong(Partition::getDataSize).sum();
+        if (totalDataSize > Config.statistic_max_full_collect_data_size) {
             analyzeType = StatsConstants.AnalyzeType.SAMPLE;
             LOG.debug("statistics job choose sample on table: {}, partition data size greater than config: {}",
                     table.getName(), Config.statistic_max_full_collect_data_size);