Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

row count estimation on DATETIME type is over-estimated when time span across years/months #50080

Open
pcqz opened this issue Jan 4, 2024 · 9 comments · May be fixed by #58661
Open

row count estimation on DATETIME type is over-estimated when time span across years/months #50080

pcqz opened this issue Jan 4, 2024 · 9 comments · May be fixed by #58661
Assignees
Labels
found/gs found by gs sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.

Comments

@pcqz
Copy link

pcqz commented Jan 4, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

test.zip

use test;
source test.sql;
load stats 'test.json';
explain select * from test where updated_date > '2023-12-31 23:40:00' and updated_date<'2023-12-31 23:50:00';
explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';

2. What did you expect to see? (Required)

mysql> explain select * from test where updated_date > '2023-12-31 23:40:00' and updated_date<'2023-12-31 23:50:00';
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| id                            | estRows | task      | access object                                     | operator info                                                     |
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| IndexLookUp_7                 | 1382.13 | root      |                                                   |                                                                   |
| ├─IndexRangeScan_5(Build)     | 1382.13 | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-12-31 23:40:00,2023-12-31 23:50:00), keep order:false |
| └─TableRowIDScan_6(Probe)     | 1382.13 | cop[tikv] | table:test                                        | keep order:false                                                  |
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
3 rows in set (0.00 sec)

3. What did you see instead (Required)

mysql> explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';
+-------------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                            | estRows    | task      | access object                                     | operator info                                                                                                                                                                      |
+-------------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| IndexLookUp_10                | 9355819.67 | root      |                                                   |                                                                                                                                                                                    |
| ├─IndexRangeScan_8(Build)     | 9355819.67 | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-12-31 23:50:00,2024-01-01 00:00:00), keep order:false, stats:partial[ix_tpcnr_bcn:allEvicted, ix_tpcnr_epn:allEvicted, ix_tpcnr_pan:allEvicted...(more: 1 allEvicted)] |
| └─TableRowIDScan_9(Probe)     | 9355819.67 | cop[tikv] | table:test                                        | keep order:false, stats:partial[ix_tpcnr_bcn:allEvicted, ix_tpcnr_epn:allEvicted, ix_tpcnr_pan:allEvicted...(more: 1 allEvicted)]                                                  |
+-------------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.01 sec)

4. What is your TiDB version? (Required)

v6.5.3

@pcqz pcqz added the type/bug The issue is confirmed as a bug. label Jan 4, 2024
@pcqz pcqz changed the title out-of-range estimation when time span across years out-of-range estimation is over-estimated when time span across years Jan 4, 2024
@pcqz
Copy link
Author

pcqz commented Jan 4, 2024

/found gs

@ti-chi-bot ti-chi-bot bot added the found/gs found by gs label Jan 4, 2024
@jebter jebter added the affects-6.5 This bug affects the 6.5.x(LTS) versions. label Jan 5, 2024
@pcqz
Copy link
Author

pcqz commented Jan 10, 2024

Seems the histogram estimates the range size based on the converted values of type bytes/string, and the time type is not taken into account. Even if there are no out-of-range conditions, when the time span is in the range between different years/months/days, the over-estimated rows can still be seen, like this:

mysql> explain select * from test where updated_date > '2023-08-01 00:00:00' and updated_date<'2023-08-01 00:00:01';
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| id                            | estRows | task      | access object                                     | operator info                                                     |
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| IndexLookUp_10                | 1.50    | root      |                                                   |                                                                   |
| ├─IndexRangeScan_8(Build)     | 1.50    | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-08-01 00:00:00,2023-08-01 00:00:01), keep order:false |
| └─TableRowIDScan_9(Probe)     | 1.50    | cop[tikv] | table:test                                        | keep order:false                                                  |
+-------------------------------+---------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
3 rows in set (0.00 sec)

mysql> explain select * from test where updated_date > '2023-07-31 23:59:59' and updated_date<'2023-08-01 00:00:00';
+-------------------------------+-----------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| id                            | estRows   | task      | access object                                     | operator info                                                     |
+-------------------------------+-----------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
| IndexLookUp_7                 | 246855.44 | root      |                                                   |                                                                   |
| ├─IndexRangeScan_5(Build)     | 246855.44 | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-07-31 23:59:59,2023-08-01 00:00:00), keep order:false |
| └─TableRowIDScan_6(Probe)     | 246855.44 | cop[tikv] | table:test                                        | keep order:false                                                  |
+-------------------------------+-----------+-----------+---------------------------------------------------+-------------------------------------------------------------------+
3 rows in set (0.00 sec)

@time-and-fate time-and-fate added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. severity/major may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 affects-6.5 This bug affects the 6.5.x(LTS) versions. may-affects-7.1 may-affects-7.5 labels Jan 24, 2024
@hawkingrei hawkingrei self-assigned this Dec 31, 2024
@hawkingrei
Copy link
Member

It is still in the master (c44e991).

@hawkingrei hawkingrei linked a pull request Jan 2, 2025 that will close this issue
13 tasks
@winoros
Copy link
Member

winoros commented Jan 4, 2025

The DATETIME type is finally dealed as uint.
But, it's not in its microseconds representation.
Image

When the year/month jumps, the difference between the uint representation is not 1. It's a very big value. In other words, datetime(2024-01-01 00:00:00.000000).UintRepresentation() - datetime(2023-12-31 23:59:59.000000).UintRepresentation() is a very big value.
This caused the wrong estimation.

@winoros
Copy link
Member

winoros commented Jan 4, 2025

To solve the issue completely, a analyze version v3 which can correctly handle this issue in all possible cases may be needed.

@winoros winoros changed the title out-of-range estimation is over-estimated when time span across years row count estimation on DATETIME type is over-estimated when time span across years/months Jan 4, 2025
@winoros
Copy link
Member

winoros commented Jan 4, 2025

Also, it's not only about the out-of-range estimation. The in-bucket estimation suffers from the problem too. But the over-estimation will not be as large as the out-of-range just because the upper bound of the estimation is the row count of that bucket.

@winoros
Copy link
Member

winoros commented Jan 4, 2025

After a glance, I think this uint representation is to deal with the possible invalid DateTime that MySQL allows to insert.
Image

@hawkingrei
Copy link
Member

Another issue is that under this problem, the estimation error between non-indexed datetime and indexed datetime is too large. The current solution is to first reduce this part of the error.

@hawkingrei
Copy link
Member

> explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';

+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                        | estRows    | task      | access object                                     | operator info                                                                                                                                                                                  |
+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| IndexLookUp_10            | 9768858.44 | root      |                                                   |                                                                                                                                                                                                |
| ├─IndexRangeScan_8(Build) | 9768858.44 | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-12-31 23:50:00,2024-01-01 00:00:00), keep order:false, stats:partial[ix_tpcnr_bcn:unInitialized, ix_tpcnr_epn:unInitialized, ix_tpcnr_pan:unInitialized...(more: 1 unInitialized)] |
| └─TableRowIDScan_9(Probe) | 9768858.44 | cop[tikv] | table:test                                        | keep order:false, stats:partial[ix_tpcnr_bcn:unInitialized, ix_tpcnr_epn:unInitialized, ix_tpcnr_pan:unInitialized...(more: 1 unInitialized)]                                                  |
+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+




> drop index IX_CUST_RELA_DATE on test;

> explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';

+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+
| id                  | estRows      | task      | access object | operator info                                                                                                  |
+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+
| TableReader_7       | 2333.21      | root      |               | data:Selection_6                                                                                               |
| └─Selection_6       | 2333.21      | cop[tikv] |               | gt(test.test.updated_date, 2023-12-31 23:50:00.000000), lt(test.test.updated_date, 2024-01-01 00:00:00.000000) |
|   └─TableFullScan_5 | 859718933.00 | cop[tikv] | table:test    | keep order:false                                                                                               |
+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
found/gs found by gs sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants