Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Continuous aggregate (perhaps only hierarchical ones are affected) #7524

Open
salisbury-espinosa opened this issue Dec 8, 2024 · 7 comments

Comments

@salisbury-espinosa
Copy link

salisbury-espinosa commented Dec 8, 2024

What type of bug is this?

Data corruption

What subsystems and features are affected?

Continuous aggregate

What happened?

sometimes an continuous aggregate built on top of another aggregate (hierarchical) does not work.
I found it by directly checking with SQL queries (directly compared the data in the aggregate that was lying after a while and what should be obtained using the query for which it was created).
I did not catch this on top of simple hypertables - only hierarchical.
I can't reproduce it again ...and this only creates more anxiety and misunderstanding.

CALL refresh_continuous_aggregate didn't help - NOTICE: continuous aggregate is already up-to-date...
Only manual insertion of data into continuous_aggs_materialization_invalidation_log works for me....

DELETE from  _timescaledb_catalog.continuous_aggs_materialization_invalidation_log
    WHERE materialization_id=52;

INSERT INTO _timescaledb_catalog.continuous_aggs_materialization_invalidation_log
SELECT
	52,
	EXTRACT(EPOCH FROM '2022-12-01'::TIMESTAMPTZ)::BIGINT * 1000000,
	EXTRACT(EPOCH FROM '2024-12-10'::TIMESTAMPTZ)::BIGINT * 1000000

and when I did this, the continuous aggregate started working again, but before that it DID NOT WORK AT ALL - as if it had stopped.
and on some continuous aggregate there were simply holes - inconsistency.
that is, filling started and stopped.

timescaledb_information.job_stats contains 'Success' for all jobs...
job_errors - is empty

# select * from timescaledb_information.job_errors;
 job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message 
--------+-------------+-----------+-----+------------+-------------+------------+-------------
(0 rows)

Example agg:

CREATE MATERIALIZED VIEW "aggsTest"
        WITH (
            timescaledb.continuous,
            timescaledb.create_group_indexes=false,
            timescaledb.materialized_only = true
        ) AS
        SELECT
            time_bucket('1 day', "bucketDaily") AS "testBucketDaily",
            "sourceId",
            "expiresAt",
            SUM ("amount") AS "amount",
            SUM("cnt") AS "cnt"
        FROM
            "aggsTestDay"
        WHERE
            "expiresAt" <= CURRENT_TIMESTAMP
        GROUP BY
            "testBucketDaily",
            "sourceId",
            "expiresAt"
        WITH NO DATA;

SELECT add_continuous_aggregate_policy(
            '"aggsTest"',
            start_offset => INTERVAL '360 day',
            end_offset => NULL,
            schedule_interval => INTERVAL '1 min');

aggsTestDay - another continuous aggregate.

when I recreate a duplicate of this aggsTest, all the data in it appears without holes and immediately after call refresh.

version of timescale - 2.17.0, PostgreSQL 16.4

TimescaleDB version affected

2.17.0

PostgreSQL version used

16.4

What operating system did you use?

Ubuntu 24.04.1 LTS

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

i dont know((
@kpan2034
Copy link
Contributor

kpan2034 commented Dec 9, 2024

@salisbury-espinosa it's difficult to investigate this without being able to reproduce the bug. Can you provide more details if it happens again?

Also, Timescale 2.7.0 is pretty old and doesn't officially support PG 16. Did you mean version 2.17.0 perhaps? If not, we recommend upgrading to the latest version.

@salisbury-espinosa
Copy link
Author

salisbury-espinosa commented Dec 10, 2024

. Did you mean version 2.17.0 perhaps?

yes! my mistake (I typed the text by hand - I didn't copy it - that's where the error comes from)

and 2.17.1/2.17.2 now - the same behavior ( I observe it on different instances specifically with different versions)

SELECT extversion
FROM pg_extension
where extname = 'timescaledb';
 extversion 
------------
 2.17.1
(1 row)

it's difficult to investigate this without being able to reproduce the bug

and it only reproduces after a day or two of work. I can't catch it from a clean setup.
even after manual setup of time in the system everything works stably, and then after a day or two i see holes in the data and no refresh_continuous_aggregate helps - only either completely re-creating the aggregate or manual invalidation ( continuous_aggs_materialization_invalidation_log).
because of this phantom bug I even wrote a script for checking.
I will observe how often this happens now.

with manual setting of the date forward to dial units - everything works as expected normally (today 2024-12-10):

sudo timedatectl set-time '2024-12-20 00:00:01'

everything starts after a day or two, as if something is being cached or something like that. it is unclear and apparently I will have to delve into the sources ( https://github.com/timescale/timescaledb/tree/main/tsl/src/continuous_aggs )

the error occurs after a few days in the hierarchical aggregate where I use here:

WHERE
            "expiresAt" <= CURRENT_TIMESTAMP

@salisbury-espinosa
Copy link
Author

no refresh_continuous_aggregate helps - only either completely re-creating the aggregate or manual invalidation ( continuous_aggs_materialization_invalidation_log).

by the way, I'll just note that this PR https://github.com/timescale/timescaledb/pull/7521/files will help to refresh the data, apparently.
of course, this does not fix this issue, but just as a note.

@salisbury-espinosa
Copy link
Author

salisbury-espinosa commented Dec 12, 2024

regarding the script - which I check the correctness of the data (
I simply calculate sum(cnt) in the cagg and sum(cnt) in sql which creates this cagg:

  1. SELECT SUM("cnt") AS "cnt" FROM "aggsTestDay" WHERE "expiresAt" <= CURRENT_TIMESTAMP
  2. SELECT SUM("cnt") AS "cnt" FROM "aggsTest"
    )
    I have more than 50 aggregates in the database, half of them are hierarchical and I see this phantom case only in 2 aggregates and they all use it in their body condition: WHERE "expiresAt" <= CURRENT_TIMESTAMP

@kpan2034
Copy link
Contributor

Were you able to observe if this issue occurred again?

@salisbury-espinosa
Copy link
Author

Were you able to observe if this issue occurred again?

Yes! (

@salisbury-espinosa
Copy link
Author

reproduces stably after deleting the CAGG and creating it again - after 2-3 days of uptime...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants