@peterylh
Contributor

Profile Archive Feature

Summary

Add an automatic profile archiving feature that preserves query profiles beyond the current memory and disk limits. When profile storage reaches capacity, old profiles are archived into compressed ZIP files instead of being deleted, so they stay available for troubleshooting and analysis for a configurable retention period (7 days by default).

Problem

Currently, Doris has strict limits on profile storage:

  • Memory profiles: Maximum 500 profiles (max_query_profile_num)
  • Spilled profiles: Maximum 500 profiles (max_spilled_profile_num)
  • Storage size: Maximum 1GB (spilled_profile_storage_limit_bytes)

When these limits are exceeded, old profiles are permanently deleted, making it impossible to analyze historical slow queries beyond the retention window. This is problematic for:

  • Post-incident analysis of production issues
  • Long-term performance trend analysis
  • Debugging intermittent query problems

Solution

Implement an automatic profile archiving system that:

  1. Moves outdated profiles to an archive directory instead of deleting them
  2. Batches profiles into compressed ZIP files for efficient storage
  3. Retains profiles for a configurable period (default 7 days)
  4. Provides predictable file naming for easy profile location

Key Features

  • Pending Buffer Strategy: Profiles are staged in archive/pending/ before archiving, so archives are written as full batches rather than as many small ZIP files
  • Dual Trigger Mechanism:
    • Archive when the number of pending profiles reaches the configured batch size (default 100 profiles)
    • Archive when the oldest pending file exceeds the pending timeout (default 24 hours), even if the batch is not full
  • Automatic Cleanup: Removes archives older than the configurable retention period (default 7 days)
  • Graceful Degradation: Falls back to direct deletion if archiving fails
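
The dual trigger described above can be sketched as a single predicate. This is a minimal illustration, not the code in ProfileArchiveManager.java: the class name ArchiveTrigger and the method shouldArchive are hypothetical, and the comparison operators (>= for the batch size, > for the timeout) are read from the wording "reaches" and "exceeds" rather than taken from the implementation.

```java
/** Hypothetical helper illustrating the dual trigger; not the PR's actual class. */
final class ArchiveTrigger {
    private ArchiveTrigger() {}

    /**
     * @param pendingCount            number of files currently in archive/pending/
     * @param oldestPendingAgeSeconds age of the oldest pending file, in seconds
     * @param batchSize               mirrors profile_archive_batch_size (default 100)
     * @param pendingTimeoutSeconds   mirrors profile_archive_pending_timeout_seconds (default 86400)
     */
    static boolean shouldArchive(int pendingCount, long oldestPendingAgeSeconds,
                                 int batchSize, long pendingTimeoutSeconds) {
        if (pendingCount <= 0) {
            return false; // nothing staged, nothing to archive
        }
        boolean batchFull = pendingCount >= batchSize;                      // trigger 1: batch size reached
        boolean timedOut = oldestPendingAgeSeconds > pendingTimeoutSeconds; // trigger 2: oldest file too old
        return batchFull || timedOut;
    }
}
```

With the defaults, 100 pending files trigger an archive immediately, while 3 pending files are archived only once the oldest of them is more than 24 hours old.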

Implementation Details

Directory Structure

${LOG_DIR}/profile/
├── {timestamp}_{queryid}.zip          # Active spilled profiles
└── archive/                           # Archive root
    ├── pending/                       # Staging area for batching
    │   └── {timestamp}_{queryid}.zip
    └── profiles_20240101_000000_20240101_235959.zip  # Archived batches

Archive File Naming

Archive ZIPs follow the naming pattern: profiles_{start_timestamp}_{end_timestamp}.zip

  • start_timestamp: Earliest profile in the batch (YYYYMMDD_HHMMSS)
  • end_timestamp: Latest profile in the batch (YYYYMMDD_HHMMSS)

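Because the covered time range is encoded in the file name, the archive that contains a given profile can be identified from the query time alone, without opening any ZIP file.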
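
A minimal sketch of the naming scheme, assuming the timestamps are formatted exactly as YYYYMMDD_HHMMSS in local time. The class ArchiveNames and both methods are hypothetical names used for illustration; the real logic lives in ProfileArchiveManager.java and may differ, for example in time zone handling.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for the profiles_{start}_{end}.zip pattern. */
final class ArchiveNames {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");
    private static final Pattern NAME =
            Pattern.compile("profiles_(\\d{8}_\\d{6})_(\\d{8}_\\d{6})\\.zip");

    private ArchiveNames() {}

    /** Builds the archive name from the earliest and latest profile time in a batch. */
    static String archiveName(LocalDateTime earliest, LocalDateTime latest) {
        return "profiles_" + earliest.format(TS) + "_" + latest.format(TS) + ".zip";
    }

    /** Returns true if the name is a valid archive name whose range includes profileTime. */
    static boolean covers(String archiveName, LocalDateTime profileTime) {
        Matcher m = NAME.matcher(archiveName);
        if (!m.matches()) {
            return false;
        }
        try {
            LocalDateTime start = LocalDateTime.parse(m.group(1), TS);
            LocalDateTime end = LocalDateTime.parse(m.group(2), TS);
            return !profileTime.isBefore(start) && !profileTime.isAfter(end);
        } catch (DateTimeParseException e) {
            return false; // digits that do not form a real date/time
        }
    }
}
```

For example, a profile written at 2024-01-01 12:00:00 falls inside profiles_20240101_000000_20240101_235959.zip, so only that archive needs to be opened.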

Workflow

Query Profile Creation
    ↓
Memory Storage (max 500)
    ↓
Spilled to Disk (when memory full)
    ↓
Periodic Cleanup (every 1s)
    ↓
Move to archive/pending/ (when limits exceeded)
    ↓
Archive to ZIP (batch size reached OR timeout exceeded)
    ↓
Delete Pending Files
    ↓
Cleanup Old Archives (every 24h, default retention 7 days)
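
The "Archive to ZIP" and "Delete Pending Files" steps above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: PendingBatchArchiver and archiveBatch are hypothetical names, and writing to a temporary file before renaming is a design choice of this example that the PR description does not mention.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch of batching pending profile files into one archive ZIP. */
final class PendingBatchArchiver {
    private PendingBatchArchiver() {}

    /** Writes every pending file into archiveZip, then deletes the pending files. */
    static void archiveBatch(List<Path> pendingFiles, Path archiveZip) throws IOException {
        // Write to a sibling temp file first so a crash never leaves a half-written archive
        // under the final name.
        Path tmp = archiveZip.resolveSibling(archiveZip.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : pendingFiles) {
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
        Files.move(tmp, archiveZip, StandardCopyOption.REPLACE_EXISTING);

        // Pending files are removed only after the archive is complete.
        for (Path file : pendingFiles) {
            Files.deleteIfExists(file);
        }
    }
}
```

If any step before the deletion loop throws, the pending files are still on disk, which leaves the caller free to retry or to fall back to direct deletion as described under Graceful Degradation.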

Code Changes

New Files:

  • ProfileArchiveManager.java (682 lines) - Core archiving logic

Modified Files:

  • Config.java (+35 lines) - Configuration parameters
  • ProfileManager.java (+157/-17 lines) - Integration with archive system

Test Files:

  • ProfileArchiveManagerTest.java (+1111 lines) - 26 comprehensive test cases
  • ProfileManagerTest.java (+227 lines) - Integration tests

Configuration

All parameters have sensible defaults and can be tuned via FE configuration:

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_profile_archive | boolean | true | Enable/disable profile archiving |
| profile_archive_batch_size | int | 100 | Number of profiles per ZIP file |
| profile_archive_path | String | "" | Custom archive path (empty = use default ${spilled_profile_storage_path}/archive) |
| profile_archive_retention_seconds | int | 604800 | Archive retention period in seconds (7 days). Set to -1 for unlimited retention, 0 to disable archiving |
| profile_archive_pending_timeout_seconds | int | 86400 | Maximum wait time for pending files in seconds (24 hours). Force archive even if batch is not full |
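
The special values of profile_archive_retention_seconds can be summarized in a small sketch. RetentionPolicy and its methods are hypothetical names, not part of the PR. Two points are assumptions of this example rather than documented behavior: any negative value is treated like -1, and existing archives are left alone when the value is 0.

```java
/** Hypothetical sketch of the retention semantics from the configuration table. */
final class RetentionPolicy {
    private RetentionPolicy() {}

    /** Archiving runs only if the switch is on and retention is not 0. */
    static boolean archivingEnabled(boolean enableProfileArchive, long retentionSeconds) {
        return enableProfileArchive && retentionSeconds != 0;
    }

    /** Decides whether an archive of the given age should be removed by the cleanup pass. */
    static boolean isExpired(long archiveAgeSeconds, long retentionSeconds) {
        if (retentionSeconds <= 0) {
            // -1 means unlimited retention; treating other negatives and 0 as
            // "never expire" is an assumption of this sketch.
            return false;
        }
        return archiveAgeSeconds > retentionSeconds;
    }
}
```

With the default of 604800 seconds, an archive becomes eligible for cleanup once it is more than 7 days old.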

Configuration Examples

# Increase batch size for larger archives (reduces file count)
profile_archive_batch_size = 1000

# Keep archives for 30 days
profile_archive_retention_seconds = 2592000

# Use custom archive path (e.g., mounted network storage)
profile_archive_path = /mnt/nfs/doris-profiles/archive

# Force archive after 12 hours instead of 24
profile_archive_pending_timeout_seconds = 43200

# Disable archiving (restore the previous behavior: old profiles are deleted)
enable_profile_archive = false

Usage

For System Administrators

Step 1: Locate Slow Query

-- Queries from the last day that ran longer than 10 s (query_time is in milliseconds)
SELECT query_id, time, frontend_ip, query_time
FROM __internal_schema.audit_log
WHERE time >= NOW() - INTERVAL 1 DAY
  AND query_time > 10000
ORDER BY query_time DESC;

Step 2: Find Archive File

ssh user@<frontend_ip>
cd ${LOG_DIR}/profile/archive
ls -lh profiles_*.zip

Step 3: Extract and Analyze

unzip profiles_20240101_120000_20240101_130000.zip -d /tmp/analysis/
ls /tmp/analysis/ | grep <query_id>
vim /tmp/analysis/<timestamp>_<query_id>.profile
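
As an alternative to extracting the whole archive, the entry names can be listed and filtered programmatically. The following is a sketch only: ArchiveSearch and entriesForQuery are hypothetical names, and it assumes that each entry name contains the query ID, as the {timestamp}_{queryid} naming above suggests.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that looks up a query's profile inside an archive ZIP. */
final class ArchiveSearch {
    private ArchiveSearch() {}

    /** Returns the sorted names of all entries whose name contains queryId. */
    static List<String> entriesForQuery(Path archiveZip, String queryId) throws IOException {
        try (ZipFile zip = new ZipFile(archiveZip.toFile())) {
            return zip.stream()
                    .map(ZipEntry::getName)
                    .filter(name -> name.contains(queryId))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Only the ZIP directory is read here, so the check stays cheap even for archives built with a large profile_archive_batch_size.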

Space Management

# Check archive storage usage
du -sh ${LOG_DIR}/profile/archive

# Manual cleanup (if needed beyond automatic retention)
find ${LOG_DIR}/profile/archive -name "profiles_*.zip" -mtime +90 -delete

Backward Compatibility

  • Fully backward compatible - existing profile storage continues to work
  • Enabled by default - archives are created automatically
  • Can be disabled - set enable_profile_archive = false to restore old behavior
  • No schema changes - no database migration required

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merges this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 31, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@peterylh
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 189517 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e9cc48325dedfbfa26c7b5202270661e5b67a863, data reload: false

query1	1036	402	397	397
query2	6587	1684	1684	1684
query3	6751	222	225	222
query4	26036	23847	23131	23131
query5	5090	624	470	470
query6	335	244	233	233
query7	4644	516	304	304
query8	308	256	249	249
query9	8722	2596	2599	2596
query10	561	353	299	299
query11	15348	15117	14984	14984
query12	194	120	116	116
query13	1705	580	465	465
query14	11815	9309	9334	9309
query15	268	200	177	177
query16	7774	681	532	532
query17	1634	815	664	664
query18	2078	485	357	357
query19	245	246	198	198
query20	147	140	131	131
query21	227	153	125	125
query22	4681	4880	4609	4609
query23	34846	33478	33781	33478
query24	8377	2477	2550	2477
query25	615	557	463	463
query26	1239	277	161	161
query27	2988	550	364	364
query28	4451	2243	2253	2243
query29	796	685	509	509
query30	302	247	209	209
query31	963	871	819	819
query32	78	71	76	71
query33	593	374	354	354
query34	826	866	543	543
query35	816	848	774	774
query36	978	1022	939	939
query37	125	117	89	89
query38	3750	3745	3514	3514
query39	1463	1426	1402	1402
query40	213	125	118	118
query41	59	61	56	56
query42	128	113	107	107
query43	481	508	474	474
query44	1203	749	731	731
query45	181	179	169	169
query46	875	979	625	625
query47	1761	1820	1725	1725
query48	398	419	323	323
query49	740	498	413	413
query50	629	690	398	398
query51	3881	3972	3909	3909
query52	107	107	102	102
query53	235	261	195	195
query54	300	296	286	286
query55	85	86	83	83
query56	317	317	319	317
query57	1156	1225	1119	1119
query58	295	277	279	277
query59	2615	2683	2552	2552
query60	339	347	325	325
query61	158	163	168	163
query62	811	715	655	655
query63	232	185	188	185
query64	4621	1325	979	979
query65	4036	3986	3992	3986
query66	1103	439	352	352
query67	15561	15506	15018	15018
query68	8392	922	595	595
query69	516	324	306	306
query70	1305	1310	1236	1236
query71	466	354	328	328
query72	6036	4983	4917	4917
query73	657	581	358	358
query74	9171	9067	8713	8713
query75	3757	3355	2798	2798
query76	3556	1165	750	750
query77	800	420	312	312
query78	9574	9613	8882	8882
query79	2472	868	596	596
query80	697	567	515	515
query81	527	260	232	232
query82	451	161	130	130
query83	264	266	255	255
query84	258	114	100	100
query85	910	499	451	451
query86	417	312	283	283
query87	3669	3757	3564	3564
query88	3770	2251	2233	2233
query89	395	313	301	301
query90	1993	212	213	212
query91	184	170	139	139
query92	86	70	66	66
query93	2131	950	637	637
query94	728	458	345	345
query95	400	318	316	316
query96	493	580	278	278
query97	2924	2969	2873	2873
query98	244	216	207	207
query99	1381	1400	1286	1286
Total cold run time: 280536 ms
Total hot run time: 189517 ms

@doris-robot

ClickBench: Total hot run time: 27.5 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e9cc48325dedfbfa26c7b5202270661e5b67a863, data reload: false

query1	0.04	0.05	0.05
query2	0.09	0.04	0.05
query3	0.26	0.08	0.08
query4	1.61	0.11	0.11
query5	0.28	0.28	0.25
query6	1.17	0.64	0.63
query7	0.03	0.02	0.03
query8	0.06	0.04	0.04
query9	0.59	0.52	0.53
query10	0.58	0.58	0.57
query11	0.16	0.12	0.11
query12	0.16	0.12	0.12
query13	0.62	0.60	0.60
query14	1.00	1.00	1.00
query15	0.84	0.82	0.84
query16	0.40	0.40	0.40
query17	1.03	1.04	1.01
query18	0.21	0.23	0.20
query19	1.95	1.80	1.88
query20	0.01	0.02	0.01
query21	15.47	0.20	0.12
query22	5.00	0.06	0.05
query23	15.68	0.26	0.10
query24	2.54	1.15	0.44
query25	0.08	0.06	0.06
query26	0.13	0.13	0.14
query27	0.07	0.06	0.05
query28	4.79	1.16	0.94
query29	12.59	3.93	3.28
query30	0.27	0.14	0.11
query31	2.81	0.60	0.37
query32	3.23	0.56	0.47
query33	3.00	3.04	3.13
query34	15.90	5.14	4.52
query35	4.61	4.58	4.61
query36	0.66	0.50	0.50
query37	0.10	0.08	0.07
query38	0.07	0.04	0.04
query39	0.04	0.02	0.03
query40	0.18	0.14	0.14
query41	0.08	0.04	0.03
query42	0.04	0.04	0.03
query43	0.05	0.03	0.03
Total cold run time: 98.48 s
Total hot run time: 27.5 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 69.35% (233/336) 🎉

@peterylh
Contributor Author

run buildall

@doris-robot

ClickBench: Total hot run time: 29.43 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 70230004f6f0750840cdab1f56644cab937c5478, data reload: false

query1	0.06	0.06	0.05
query2	0.10	0.06	0.05
query3	0.26	0.09	0.09
query4	1.61	0.12	0.13
query5	0.30	0.28	0.27
query6	1.21	0.69	0.68
query7	0.04	0.03	0.03
query8	0.06	0.04	0.04
query9	0.66	0.57	0.59
query10	0.65	0.62	0.61
query11	0.19	0.13	0.13
query12	0.18	0.14	0.13
query13	0.64	0.62	0.62
query14	1.04	1.03	1.02
query15	0.90	0.88	0.87
query16	0.44	0.42	0.45
query17	1.12	1.20	1.17
query18	0.24	0.22	0.22
query19	2.06	1.92	1.92
query20	0.01	0.01	0.02
query21	15.38	0.20	0.15
query22	5.05	0.09	0.06
query23	15.63	0.30	0.11
query24	2.27	0.78	1.21
query25	0.10	0.07	0.06
query26	0.17	0.17	0.16
query27	0.08	0.06	0.06
query28	5.70	1.21	0.95
query29	12.60	4.67	3.80
query30	0.30	0.15	0.13
query31	2.85	0.67	0.41
query32	3.24	0.59	0.48
query33	3.17	3.10	3.17
query34	15.97	5.29	4.60
query35	4.69	4.70	4.62
query36	0.72	0.53	0.54
query37	0.11	0.08	0.07
query38	0.07	0.05	0.04
query39	0.05	0.04	0.04
query40	0.18	0.15	0.14
query41	0.10	0.04	0.03
query42	0.05	0.04	0.04
query43	0.06	0.05	0.05
Total cold run time: 100.31 s
Total hot run time: 29.43 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 9.52% (32/336) 🎉
