[controller] Ensure real-time topic partition count matches hybrid version partition count #1338

Merged

Conversation

sushantmane
Collaborator

@sushantmane sushantmane commented Nov 22, 2024

Ensure real-time topic partition count matches hybrid version partition count

This fix addresses an issue where the real-time topic partition count did not align
with the hybrid version partition count, which caused ingestion failures in hybrid
stores.

Problem Description

The issue occurs in the following scenario:

  1. Create a store with 1 partition.
  2. Perform a batch push, creating a batch version with 1 partition.
  3. Update the store to 3 partitions and convert it to a hybrid store.
  4. Start real-time writes using push type STREAM.
  5. Perform a full push to create a hybrid version with 3 partitions. This push fails
    because, after the topic switch, real-time consumers cannot find partitions 2 and 3
    due to the real-time topic having only 1 partition.
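
To make the failure in step 5 concrete, here is a minimal Java sketch of which partitions a consumer would look for and not find. `MissingRtPartitions` is a hypothetical helper written for illustration only, not Venice code, and it uses the same 1-based partition numbering as the steps above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Venice: lists the hybrid version
// partitions that have no counterpart in the real-time (RT) topic.
final class MissingRtPartitions {
  private MissingRtPartitions() {
  }

  // Returns 1-based partition numbers that exist in the version but not in the RT topic.
  static List<Integer> missing(int rtPartitionCount, int versionPartitionCount) {
    List<Integer> missing = new ArrayList<>();
    for (int p = rtPartitionCount + 1; p <= versionPartitionCount; p++) {
      missing.add(p);
    }
    return missing;
  }
}
```

With an RT topic of 1 partition and a hybrid version of 3 partitions, `MissingRtPartitions.missing(1, 3)` yields partitions 2 and 3, which is the gap described in step 5.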

Root Cause

  • In step 4, if the real-time topic did not exist, it was created with a partition
    count derived from the largest existing version (the batch version with 1
    partition) rather than from the store's updated partition count of 3, which led
    to the mismatch.
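
The derivation described in the root cause can be sketched roughly as follows. This is an assumption-laden illustration, not the controller's code: `RtPartitionCountDerivation`, its method name, and the fallback to the store-level count when no version exists are all hypothetical.

```java
import java.util.NavigableMap;

// Hypothetical illustration of the old behavior described above, not Venice code.
final class RtPartitionCountDerivation {
  private RtPartitionCountDerivation() {
  }

  // Picks the partition count of the highest-numbered existing version.
  // Falling back to the store-level count when there are no versions is an assumption.
  static int fromLargestVersion(NavigableMap<Integer, Integer> partitionCountByVersion, int storePartitionCount) {
    if (partitionCountByVersion.isEmpty()) {
      return storePartitionCount;
    }
    return partitionCountByVersion.lastEntry().getValue();
  }
}
```

In the scenario above, the only existing version is the batch version with 1 partition, so this derivation returns 1 even though the store is already configured with 3 partitions.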

Solution

  • Stream and incremental push types are now disallowed if there is no online hybrid
    version.
  • If an online hybrid version exists, the real-time topic partition count is now
    validated to match the hybrid version partition count.
  • The requestTopicForPushing method no longer creates a real-time topic if it does
    not already exist, except during new hybrid version creation.
  • For new hybrid versions, the real-time topic is created at version creation time
    instead of during a topic switch. The logic for real-time topic creation during a
    topic switch is retained for backward compatibility and will be removed once the
    release with these changes is fully rolled out.
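
As a rough sketch of the checks listed above for stream and incremental push requests: the class, enum, and parameter names below are hypothetical and do not reflect the actual `CreateVersion` code. Rejecting the request when the real-time topic is missing is an assumption inferred from the third bullet.

```java
import java.util.OptionalInt;

// Hypothetical illustration of the validation rules above, not Venice code.
final class StreamPushCheck {
  enum Result {
    ALLOWED, NO_ONLINE_HYBRID_VERSION, RT_TOPIC_MISSING, PARTITION_COUNT_MISMATCH
  }

  private StreamPushCheck() {
  }

  // hybridVersionPartitions: partition count of the online hybrid version, empty if none exists.
  // rtPartitions: partition count of the real-time topic, empty if the topic does not exist.
  static Result check(OptionalInt hybridVersionPartitions, OptionalInt rtPartitions) {
    if (!hybridVersionPartitions.isPresent()) {
      return Result.NO_ONLINE_HYBRID_VERSION;
    }
    if (!rtPartitions.isPresent()) {
      // Assumption: the request path no longer creates the topic, so it is rejected here.
      return Result.RT_TOPIC_MISSING;
    }
    if (hybridVersionPartitions.getAsInt() != rtPartitions.getAsInt()) {
      return Result.PARTITION_COUNT_MISMATCH;
    }
    return Result.ALLOWED;
  }
}
```

Returning a result value instead of throwing keeps the sketch easy to unit test, in the same spirit as the `requestTopicForPushing` refactoring mentioned below.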

Miscellaneous Changes

  • Refactored CreateVersion::requestTopicForPushing to improve unit testability.
  • Added similar validation checks for incremental push job types.
  • Stopped creating real-time topics for regional system stores like meta and PS3 in
    the parent region.
  • Improved logging around topic switches to make it easier to debug and search.

How was this PR tested?

UTs and E2E in CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@sushantmane sushantmane marked this pull request as draft November 22, 2024 19:51
@sushantmane
Collaborator Author

Moving to draft, as I would like to move topic creation into addVersion and see how that works out.

@arjun4084346
Contributor

I agree with the solution "STREAM push type is now disallowed if there is no online hybrid version." However, I think, instead of returning error, we should create a new version (that would be of right partition count) and then let the push start.
If we do not auto create the version, we can at least tell the user to create a version, if we find store's partition count different from the VT's partition count.

If we look carefully, changing partition count without creating a version (even for batch stores) is similar to changing partition for a hybrid store; even for the batch stores, because batch stores can be changed to hybrid

I am skeptical about "The requestTopicForPushing method no longer creates a real-time topic if it does not already exist."; I believe in most of the cases, this is a benign operation and we should not prevent creation of the RT.

@sushantmane
Collaborator Author

sushantmane commented Dec 6, 2024

we should create a new version (that would be of right partition count)

I think this is not safe in some cases, for example, user wants to create a new version but keep old data.

If we look carefully, changing partition count without creating a version (even for batch stores) is similar to changing partition for a hybrid store; even for the batch stores, because batch stores can be changed to hybrid

I don't really follow the rest of the comment. Could you please elaborate? Thanks.

changing partition for a hybrid store

We do not allow this as of now.

I believe in most cases, this is a harmless operation, and we should not prevent the creation of the RT.

I disagree that this is a harmless operation. The case mentioned in the commit message is one example of why this is not a good idea. Moreover, we should tie RT creation to hybrid version materialization for better alignment and not let RT creation happen in every other place, IMHO.

@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch 7 times, most recently from c0d5fa3 to c709994 Compare December 12, 2024 01:59
@sushantmane sushantmane marked this pull request as ready for review December 18, 2024 05:42
@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch from 6547aa6 to 7f27513 Compare December 18, 2024 10:53
@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch from 7f27513 to 07d57b0 Compare January 6, 2025 17:25
@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch from c2492aa to 68d59b1 Compare January 10, 2025 20:05
…rtition count

This fix addresses an issue where the real-time topic partition count did not align with the hybrid version
partition count, causing errors during hybrid store operations. The issue occurred in the following scenario:

1. Create a store with 1 partition.
2. Perform a batch push, creating a batch version with 1 partition.
3. Update the store to 3 partitions and convert it to a hybrid store.
4. Start real-time writes using push type STREAM.
5. Perform a full push to create a hybrid version with 3 partitions. This push fails because, after the topic
   switch, real-time consumers cannot find partitions 2 and 3 due to the real-time topic having only 1 partition.

Root Cause:
- In step 4, if the real-time topic did not exist, it was created with a partition count derived from the largest
  existing version (batch version with 1 partition), leading to a mismatch.

Solution:
- STREAM push type is now disallowed if there is no online hybrid version.
- If an online hybrid version exists, it ensures the real-time topic partition count matches the hybrid version
  partition count.
- The `requestTopicForPushing` method no longer creates a real-time topic if it does not already exist.

Move real-time topic creation logic in addVersion

enable participant message store and revert HB interval

Fix tests in 1430

Fix test

Fix tests

Fix cc tests

revert log4j2

Disable topic creation in RT topic switcher

Fix flakies

Do not call getRealTimeTopic

fix tests
sixpluszero
sixpluszero previously approved these changes Jan 13, 2025
Contributor

@sixpluszero sixpluszero left a comment

LGTM. Thank you for the changes!

@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch from 68d59b1 to c73ec4c Compare January 13, 2025 19:05
@sushantmane sushantmane changed the title from "[controller] Fix mismatch between hybrid version partition count and real-time partition count" to "[controller] Ensure real-time topic partition count matches hybrid version partition count" Jan 13, 2025
@sushantmane
Collaborator Author

sushantmane commented Jan 13, 2025

Thanks a lot, @sixpluszero, @lusong64, and @arjun4084346 for the review!

@sushantmane sushantmane enabled auto-merge (squash) January 13, 2025 19:31
@sushantmane sushantmane disabled auto-merge January 13, 2025 19:31
@sushantmane sushantmane enabled auto-merge (squash) January 13, 2025 19:31
@sushantmane sushantmane disabled auto-merge January 13, 2025 19:50
@sushantmane sushantmane force-pushed the li-hybrid-store-partition-count-issue branch from c73ec4c to d32b206 Compare January 13, 2025 19:59
@sushantmane sushantmane merged commit 6ee2a5d into linkedin:main Jan 14, 2025
56 checks passed
@sushantmane sushantmane deleted the li-hybrid-store-partition-count-issue branch January 31, 2025 03:04
4 participants