Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PM-14360] Import Batching #703

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

Conversation

BTreston
Copy link
Contributor

@BTreston BTreston commented Dec 27, 2024

🎟️ Tracking

https://bitwarden.atlassian.net/browse/PM-14360

📔 Objective

This PR implements services that allow the directory connector to batch large import requests into multiple small requests to avoid potential timeouts due to packet size exceeding default size limits for some servers. This implementation splits the requests into batches of 2000 users/groups and POSTs them sequentially.

This change also refactors some of the sync service to be more testable and removes some directory related code that does not belong. Also included are some new unit tests to the new functionality and the sync service.

📸 Screenshots

⏰ Reminders before review

  • Contributor guidelines followed
  • All formatters and local linters executed and passed
  • Written new unit and / or integration tests where applicable
  • Used internationalization (i18n) for all UI strings
  • CI builds passed
  • Communicated to DevOps any deployment requirements
  • Updated any necessary documentation (Confluence, contributing docs) or informed the documentation team

🦮 Reviewer guidelines

  • 👍 (:+1:) or similar for great changes
  • 📝 (:memo:) or ℹ️ (:information_source:) for notes or general info
  • ❓ (:question:) for questions
  • 🤔 (:thinking:) or 💭 (:thought_balloon:) for more open inquiry that's not quite a confirmed issue and could potentially benefit from discussion
  • 🎨 (:art:) for suggestions / improvements
  • ❌ (:x:) or ⚠️ (:warning:) for more significant problems or concerns needing attention
  • 🌱 (:seedling:) or ♻️ (:recycle:) for future improvements or indications of technical debt
  • ⛏ (:pick:) for minor or nitpick changes

@BTreston BTreston requested a review from eliykat December 27, 2024 15:19
Copy link
Contributor

github-actions bot commented Dec 27, 2024

Logo
Checkmarx One – Scan Summary & Details45cb4f5b-99da-43d1-9bf1-8578dd87835e

New Issues (1)

Checkmarx found the following issues in this Pull Request

Severity Issue Source File / Package Checkmarx Insight
LOW Use_Of_Hardcoded_Password /src/utils/test-fixtures.ts: 21
detailsThe application uses the hard-coded password "admin" for authentication purposes, either using it to verify users' identities, or to access another...
Attack Vector
Fixed Issues (1)

Great job! The following issues were fixed in this Pull Request

Severity Issue Source File / Package
LOW Use_Of_Hardcoded_Password /src/services/ldap-directory.service.integration.spec.ts: 175

@BTreston
Copy link
Contributor Author

BTreston commented Dec 27, 2024

@eliykat Whenever you're back and able to give this first pass a look over, I just had a few questions:

  • it seems like jest doesnt like ECMAScript Modules (which are used in the googleapis) and plugging it into the existing transformer in the jest config didn't seem to fix it. Not sure how to handle the resulting error.
  • stripe has a cap on how large a transaction can be (999,999.99 USD) so there's a soft limit to how many users you can import and provision seats for at one time. Have we run into this before?
  • in terms of batching groups, do we want to batch based on the number of groups, or the number of users within the groups?

I also still have a few todo's:

  • More unit tests for the sync service (ie. for generateNewHash, buildRequests)
  • Refactor the sync call into more testable pieces, potentially move some code that might not belong in this service (looking at some of the directory stuff in this service)

Copy link
Member

@eliykat eliykat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, here are a few comments but I'll take a closer look once you've finished your draft.

To answer your questions:

it seems like jest doesnt like ECMAScript Modules (which are used in the googleapis) and plugging it into the existing transformer in the jest config didn't seem to fix it. Not sure how to handle the resulting error.

Looking at the jest documentation, it doesn't have great support for ECMAScript. I think the easiest solution is to decouple the GsuiteDirectoryService from SyncService, thus avoiding importing the google apis into jest at all. e.g. move getDirectoryService to its own class with its own abstraction, and have SyncService only inject the abstraction.

This will become a problem again if/when we get a Gsuite testbed set up, but I think that solves it for now, and is a better design.

stripe has a cap on how large a transaction can be (999,999.99 USD) so there's a soft limit to how many users you can import and provision seats for at one time. Have we run into this before?

We have not run into this before, but spending $1 million USD in a single provisioning sync seems unlikely. I think we can leave it unhandled.

in terms of batching groups, do we want to batch based on the number of groups, or the number of users within the groups?

The goal here is to avoid timeouts at the network edge, so there isn't any hard and fast rule as to how we should split our requests. What do you think?

I can see an argument for splitting based on the number of users in each group - that's the property that scales as you add more users. But my assumption was that these timeouts are actually fairly generous (this request data is very small and this is the first time I've heard of requests timing out), so the simplest approach of batching by n users and n groups would be enough to resolve the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both strategies should implement this interface. For example, a RequestBuilder interface which has BatchRequestBuilder and SingleRequestBuilder implementations for the new and existing behaviour respectively.

Comment on lines 241 to 260
return [
new OrganizationImportRequest({
groups: (groups ?? []).map((g) => {
return {
name: g.name,
externalId: g.externalId,
memberExternalIds: Array.from(g.userMemberExternalIds),
};
}),
users: (users ?? []).map((u) => {
return {
email: u.email,
externalId: u.externalId,
deleted: u.deleted || (removeDisabled && u.disabled),
};
}),
overwriteExisting: overwriteExisting,
largeImport: largeImport,
}),
];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on from my comment above - this would be moved into the SingleRequestBuilder implementation.

@@ -133,6 +116,36 @@ export class SyncService {
}
}

async generateNewHash(reqJson: string): Promise<string> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is doing too much - it's called generateNewHash, but it's both generating a new hash and comparing it to the existing hash from state. And its return value is not very clear - a new hash if it's different, or an empty string if it's not.

I suggest that generateNewHash only generates a hash (generateHash really), and the comparison is done elsewhere (either a separate method or inline).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory is used for integration tests. Because your new service is more or less functional - i.e. it's just processing a set of inputs to a set of outputs - I'm not sure we need dedicated integration tests on it.

If you disagree and want to integration test it, I don't think we should maintain a set of 11k users and 11k groups. Ideally you'd specify a lower batchSize and use a more maintanable set of test data (e.g. 250 users with a batchSize of 45). I recall we discussed trying to replicate real world conditions as much as possible, but this is a lot of fixture data to maintain forever. And would be comparatively slow to run in the pipeline.

Comment on lines 41 to 44
expect(requests.length).toEqual(
Math.ceil(mockGroups.length / batchingService.batchSize) +
Math.ceil(mockUsers.length / batchingService.batchSize),
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost always prefer constants in assertions, otherwise you risk rewriting the logic of the SUT.

For example - you know that you've generated 11k users and you have batch sizes of 2k, so you know you're going to have 6 user requests. Assert that your request has a length of 6. (example only, obviously you have groups as well.)

I think it would help with this to specify the batchSize in the constructor of the service. That way, the test can specify a fixed batch size and base its assertions on that, without affecting the batch size used by the production code.

Comment on lines 14 to 31
userSimulator = (userCount: number) => {
const simulatedArray: UserEntry[] = [];
for (let i = 0; i < userCount; i += batchingService.batchSize) {
for (let j = 0; j <= batchingService.batchSize; j++) {
simulatedArray.push(new UserEntry());
}
}
return simulatedArray;
};

groupSimulator = (groupCount: number) => {
const simulatedArray: GroupEntry[] = [];
for (let i = 0; i < groupCount; i += batchingService.batchSize) {
for (let j = 0; j <= batchingService.batchSize; j++) {
simulatedArray.push(new GroupEntry());
}
}
return simulatedArray;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this needs to know about batchSize - it should just generate the number of mock objects requested. Particularly because this is generating mock input data, which is usually supplied by a directory service which knows nothing about batching.

@BTreston BTreston marked this pull request as ready for review January 7, 2025 15:30
@BTreston BTreston requested a review from a team as a code owner January 7, 2025 15:30
@BTreston BTreston requested review from JimmyVo16 and eliykat January 7, 2025 15:30
@BTreston BTreston changed the title initial implementation [PM-14360] Import Batching Jan 17, 2025
Copy link
Member

@eliykat eliykat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me! Comments are all fairly minor.

jslib/angular/src/services/jslib-services.module.ts Outdated Show resolved Hide resolved
jslib/angular/src/services/jslib-services.module.ts Outdated Show resolved Hide resolved
jslib/common/src/abstractions/directory-factory.service.ts Outdated Show resolved Hide resolved
src/app/services/services.module.ts Outdated Show resolved Hide resolved
src/bwdc.ts Show resolved Hide resolved
src/services/sync.service.spec.ts Outdated Show resolved Hide resolved
src/services/sync.service.spec.ts Outdated Show resolved Hide resolved
src/services/sync.service.ts Outdated Show resolved Hide resolved
jslib/common/src/services/batch-requests.service.ts Outdated Show resolved Hide resolved
jslib/common/src/services/batch-requests.service.spec.ts Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 60.86957% with 36 lines in your changes missing coverage. Please review.

Project coverage is 12.15%. Comparing base (23713d9) to head (7d7623e).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/services/directory-factory.service.ts 0.00% 17 Missing ⚠️
src/bwdc.ts 0.00% 7 Missing ⚠️
src/app/services/services.module.ts 0.00% 4 Missing ⚠️
src/services/single-request-builder.ts 75.00% 0 Missing and 2 partials ⚠️
src/services/sync.service.ts 93.10% 2 Missing ⚠️
src/abstractions/directory-factory.service.ts 0.00% 1 Missing ⚠️
src/abstractions/request-builder.service.ts 0.00% 1 Missing ⚠️
...ervices/ldap-directory.service.integration.spec.ts 0.00% 1 Missing ⚠️
src/utils/test-fixtures.ts 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##            main     #703      +/-   ##
=========================================
+ Coverage   2.23%   12.15%   +9.91%     
=========================================
  Files         60       66       +6     
  Lines       2634     2682      +48     
  Branches     467      473       +6     
=========================================
+ Hits          59      326     +267     
+ Misses      2572     2329     -243     
- Partials       3       27      +24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@BTreston BTreston requested a review from eliykat January 22, 2025 20:20
Copy link
Member

@eliykat eliykat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a test failure here that I can't explain. Is it actually trying to use apiService somewhere? https://github.com/bitwarden/directory-connector/runs/36020428086#r0s26

jslib/angular/src/services/jslib-services.module.ts Outdated Show resolved Hide resolved
src/services/sync-config-helpers.ts Outdated Show resolved Hide resolved
src/services/sync.service.spec.ts Outdated Show resolved Hide resolved
src/services/default-single-request-builder.ts Outdated Show resolved Hide resolved
src/services/default-batch-requests-builder.spec.ts Outdated Show resolved Hide resolved
@BTreston BTreston requested a review from eliykat January 27, 2025 18:19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've duplicated this file instead of renaming it (there is now a default-batch-request-builder.ts and a batch-requests-builder.ts).

Comment on lines 57 to 59
directoryFactory.createService.mockReturnValue(
new LdapDirectoryService(logService, i18nService, stateService),
mockDirectoryService as unknown as IDirectoryService,
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's how to do this with jest without having to write your own mock implementation and mess around with the types not matching:

const mockDirectoryService = mock<LdapDirectoryService>();
mockDirectoryService.getEntries.mockResolvedValue([groupFixtures, userFixtures]);
directoryFactory.createService.mockReturnValue(ldapDirectoryService);

);

syncService = new SyncService(
cryptoFunctionService,
apiService,
apiService as unknown as ApiService,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this type cast necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover artifact from trying to fix that one test failure.

import { GroupEntry } from "@/src/models/groupEntry";
import { UserEntry } from "@/src/models/userEntry";

export abstract class RequestBuilderAbstratction {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abstraction suffix is not required:

Suggested change
export abstract class RequestBuilderAbstratction {
export abstract class RequestBuilder {

(then your names should be good!)

Comment on lines +104 to +105
// @ts-expect-error This is a workaround to make the batchsize smaller to trigger the batching logic since its a const.
constants.batchSize = 4;
Copy link
Member

@eliykat eliykat Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to know if there's a better way of doing this. This feels hacky, but I don't have any better ideas at the moment. 🤔 It's more important that it's readonly at runtime.

Non-blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not a fan of this either, couldn't come up with anything better. Tried doing jest stuff to mock it but none of the myriad solutions offered through searching quite fit the bill.

@BTreston BTreston requested a review from eliykat January 31, 2025 15:53
Copy link
Member

@eliykat eliykat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks for working through my feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants