Generate multiple collections #73

Merged
merged 10 commits into from
Apr 24, 2024

Member

@lovromazgon lovromazgon commented Apr 16, 2024

Description

Important

Note that 235,975 added lines come from the file internal/words.txt, which is just the contents of /usr/share/dict/words on my machine. The idea behind this is to generate random words rather than UUIDs. It's also an order of magnitude faster than generating a UUID; the price is a 2MB bigger binary.

Adds support for generating records coming from multiple collections with different schemas. The connector's configuration changed to support the new functionality; please refer to the readme and specification for the new parameters.

Some backwards-incompatible changes were made to parameters:

  • format.options is now a map[string]string, meaning that configuring fields is now done using separate parameters (e.g. format.options.my-field-name: "string")
  • operation no longer accepts the value random; instead, it takes a comma-separated list of operations (e.g. create,update)
  • the parameter readTime was deprecated in favor of rate, which lets you define the rate limit in records per second (the old parameter still works)
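
To make the new operation format concrete, here is a minimal sketch of parsing a comma-separated list. The function name parseOperations and the set of accepted values are assumptions; the values are only the operations that appear in this PR's example configuration, and the connector's actual validation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// validOperations lists the operation names seen in this PR's example
// configuration. Assumption: the connector's real set may differ.
var validOperations = map[string]bool{
	"create": true, "update": true, "delete": true, "snapshot": true,
}

// parseOperations splits a value such as "create,update" and rejects
// unknown or empty entries (including the removed value "random").
func parseOperations(raw string) ([]string, error) {
	var ops []string
	for _, part := range strings.Split(raw, ",") {
		op := strings.TrimSpace(part)
		if !validOperations[op] {
			return nil, fmt.Errorf("invalid operation %q", op)
		}
		ops = append(ops, op)
	}
	return ops, nil
}

func main() {
	ops, err := parseOperations("create,update")
	fmt.Println(ops, err) // prints: [create update] <nil>
}
```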

I've taken the opportunity to overhaul the connector. Rate limiting is now smarter and takes into account the time since the last read (i.e. if the rate limit is 1 rec/s and Conduit executes reads only once per second, then the connector won't add any delay). Bursts now start in the generating phase (instead of in the sleep phase, as before). The operation of the generated record is taken into account when populating the fields .Payload.After and .Payload.Before, so that records look more realistic. The type time now generates a Unix timestamp in nanoseconds.

Example

You can use this pipeline configuration to test the connector. It is configured to generate 10 records per second in bursts of 1 second, while taking a break for 3 seconds between bursts. Records will be logged on the destination.

version: 2.2
pipelines:
  - id: noop
    status: running
    connectors:
      - id: gen
        type: source
        plugin: standalone:generator
        settings:
          # global settings
          rate: 10 # generated records per second
          burst.sleepTime: 3s # duration of sleep between bursts
          burst.generateTime: 1s # duration of burst
          # collection "gen-in" (make sure file gen.in exists, can contain anything)
          collections.gen-in.format.type: file
          collections.gen-in.format.options.path: gen.in
          collections.gen-in.operation: create,update
          # collection "str"
          collections.str.format.type: structured
          collections.str.format.options.id: int
          collections.str.format.options.name: string
          collections.str.format.options.admin: bool
          collections.str.format.options.joined: time
          collections.str.operation: delete
          # collection "raw"
          collections.raw.format.type: raw
          collections.raw.format.options.id: int
          collections.raw.format.options.name: string
          collections.raw.operation: snapshot
      - id: empty.out
        type: destination
        plugin: builtin:log
        settings:
          level: INFO

Closes ConduitIO/conduit#1475

Quick checks:

  • I have followed the Code Guidelines.
  • There is no other pull request for the same update/change.
  • I have written unit tests.
  • I have made sure that the PR is of reasonable size and can be easily reviewed.

@lovromazgon lovromazgon changed the title [WIP] Generate multiple collections Generate multiple collections Apr 17, 2024
@lovromazgon lovromazgon marked this pull request as ready for review April 17, 2024 16:09
@lovromazgon lovromazgon requested a review from a team as a code owner April 17, 2024 16:09
README.md Outdated
@@ -30,6 +30,8 @@ specified time, and then repeating the same cycle. The connector always start wi

### Configuration

// TODO update this section
Member Author

I'll update the readme shortly (probably tomorrow).

Member

That'd be super, thank you!

Member Author

@raulb @hariso can you take one more look at the readme if you get a chance? 🙏

Member

@raulb raulb left a comment

@lovromazgon finished reviewing. Code looks great. I think the only things that are missing are updating documentation in README, and changing to Operations instead of Operation in the type ConfigCollection.

Contributor

@hariso hariso left a comment

LGTM (the comments are nitpicks only :))!

Member

@raulb raulb left a comment

LGTM!

It's possible to simulate a 'read' time for records. It's also possible to simulate bursts through "sleep and generate"
cycles, where the connector is sleeping for some time (not generating any records), then generating records for the
specified time, and then repeating the same cycle. The connector always start with the sleeping phase.
<!-- readmegen:source.parameters.table -->
Member

😍

@lovromazgon lovromazgon merged commit 5e16fd4 into main Apr 24, 2024
3 checks passed
@lovromazgon lovromazgon deleted the lovro/multiple-collections branch April 24, 2024 18:02
Successfully merging this pull request may close these issues.

MC: Generator Source - Generate records from multiple collections
3 participants