Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing repositories #286

Merged
merged 30 commits into from
Jan 21, 2025
Merged

Indexing repositories #286

merged 30 commits into from
Jan 21, 2025

Conversation

leoraba
Copy link
Contributor

@leoraba leoraba commented Nov 26, 2024

Summary:

  • Repository package: Added new repository package as part of a monorepo package. The purpose is to create an interface to manage interactions with data source repositories, and leverage integrations to systems like Song or Lyric. New environment variables were added to facilitate repository configuration.
  • Indexing Lyric as source data repository: Implemented Lyric as data source repository for indexing the entire repository by category (all studies/organizations), a specific study/organization, or an individual document by it's unique ID. Issue Maestro V5 - Lyric integration #283
  • Documentation: Updated README files and docs/ folder according new implementations.
  • Build image: Updated Dockerfile to build Maestro Docker image using Node.js 22 as base
  • NPM publish: Readiness to publish Maestro npm package

Related issues:

@joneubank
Copy link

For future consideration, each of the bullet points in the PR description could be its own separate change request. This would vastly simplify reviews and shorten time for revisions.

Copy link

@joneubank joneubank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of great work. Can see that this will work as is, but I found a couple functional changes (in addition to code style things). Of importance:

  • chunk the bulk data requests to ES to ensure that we have a maximum request size
  • clean up the repository requests to paginated endpoints to prevent infinite loops

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the pretty printing in the dev console, but it can be a lot of noise in a log file in a production env where the colors arent being rendered. I'm curious to see how these logs look in a deployed environment and if we might want a flag to disable the pretty printing.

Might be nice to have some env log options for enabling pretty printing, logging as json, and choosing the log level, etc.

(Honestly we should standardize all this into a logger we can reuse across overture packages but thats another issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Development environment: Moved pretty printing to dev only (as recommended here).
  • Production: standard JSON format with ISO date format.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cool work parsing arrays of repositories. May want to have a separate file for the repo env processing, its kind of messy mixed in with the other env stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created a new file to separate repo env processing as suggested

const getRepositoryConfig = (
repos: (z.infer<typeof lyricSchemaDefinition> | z.infer<typeof songSchemaDefinition>)[],
): (LyricRepositoryConfig | SongRepositoryConfig)[] => {
const lyricRepos = repos

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare this as const lyricRepos: LyricRepositoryConfig[] and then you won't need the type assertion on the map return. This will then validate if your return type is correct as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix type assertion as requested. thanks for the suggestion!

Comment on lines 54 to 56
} catch (error) {
logger.error(`Error fetching Lyric records on category '${categoryId}'`, error);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the request always fails we will get stuck in an infinite look because hasMoreData is never updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the code to throw an error to stop infinite loop

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeat comment re: infinite loop potential across multiple functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw an error to stop infinite loop

Comment on lines 3 to 9
/**
* Finds the repository by its repository code
* @param repositories
* @param repoCode
* @returns
*/
export const getRepoInformation = (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description should clearly state that if the repository cannot be found this will return undefined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specified in comment when undefined is returned

Comment on lines 15 to 35
/**
* Determines if passed configuration is appropiate for Lyric
* @param object
* @returns
*/
export const isLyricConfiguration = (
object: LyricRepositoryConfig | SongRepositoryConfig,
): object is LyricRepositoryConfig => {
return object.type === 'LYRIC';
};

/**
* Determines if passed configuration is appropiate for Song
* @param object
* @returns
*/
export const isSongConfiguration = (
object: LyricRepositoryConfig | SongRepositoryConfig,
): object is SongRepositoryConfig => {
return object.type === 'SONG';
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since TS 5.5 you shouldn't need to declare the type assertion for this predicate, if this type narrowing is valid then TS will infer it on its own. (ie. remove object is songRepositoryConfig.

Since this is just using differentiated union on object.type you could probably replace these functions with a switch(repository.case) in the code that uses these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these 2 functions as they are no longer needed, replaced them with a switch case as suggested,Typescript handles automatically discriminated union to know each type

Comment on lines 37 to 43
/**
* Checks if a given value is an array of objects where each object has values that are either
* strings, numbers, booleans, `null`, or nested objects
* @param value
* @returns
*/
export const isArrayOfObjects = (value: unknown): value is Array<Record<string, DataRecordValue>> => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the purpose of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is specifically used when we expect the value to be strictly an array of objects. For instance, when fetching records from Song/Lyric, we expect to retrieve an array of records, any other type is considered an invalid response.

@leoraba leoraba force-pushed the indexing_repositories branch from 738b19a to 935f5ca Compare January 7, 2025 04:57
README.md Outdated
Comment on lines 85 to 87
- kafka
- elastic search
- other helper tools if you want like kafka rest proxy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- kafka
- elastic search
- other helper tools if you want like kafka rest proxy
- Kafka
- Elasticsearch
- other helper tools you may want, like a Kafka RESTful proxy

const repoCount = getRepoCount();

// Loop through the repositories found
for (let i = 0; i < repoCount; i++) {
Copy link
Member

@justincorrigible justincorrigible Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace this with repoCount.forEach to make things more declarative (and consistent with the rest of our codebases)
#ImperativeSucks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repoCount is a number, so this may be a reasonable case for a forLoop. We would want to have a utility function to re-write this in a declarative way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right... and in my defence, coffee hadn't kicked in.
my argument against imperative/procedural code stands, we're meant to tell the system "what" to do, not "how"

README.md Outdated

- **Maestro Common:** Designed to centralize common utilities, reusable functions, and TypeScript type definitions.

- **Maestro Indexer Client:** Abstracts communication with Elasticsearch clients, supporting both version 7 and version 8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Maestro Indexer Client:** Abstracts communication with Elasticsearch clients, supporting both version 7 and version 8.
- **Maestro Indexer Client:** Abstracts communication with Elasticsearch clients, supporting both version 7 and 8.

docs/usage.md Outdated
- switch your Aliases if it still points to the old index
- now that your data is migrated Maestro will be indexing based on the new mapping.
- Create new index with the updated mapping
- Reindex your data (you can point maestro to your new index and trigger indexing on full repositories to do that,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Reindex your data (you can point maestro to your new index and trigger indexing on full repositories to do that,
- Reindex your data (you can point Maestro to your new index and trigger indexing on full repositories to do that,

docs/usage.md Outdated
@@ -63,66 +62,67 @@ curl -X POST \

## Kafka topics

Maestro can be configured as mentioned under the running configurations section to listen to kafka topics
Maestro can be configured as mentioned under the running configurations section to listen to kafka topics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Maestro can be configured as mentioned under the running configurations section to listen to kafka topics
Maestro can be configured as mentioned under the running configurations section to listen to Kafka topics

* @returns
*/
export const isArrayOfObjects = (value: unknown): value is Array<DataRecordNested> => {
return (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious to see this implemented in vanilla code, instead of Zod (and I like that).
why the choice though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh didn't thought about using Zod for this, initially this function was checking only arrays types were added later.

Copy link

@joneubank joneubank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking this has improved a bit since I first read it, though I am concerned we haven't addressed pagination, or ES bulk request chunking. I see Result objects for some responses which is good.

I know you plan on chunking some of those changes into a separate PR adn thats fine. Since this is a major refactor and being worked on in a branch separate form the primary, I say go ahead and merge this when you feel its ready, and I'll continue to review future change requests.

Comment on lines 25 to 28
MAESTRO_ELASTICSEARCH_VERSION: z.coerce
.number()
.int()
.refine((val) => val === 7 || val === 8, { message: 'Version must be 7 or 8' }), // Ensure 7 or 8

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this just be a zod enum? Doing a refine to a limited set of values seems way trickier. We also would probably like to have a list of valid versions to use as a switch in other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created a constant in common package to be used anywhere in the monorepo

const loggerConfigSchema = z.object({
MAESTRO_LOGGING_LEVEL_ROOT: z.string().default('info'),
MAESTRO_LOGGING_LEVEL: z.string().default('info'),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are a set of valid logging levels, lets define those. zod enum use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added enum with valid logger levels.

import { env } from './envConfig.js';
import { type lyricSchemaDefinition, repositoryTypes, type songSchemaDefinition } from './repositoryConfig.js';

setLogLevel(env.MAESTRO_LOGGING_LEVEL);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is happening at a global import level, not inside the provider instance. Isn't the logger to use part of the dependencies passed to each provider instance? The log level should be set there based on the dependencies/config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this setter function into the MaestroProviderConfig object. Good catch!

* @returns an array of valid repositories
* @throws an error if repository is invalid
*/
export const validateRepositories = (env: NodeJS.ProcessEnv) => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeJS.ProcessEnv I didn't know of this type, interesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's part of the typescript global types 😉


type SongRepositoryConfig = {
type ValueOf<T> = T[keyof T];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a common utility type that you could export from its own file for global use. Typically I call this Values<T>.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just call this file isEmpty.ts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is the case, this export in line 6 should become default.

however, there is a case for this validation to be a barrelled folder instead

utils
|-validation
  |-index (exports centrally from all files in this folder)
  |-isEmpty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created a folder validation to hold isEmpty.ts

* @param value The value to check. This can be of any type.
* @returns `true` if the value is `null`, `undefined`, an empty object, an empty array or an empty string; otherwise, `false`.
*/
export const isEmpty = (value: unknown) => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok, is there a test for this?

[null, undefined, '', ' ', '0', true, false, {}, []].map((value)=>({value,isEmpty:isEmpty(value)}))
[
  { value: null, isEmpty: true },
  { value: undefined, isEmpty: true },
  { value: "", isEmpty: true },
  { value: " ", isEmpty: false },
  { value: "0", isEmpty: false },
  { value: true, isEmpty: false },
  { value: false, isEmpty: false },
  { value: {}, isEmpty: true },
  { value: [], isEmpty: true }
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this has a suite of unit tests


return {
indexName: index,
successful,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this get addressed?

Copy link
Member

@justincorrigible justincorrigible Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inclined to suggest this could follow the standard "module" pattern, i.e. having this file named "types.ts" and placed within the "src/logger" folder. that pattern facilitates finding all code in their relative/contextual scopes. in this current pattern you have "buckets full of X", which may or not include the ones for this or that component or controller, or provider, or whatever.

let me explain myself: following what you did in logger, if i want to create a new button for example, and knowing me you know my button needs to look special, right? so it has all sorts of fancy styles and contains an emoji image in it to make it "cool!", so then I go create the "src/Button" folder and within I create the ts file for it (Button/Button.ts), then I'd go back out to src create an "images" folder where I will put all the images including my emoji for the button, then put the css file in a "src/styles" folder, and then put the types in this "src/types" folder here where all the types are?
later on, if we realize my button was silly and want to remove it from the codebase, I'll be chasing down files all over the repo instead of just deleting the one single folder and calling it a day. I don't know, this file/folder structure pattern seems scattered and hard to maintain, but I'm willing to be convinced otherwise.

@joneubank probably has a different opinion from mine?

> [!NOTE]
> This package is likely not the one you want to use in your project, it is primarily used as a dependency within the Maestro monorepo. [@overture-stack/maestro-provider](https://www.npmjs.com/package/@overture-stack/maestro-provider) is what you are most likely to want to use.

This package is an internal component of the Maestro monorepo designed to manage interactions with data source repositories. It serves as the central interface for retrieving data from various repositories, such as **Song** and **Lyric**, ensuring a streamlined and consistent approach to data access.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This package is an internal component of the Maestro monorepo designed to manage interactions with data source repositories. It serves as the central interface for retrieving data from various repositories, such as **Song** and **Lyric**, ensuring a streamlined and consistent approach to data access.
This package is an internal component of the Maestro monorepo designed to manage interactions with data source repositories. It serves as the central interface for retrieving data from various repositories, such as **SONG** and **Lyric**, ensuring a streamlined and consistent approach to data access.

"version": "5.0.0-alpha.1",
"description": "A submodule of Maestro that manages interactions with data source repositories, providing a unified interface for efficient data access and management",
"main": "dist/index.js",
"typings": "dist/index.d.ts",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought this was supposed to be "types"? curious to understand what the difference is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be both. ref: Note that the "typings" field is synonymous with "types", and could be used as well.

@leoraba
Copy link
Contributor Author

leoraba commented Jan 21, 2025

Closing this PR as most of the feeback was resolved, the rest of work will be done on a refactoring PR to reduce complexity.

@leoraba leoraba merged commit 2a6d7cf into M5-revised Jan 21, 2025
0 of 2 checks passed
@leoraba leoraba deleted the indexing_repositories branch January 21, 2025 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WW - Maestro
3 participants