Propose adding a file format for custom pipelines #109
I think this default should be achieved as

ilab data generate --pipeline PATH_TO_THE_SIMPLE_WORKFLOW_YAML

for the purpose of unification. We can provide aliases like simple or full for the built-in YAML files to make users' lives easy.
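As a rough illustration of the alias idea, resolution might look something like this (the BUILTIN_PIPELINES paths and the resolve_pipeline function are hypothetical sketches, not actual ilab code):

```python
# Hypothetical sketch: resolve a --pipeline argument that may be a
# built-in alias ("simple", "full") or a path to a custom YAML file.
from pathlib import Path

# Assumed install locations for the built-in pipeline definitions.
BUILTIN_PIPELINES = {
    "simple": Path("/usr/share/instructlab/pipelines/simple.yaml"),
    "full": Path("/usr/share/instructlab/pipelines/full.yaml"),
}

def resolve_pipeline(arg: str) -> Path:
    """Map a known alias to its built-in YAML; otherwise treat arg as a path."""
    if arg in BUILTIN_PIPELINES:
        return BUILTIN_PIPELINES[arg]
    return Path(arg)
```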
Yeah, simple and full should be thought of as aliases. But it is a bit more complex, and it had occurred to me that we'll need to get into this as part of the CLI integration.

Currently, the aliases look more like this:

So we need to be able to express a (knowledge, freeform_skills, grounded_skills) tuple really ...
Ok, I think this issue warrants a significant change to the design.

In order to specify a pipeline, you need to specify a chain of blocks for knowledge, freeform_skills, and grounded_skills.

I think we can see from the POC that keeping the chains for knowledge, freeform_skills, and grounded_skills in separate files aids clarity.

So I think we need something like this

and a custom pipeline would be a directory with these 3 YAML files.
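One way to express the directory contract is a small validation helper; this is a hypothetical sketch (the function name is illustrative, only the three file names come from this proposal):

```python
# Hypothetical sketch: check that a custom pipeline directory contains
# the three chain definitions proposed in this thread.
from pathlib import Path

EXPECTED_FILES = ("knowledge.yaml", "freeform_skills.yaml", "grounded_skills.yaml")

def missing_pipeline_files(path: str) -> list[str]:
    """Return the expected file names that are absent from the directory."""
    root = Path(path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```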
this could be confusing as to what a pipeline is ... I would rather say there are actually 3 pipelines under simple/full, and we route to them differently in our particular frontend. These semantics should not be enforced.
Right, "full" or "simple" is a set of pipelines. I agree with that.
But likewise, when using custom pipelines, you should be providing a set of custom pipelines
On Slack, we talked about adding the idea of a PipelineRouter, which I guess would have an interface like this:

(hmm, that's the same interface as a Pipeline, which suggests a base class that Pipeline and Router inherit from)

and there would be two variants:

and the use case for this would be if you are developing a pipeline against a known data set?

But the more common use case is covered by this variant that understands how our taxonomy repo is structured:

"full" and "simple" would be an alias for a set of yaml files

(the above is just a sketch, it would evolve quite a bit as you hack on it)
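Reading between the lines of the sketch, the shared interface might look something like the following; every class and method name here is an assumption based on this thread, not the actual SDG API:

```python
# Hypothetical sketch of the PipelineRouter idea: a Pipeline and a Router
# share the same generate() interface, so both inherit a common base class.
from abc import ABC, abstractmethod

class PipelineRunner(ABC):
    """Shared interface for anything that can generate data from a dataset."""
    @abstractmethod
    def generate(self, dataset):
        ...

class Pipeline(PipelineRunner):
    """Runs a chain of blocks (modeled here as plain callables) in order."""
    def __init__(self, chained_blocks):
        self.chained_blocks = chained_blocks

    def generate(self, dataset):
        for block in self.chained_blocks:
            dataset = block(dataset)
        return dataset

class TaxonomyPipelineRouter(PipelineRunner):
    """Picks one of three pipelines based on the taxonomy leaf-node type.

    The `leaf_type` attribute on the dataset is an assumed convention for
    this sketch, not an actual SDG field.
    """
    def __init__(self, knowledge, freeform_skills, grounded_skills):
        self.routes = {
            "knowledge": knowledge,
            "freeform_skills": freeform_skills,
            "grounded_skills": grounded_skills,
        }

    def generate(self, dataset):
        return self.routes[dataset.leaf_type].generate(dataset)
```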
And the point of all of this is that "full" isn't a pipeline; instead, it's some logic that understands the taxonomy repo and looks at the data set before picking from one of 3 pipelines.
OK, I think this is making sense to me.

From the ilab CLI perspective, it only cares about the taxonomy variant. It can still just talk about a "pipeline" as a user concept, but the target is either:

altered syntax is over here: https://github.com/instructlab/dev-docs/pull/109/files#r1674565115
Yeah, the class definitions and methods themselves make sense to me as well.

Abstraction-wise, one thing I've been wondering, though, is whether routing is a thing that Pipeline should be responsible for, or whether a higher-level concept should route pipelines (in our case, it could be the SDG).

Also, as per Mark's words: routing sounds like "an object that has a set of pipelines, and knows which pipeline to use for different input datasets", which also doesn't sound like another pipeline, but rather something that conveys the details of the actual synthetic data generation process (a bunch of pipelines applied to a set of files from the taxonomy, in our case).

PS: open to dealing with this later, as the design above already unblocks us, I believe.
Out of all of this, I think the really important details are:

- ilab data generate --pipeline can take a path to a directory. I updated the doc text to reflect that part already.
- We expect the contents of that directory to have 3 specific files: knowledge.yaml, grounded_skills.yaml, and freeform_skills.yaml. I will add that in a moment.

There are also some really good points about how we can implement this in terms of abstractions. I am comfortable leaving that part out of the doc and seeing how it shakes out in the implementation. I have a feeling the final details will get sorted as we see what it looks like in implementation, knowing that it was inspired by our discussion here. Let me know if anyone feels differently.
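Putting the two important details together, argument resolution for the CLI could be sketched like this (the BUILTIN_DIRS locations and the resolve_pipeline_dir function are hypothetical; only the three required file names come from this thread):

```python
# Hypothetical sketch of how `ilab data generate --pipeline` might resolve
# its argument: a built-in alias names a packaged set of pipeline files,
# while anything else is treated as a custom pipeline directory that must
# contain the three files agreed on in this thread.
from pathlib import Path

BUILTIN_DIRS = {  # assumed install locations, illustrative only
    "simple": Path("/usr/share/instructlab/pipelines/simple"),
    "full": Path("/usr/share/instructlab/pipelines/full"),
}
REQUIRED = ("knowledge.yaml", "grounded_skills.yaml", "freeform_skills.yaml")

def resolve_pipeline_dir(arg: str) -> Path:
    """Return the pipeline directory for an alias or a custom path."""
    root = BUILTIN_DIRS.get(arg, Path(arg))
    missing = [f for f in REQUIRED if not (root / f).is_file()]
    # Built-in aliases are trusted; custom directories are validated.
    if missing and arg not in BUILTIN_DIRS:
        raise ValueError(f"pipeline directory is missing: {missing}")
    return root
```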