-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to automatically generate Kitfiles #661
Conversation
df6f11c
to
f2011c2
Compare
if err != nil { | ||
return "", fmt.Errorf("failed to read input: %w", err) | ||
} | ||
return strings.TrimSpace(string(bytes)), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be trimming? What if the prompt is intended to have spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strings.TrimSpace only trims leading and trailing whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right... but I was thinking is there a case where we want to keep the leading/trailing spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't imagine one, and not trimming likely leads to weirder behaviour than otherwise. This code was originally used for handling login (username and password) where we definitely want to trim white-space, since the read input includes the newline delimiter.
opts.configHome = configHome | ||
opts.path = args[0] | ||
|
||
if opts.modelkitName == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we still prompt for if name or description has been passed as options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll only prompt if we didn't get a name/description (and if the session is interactive). If you specify --name
or --desc
we don't prompt (unless you somehow pass in an empty name/description).
// Generate a basic Kitfile by looking at the contents of a directory. Parameter | ||
// packageOpt can be used to define metadata for the Kitfile (i.e. the package | ||
// section), which is left empty if the parameter is nil. | ||
func GenerateKitfile(baseDir string, packageOpt *artifact.Package) (*artifact.KitFile, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to look into the sub directories as well. If somebody organizes their files under data/
for instance data/training.csv
it should be detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was hoping to defer that to a future improvement since it introduces a fair bit of complexity in the general case.
- If we're trying to grab specific files (e.g. a dataset layer for
path: data/training.csv
, then we need to decide what to do with the rest of the directory- If that's the only file, we don't need to include the directory itself, since it'll be included in the dataset layer, which means signalling another condition back to the caller
- If there are other files, we need to either classify them or include the whole directory as code (?) layer
- Then we also need to handle the case where we've got a catch-all
code
section
- If we want to include the whole directory (i.e.
path: data/
) then we need to figure out a heuristic for deciding what type the whole directory is (most common file type? I'm not sure here) - If a
data/
dir has 20 .csv files, do we include a singledata/
layer, or generate a Kitfile with 20 dataset layers (one for each file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I've added a (messy) implementation of this in commit adf3372. For e.g. https://huggingface.co/onnx-community/dpt-dinov2-small-kitti/tree/main it generates the following Kitfile:
manifestVersion: 1.0.0
model:
name: onnx/model
path: onnx/model.onnx
parts:
- path: onnx/model_bnb4.onnx
- path: onnx/model_fp16.onnx
- path: onnx/model_int8.onnx
- path: onnx/model_q4.onnx
- path: onnx/model_q4f16.onnx
- path: onnx/model_quantized.onnx
- path: onnx/model_uint8.onnx
- path: config.json
- path: preprocessor_config.json
- path: quantize_config.json
code:
- path: .
docs:
- path: README.md
description: Readme file
where the code
section includes the .gitattributes
file and the .git
directory (since I cloned it locally)
It may be worth linking to this list here. https://github.com/trailofbits/ml-file-formats?tab=readme-ov-file |
I glanced over that list at one point, but I'm not sure how valuable it is in general. I can add it but it doesn't feel like an authoritative source for this sort of thing. |
f2011c2
to
51c8638
Compare
Try to attach the detected license type to a Kitfile layer rather than the overall package section -- i.e. if we have a model, assume the license applies to the model, otherwise, stick it with a dataset or code section. Fallback to package if nothing else makes sense.
…se.txt When deciding if a file is a license file, include all files with `license` as the prefix, to ensure we catch e.g. LICENSE.txt.
Add .joblib as model-type suffix, and move .csv to datasets instead of metadata
Read the contents of any directories and attempt to determine their type for the purposes of Kitfile generation; if a directory contains multiple files but e.g. they are all 'dataset'-type files, add the directory as a dataset layer. This is not a very pretty implementation, but hopefully covers a decent number of common cases.
bacde8a
to
adf3372
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I try this on
/scikitlearn-tabular
├── Kitfile
├── data
│ └── online_boutique_v1.csv
── models
│ └── model.onnx
└── notebooks
└── ML_MLFlow.ipynb
I am getting the following. I was expecting the model.onnx to be added at least.
manifestVersion: 1.0.0
package:
name: test
description: this
code:
- path: .
output.Logf(output.LogLevelWarn, "Unable to determine license type") | ||
} | ||
output.Logf(output.LogLevelTrace, "Detected license %s for license file", detectedLicenseType) | ||
detectedLicenseType = licenseType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line should be before the logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, good catch! Fixed.
@gorkem Hm, are you sure you've fetched the latest changes? I rebased on main as part of it so you would need to I tested with
(files are just basic text files containing "test") and I see the Kitfile manifestVersion: 1.0.0
model:
name: model
path: models/model.onnx
code:
- path: notebooks
datasets:
- path: data |
7fbd3d6
to
3dee98c
Compare
Figured it out -- I was mixing up dirEntry.Name() and the full path to the directory. If you ran the init command from the directory you want to generate a Kitfile in it would work because it's looking for directories in the current working dir. This fails if kit is run from a different directory (I had tried to test this using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs tests
When reading the contents of directories while generating Kitfiles, make sure we use the full path to open the directory.
3dee98c
to
8879bfd
Compare
Description
Add
kit init
command, that can be used to generate a basic Kitfile from the contents of a directory. It basically does this by looking at files and trying to guess what they might be, based on file extensions.I've tested it against a few hand-written Kitfiles and (modulo user-provided metadata) it does an okay job :)
Linked issues
Closes #124
Related: #564