Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to automatically generate Kitfiles #661

Merged
merged 14 commits into from
Jan 10, 2025

Conversation

amisevsk
Copy link
Contributor

Description

Add kit init command, that can be used to generate a basic Kitfile from the contents of a directory. It basically does this by looking at files and trying to guess what they might be, based on file extensions.

I've tested it against a few hand-written Kitfiles and (modulo user-provided metadata) it does an okay job :)

Linked issues

Closes #124
Related: #564

if err != nil {
return "", fmt.Errorf("failed to read input: %w", err)
}
return strings.TrimSpace(string(bytes)), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be trimming? What if the prompt is intended to have spaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.TrimSpace only trims leading and trailing whitespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right... but I was thinking is there a case where we want to keep the leading/trailing spaces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine one, and not trimming likely leads to weirder behaviour than otherwise. This code was originally used for handling login (username and password) where we definitely want to trim white-space, since the read input includes the newline delimiter.

opts.configHome = configHome
opts.path = args[0]

if opts.modelkitName == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still prompt for if name or description has been passed as options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll only prompt if we didn't get a name/description (and if the session is interactive). If you specify --name or --desc we don't prompt (unless you somehow pass in an empty name/description).

// Generate a basic Kitfile by looking at the contents of a directory. Parameter
// packageOpt can be used to define metadata for the Kitfile (i.e. the package
// section), which is left empty if the parameter is nil.
func GenerateKitfile(baseDir string, packageOpt *artifact.Package) (*artifact.KitFile, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to look into the sub directories as well. If somebody organizes their files under data/ for instance data/training.csv it should be detected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was hoping to defer that to a future improvement since it introduces a fair bit of complexity in the general case.

  • If we're trying to grab specific files (e.g. a dataset layer for path: data/training.csv, then we need to decide what to do with the rest of the directory
    • If that's the only file, we don't need to include the directory itself, since it'll be included in the dataset layer, which means signalling another condition back to the caller
    • If there are other files, we need to either classify them or include the whole directory as code (?) layer
    • Then we also need to handle the case where we've got a catch-all code section
  • If we want to include the whole directory (i.e. path: data/) then we need to figure out a heuristic for deciding what type the whole directory is (most common file type? I'm not sure here)
  • If a data/ dir has 20 .csv files, do we include a single data/ layer, or generate a Kitfile with 20 dataset layers (one for each file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've added a (messy) implementation of this in commit adf3372. For e.g. https://huggingface.co/onnx-community/dpt-dinov2-small-kitti/tree/main it generates the following Kitfile:

manifestVersion: 1.0.0
model:
  name: onnx/model
  path: onnx/model.onnx
  parts:
    - path: onnx/model_bnb4.onnx
    - path: onnx/model_fp16.onnx
    - path: onnx/model_int8.onnx
    - path: onnx/model_q4.onnx
    - path: onnx/model_q4f16.onnx
    - path: onnx/model_quantized.onnx
    - path: onnx/model_uint8.onnx
    - path: config.json
    - path: preprocessor_config.json
    - path: quantize_config.json
code:
  - path: .
docs:
  - path: README.md
    description: Readme file

where the code section includes the .gitattributes file and the .git directory (since I cloned it locally)

@gorkem
Copy link
Contributor

gorkem commented Dec 16, 2024

It may be worth linking to this list here. https://github.com/trailofbits/ml-file-formats?tab=readme-ov-file

@amisevsk
Copy link
Contributor Author

It may be worth linking to this list here. https://github.com/trailofbits/ml-file-formats?tab=readme-ov-file

I glanced over that list at one point, but I'm not sure how valuable it is in general. I can add it but it doesn't feel like an authoritative source for this sort of thing.

Try to attach the detected license type to a Kitfile layer rather than
the overall package section -- i.e. if we have a model, assume the
license applies to the model, otherwise, stick it with a dataset or code
section. Fallback to package if nothing else makes sense.
…se.txt

When deciding if a file is a license file, include all files with
`license` as the prefix, to ensure we catch e.g. LICENSE.txt.
Add .joblib as model-type suffix, and move .csv to datasets instead of
metadata
Read the contents of any directories and attempt to determine their type
for the purposes of Kitfile generation; if a directory contains multiple
files but e.g. they are all 'dataset'-type files, add the directory as a
dataset layer.

This is not a very pretty implementation, but hopefully covers a decent
number of common cases.
@amisevsk amisevsk requested a review from gorkem December 18, 2024 19:45
Copy link
Contributor

@gorkem gorkem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try this on

/scikitlearn-tabular
├── Kitfile
├── data
 │   └── online_boutique_v1.csv
 ── models
 │   └── model.onnx
└── notebooks
    └── ML_MLFlow.ipynb

I am getting the following. I was expecting the model.onnx to be added at least.

manifestVersion: 1.0.0
package:
  name: test
  description: this
code:
  - path: .

output.Logf(output.LogLevelWarn, "Unable to determine license type")
}
output.Logf(output.LogLevelTrace, "Detected license %s for license file", detectedLicenseType)
detectedLicenseType = licenseType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be before the logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, good catch! Fixed.

@amisevsk
Copy link
Contributor Author

@gorkem Hm, are you sure you've fetched the latest changes? I rebased on main as part of it so you would need to git reset --hard ... to see current state (or delete the branch locally and use gh pr checkout again).

I tested with

.
├── data
│   └── online_boutique_v1.csv
├── models
│   └── model.onnx
└── notebooks
    └── ML_MLFlow.ipynb

(files are just basic text files containing "test") and I see the Kitfile

manifestVersion: 1.0.0
model:
  name: model
  path: models/model.onnx
code:
  - path: notebooks
datasets:
  - path: data

@amisevsk
Copy link
Contributor Author

Figured it out -- I was mixing up dirEntry.Name() and the full path to the directory. If you ran the init command from the directory you want to generate a Kitfile in it would work because it's looking for directories in the current working dir. This fails if kit is run from a different directory (I had tried to test this using kit init $(pwd) but that obviously misses this bug).

Copy link
Contributor

@gorkem gorkem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs tests

When reading the contents of directories while generating Kitfiles, make
sure we use the full path to open the directory.
@amisevsk amisevsk merged commit c381b40 into jozu-ai:main Jan 10, 2025
3 checks passed
@amisevsk amisevsk deleted the kitfile-generation branch January 10, 2025 03:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

init command to generate a kitfile
2 participants