Model and training management #62

Open
2 of 15 tasks
sooraj1002 opened this issue May 21, 2024 · 6 comments

@sooraj1002
Collaborator

sooraj1002 commented May 21, 2024

  • Migrate training API from older code
  • Factual Data (Dataset) vs Model Training Dataset - Admin panel should transform
  • Difference in a dataset (like a git diff); it doesn't need to be shown, and will mostly just be additions of data items
  • Dataset Registry => Huggingface Dataset Management | Save delta for the day => Commit End of Day |
    • Admin Panel Dataset Changes Collection | POST Request to a Dataset Registry
    • Admin Panel
      • Linkages to which huggingface dataset and which row(s) was utilized in a query (if relevant)
      • Telemetry should include the above information
  • Trigger training pipeline
    • Admin => Dataset Update => Classifier | NER | Spell Check (@sooraj1002) | Samagra Models
  • Update datasets on HF (do it once; store data in a cache until the update takes place)
  • Schedule training at specific times
  • Support a different type of workflow that can be used for just training a model (so that an example and other things don't need to be created)
  • Can define workflow dataset on workflow
  • Can define a dataset for workflow
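
A minimal sketch of the "save delta for the day, commit at end of day" idea from the list above, assuming rows are only ever added. `DeltaCache` and its `push` callback are hypothetical names used for illustration; the real registry call (for example a POST to the Dataset Registry, or an upload to HF) would be supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeltaCache:
    """Buffers new dataset rows until a single end-of-day commit."""

    # Hypothetical callback that sends rows to the dataset registry.
    push: Callable[[str, list], None]
    pending: dict = field(default_factory=dict)

    def add(self, dataset_id: str, row: dict) -> None:
        """Record one new row for a dataset without pushing it yet."""
        self.pending.setdefault(dataset_id, []).append(row)

    def commit(self) -> dict:
        """Push each dataset's delta once and return rows pushed per dataset."""
        counts = {}
        for dataset_id in list(self.pending):
            rows = self.pending[dataset_id]
            self.push(dataset_id, rows)
            # Drop the delta only after a successful push, so a failure
            # leaves the remaining rows in the cache for a retry.
            del self.pending[dataset_id]
            counts[dataset_id] = len(rows)
        return counts
```

A scheduler (cron or similar) could call `commit()` once per day, so each dataset is updated once rather than on every admin-panel change.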

https://app.diagrams.net/#G11tk9s4YZBIvWqAmBB6_8pvo0dppWTGbi#%7B%22pageId%22%3A%22uskC_wnftH2uWe6gHLMZ%22%7D

label studio playground

@sooraj1002 sooraj1002 self-assigned this May 21, 2024
@sooraj1002 sooraj1002 changed the title Label studio requirements Model and training management May 21, 2024
@KDwevedi
Contributor

@sooraj1002
Collaborator Author

sooraj1002 commented May 27, 2024

  • receive the doc (pdf) from the user - @xorsuyash
  • give that to autotune - @xorsuyash
  • Document service will chunk the pdf and send back the chunks and json - @sooraj1002 to share the API to @xorsuyash and give handover of how to use
  • you will create a list of prompts, one per chunk - @xorsuyash
    prompt 1: 'create 2 questions from this chunk : {chunk1}'
    prompt 2: 'create 2 questions from this chunk : {chunk2}'
  • send the list of prompts to auto-tune, which will return a json/csv of question-answer pairs; create train/eval/test splits as specified by the user - @sooraj1002 will hand over to @xorsuyash
  • measure the current retrieval accuracy of the 'pre-trained' model by retrieving on the generated questions, marking each question's source chunk as the chunk to be retrieved - @xorsuyash
  • you'll create triplets of {q, positive_doc, negative_doc} for each question using the chunks and the q-a dataset - @TakshPanchal will hand over to @xorsuyash
  • this dataset will be uploaded to HF and shared from there - @xorsuyash
  • Embedding training integration with auto-tune: the HF dataset link will be shared with autotune along with the pretrained model. Autotune will fine-tune the model and upload the fine-tuned model to HF (the upload must update an existing repo of fine-tuned models).
    the user should again pass train/test/eval splits as part of the Trainer config to autotrain; by default autotrain will pick up the 'train', 'test', and 'eval' datasets, and it will raise a flag if the 'eval' or 'test' split is missing.
    the Trainer class has an 'eval' split argument, which overrides other datasets (copy HF's Trainer class).
    the commit history of each model update should record the dataset used to train, evaluation logs, timestamp, etc. The repo commits should be versioned so that older models can be picked up and set as latest if necessary. @xorsuyash and @TakshPanchal
  • schedule train @TakshPanchal
  • my trained model should not be pushed on HF as the 'latest model - model to be picked up' unless the eval results are validated by user @TakshPanchal
  • Autotune should allow me to update an existing dataset with another HF dataset by passing the 2 HF dataset links @sooraj1002
  • the fine-tuned model's retrieval accuracy should also be measured @xorsuyash
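
One way the retrieval-accuracy measurement in the list above might look, run once for the pre-trained model and again for the fine-tuned one. This is a sketch under assumptions: embeddings are plain lists of floats produced elsewhere, similarity is cosine, and "accuracy" is taken to mean the fraction of questions whose source chunk appears in the top `k` results. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_accuracy(query_vecs, chunk_vecs, gold_chunk_ids, k=1):
    """Fraction of questions whose gold chunk is among the k most similar chunks."""
    if not gold_chunk_ids:
        return 0.0
    hits = 0
    for query, gold in zip(query_vecs, gold_chunk_ids):
        # Rank chunk indices from most to least similar to the question.
        ranked = sorted(
            range(len(chunk_vecs)),
            key=lambda i: cosine(query, chunk_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_chunk_ids)
```

Using the same questions and gold chunk ids for both models keeps the before/after numbers comparable.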

@sooraj1002
Collaborator Author

  • Add API for embedding models - Calls PDF parser -> Create the negative questions to create triplet -> fine-tune a model (this is for sentence similarity - optionally a document) @xorsuyash
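
The chunk-to-triplet portion of that flow could be sketched roughly as follows. The prompt template comes from the earlier comment; everything else is an assumption for illustration: `build_prompts` and `build_triplets` are hypothetical names, each q-a pair is assumed to carry the index of its source chunk as `chunk_id`, and the negative document is simply a randomly chosen other chunk (the thread does not say how negatives should be picked).

```python
import random

# Prompt wording taken from the earlier comment in this thread.
PROMPT_TEMPLATE = "create 2 questions from this chunk : {chunk}"


def build_prompts(chunks):
    """Return one question-generation prompt per chunk."""
    return [PROMPT_TEMPLATE.format(chunk=chunk) for chunk in chunks]


def build_triplets(qa_pairs, chunks, seed=0):
    """Build {q, positive_doc, negative_doc} triplets.

    Assumes each pair is a dict with 'question' and 'chunk_id' keys, where
    'chunk_id' indexes the chunk the question was generated from.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    triplets = []
    for pair in qa_pairs:
        positive_id = pair["chunk_id"]
        negative_ids = [i for i in range(len(chunks)) if i != positive_id]
        if not negative_ids:
            continue  # a single-chunk document has no negative to offer
        negative_id = rng.choice(negative_ids)
        triplets.append(
            {
                "q": pair["question"],
                "positive_doc": chunks[positive_id],
                "negative_doc": chunks[negative_id],
            }
        )
    return triplets
```

Random negatives are the simplest choice; harder negatives (for example, chunks the pre-trained model ranks highly but that are not the source chunk) could be swapped in later without changing the triplet shape.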

@KDwevedi
Contributor

KDwevedi commented May 28, 2024

Let's figure out what data is coming for which models @KDwevedi

@KDwevedi
Contributor

KDwevedi commented May 29, 2024

Gautam

  • Dataset given Model | Task

Sooraj + Kanav

  • Registry for Model
  • Registry for Task
  • Registry for Dataset
  • Dataset <> Model | Task Mapping
    • Table Def
    • JSON Schema
  • Dataset & (Model | Task) <> Label Studio Interface Mapping
  • POST API call taking LS output and mapping that to dataset update/add/delete in autotune
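
A rough sketch of what the handler behind that POST call might do once the LS output has been mapped to dataset operations. The change format below (`action`, `id`, `data`) is invented for illustration and is not Label Studio's actual export format; `apply_changes` is likewise a hypothetical name, and the dataset is modelled as a plain dict of rows keyed by id.

```python
def apply_changes(rows, changes):
    """Apply add/update/delete changes to a dataset and return a new copy.

    rows:    dict mapping row id -> row dict
    changes: list of dicts with 'action', 'id' and (for add/update) 'data'
             (hypothetical shape, not Label Studio's own schema)
    """
    updated = {row_id: dict(row) for row_id, row in rows.items()}
    for change in changes:
        action = change["action"]
        row_id = change["id"]
        if action == "add":
            if row_id in updated:
                raise ValueError(f"row already exists: {row_id}")
            updated[row_id] = dict(change["data"])
        elif action == "update":
            if row_id not in updated:
                raise KeyError(row_id)
            updated[row_id].update(change["data"])
        elif action == "delete":
            updated.pop(row_id, None)  # deleting a missing row is a no-op
        else:
            raise ValueError(f"unknown action: {action}")
    return updated
```

Returning a copy rather than mutating in place makes it easy to compute the delta between the old and new dataset before anything is pushed.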

Model Integration

Dhruv

  • Making sure all datapoints requested are actually available from telemetry, dev if not

Karan/Shreyansh

  • Dataset & (Model | Task) <> Label Studio Interface Mapping: V1 API Interface

@KDwevedi
Contributor

KDwevedi commented May 29, 2024

Docs in #69
