
Commit

Merge pull request #100 from BharatSahAIyak/dev
Dev -> Main v0.4.5
sooraj1002 authored Jun 19, 2024
2 parents ead3717 + 6699129 commit f598e6c
Showing 59 changed files with 676 additions and 229 deletions.
113 changes: 69 additions & 44 deletions .github/workflows/test.yaml
@@ -4,56 +4,81 @@ on: [push, pull_request]

jobs:
test:

runs-on: ubuntu-latest

strategy:
matrix:
python-version: ['3.10']

steps:
- uses: actions/checkout@v4

- name: Install poetry
run: pip install poetry

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: 'poetry'

- run: poetry install
- run: poetry show --latest
- run: poetry run pytest --cov --cov-report xml

- name: Coveralls
uses: coverallsapp/github-action@v2
with:
github-token: ${{ secrets.COVERALLS_REPO_TOKEN }}
file: coverage.xml
flag-name: python-${{ matrix.python-version }}

test-mac:

runs-on: macos-latest
strategy:
matrix:
python-version: ['3.10']

steps:
- uses: actions/checkout@v4

- name: Install poetry
run: pip install poetry

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: 'poetry'

- run: poetry install
- run: poetry show --latest
- run: poetry run pytest --cov
- name: Install poetry
run: pip install poetry

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

- name: Cache poetry dependencies
uses: actions/cache@v3
with:
path: ~/.cache/pypoetry
key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
restore-keys: |
${{ runner.os }}-poetry-
- run: poetry install
- run: poetry show --latest
- run: poetry run pytest --cov --cov-report xml

- name: Coveralls
uses: coverallsapp/github-action@v2
with:
github-token: ${{ secrets.COVERALLS_REPO_TOKEN }}
file: coverage.xml
flag-name: python-${{ matrix.python-version }}

# test-mac:
# runs-on: macos-latest
# strategy:
# matrix:
# python-version: ['3.10']

# steps:
# - uses: actions/checkout@v4

# - name: Install Poetry
# run: |
# curl -sSL https://install.python-poetry.org | python -
# echo "$HOME/.local/bin" >> $GITHUB_PATH

# - name: Set up Python ${{ matrix.python-version }}
# uses: actions/setup-python@v4
# with:
# python-version: ${{ matrix.python-version }}
# cache: 'poetry'

# - name: Install Faiss
# run: |
# brew install cmake libomp openblas
# git clone https://github.com/facebookresearch/faiss.git
# cd faiss
# cmake -B build -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release
# make -C build -j
# cd build/faiss/python
# python setup.py install
# shell: bash

# - name: Install Poetry
# run: pip install poetry

# - name: Install Project Dependencies
# run: poetry install

# - name: Show Latest Poetry Packages
# run: poetry show --latest

# - name: Run Tests
# run: poetry run pytest --cov
2 changes: 2 additions & 0 deletions Dockerfile
@@ -6,6 +6,8 @@ RUN pip install poetry==1.6.0 && poetry config virtualenvs.create false

COPY pyproject.toml poetry.lock ./

RUN apt-get update -qq && apt-get install ffmpeg -y

# RUN poetry install

COPY . .
2 changes: 1 addition & 1 deletion autotune/settings.py
@@ -31,7 +31,7 @@
SECRET_KEY = os.getenv("DJANGO_SECRET_KEY")

# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = os.getenv("DEBUG")
DEBUG = True

# ALLOWED_HOSTS = []

70 changes: 70 additions & 0 deletions docs/AUTOTUNE.md
@@ -0,0 +1,70 @@
# INTRODUCTION

## Entities in the system

### WORKFLOWS

Every action taken by a user in autotune is mapped to a workflow. Autotune has two broad functions which are housed in the same place: `Synthetic Data Generation` and `Model Training`. These are built as two separate functions, with interoperability provided by autotune.
Based on this, there are two types of workflows in autotune: `training` and `complete`. A `complete` workflow indicates that the entire process, from data generation to training, is performed in autotune. A `training` workflow can be used to perform a subset of the operations of a complete workflow.
In a training workflow, the user can provide a HuggingFace dataset for training or fine-tuning a model.

Autotune assumes that a given user will have only one workflow for training a given model type, such as `Text Classification`, `Named Entity Recognition`, etc.
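The distinction between the two workflow types can be sketched as follows. This is a hypothetical illustration, not autotune's actual model code: the class and field names (`Workflow`, `model_type`, `needs_data_generation`) are invented for clarity.

```python
from dataclasses import dataclass
from enum import Enum


class WorkflowType(Enum):
    TRAINING = "training"  # user supplies a HuggingFace dataset; only training runs
    COMPLETE = "complete"  # data generation and training both happen in autotune


@dataclass
class Workflow:
    user_id: str
    model_type: str  # e.g. "Text Classification"
    workflow_type: WorkflowType

    def needs_data_generation(self) -> bool:
        # Only complete workflows generate synthetic data before training.
        return self.workflow_type is WorkflowType.COMPLETE


wf = Workflow("user-1", "Text Classification", WorkflowType.TRAINING)
print(wf.needs_data_generation())  # False: a training workflow skips generation
```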

### CONFIG

Configs are re-usable components which provide metadata and other fixed aspects of a workflow.

The config items which can be stored are:

- temperature: OpenAI sampling temperature used in dataset generation.
- system_prompt: system prompt which is passed to the OpenAI API.
- user_prompt_template: a template with replaceable values according to workflow needs.
- schema_example: a sample JSON document which the generated data should follow. Dynamic models of any structure can be created, with validation performed by dynamically created pydantic models.
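A minimal sketch of how such a config could be used: the template is filled with workflow-specific values, and a pydantic model is derived dynamically from the schema example to validate generated records. The config values, field names, and model name here are illustrative, not autotune's actual defaults.

```python
from string import Template

from pydantic import create_model

# An illustrative config following the items listed above.
config = {
    "temperature": 0.7,
    "system_prompt": "You generate labelled training examples.",
    "user_prompt_template": "Generate $count examples about $topic.",
    "schema_example": {"text": "Example sentence", "label": "positive"},
}

# Fill the replaceable values in the user prompt template.
prompt = Template(config["user_prompt_template"]).substitute(count=5, topic="billing")

# Derive a pydantic model dynamically from the schema example: each key becomes
# a required field whose type is inferred from the example value.
fields = {key: (type(value), ...) for key, value in config["schema_example"].items()}
GeneratedRecord = create_model("GeneratedRecord", **fields)

# A generated record is validated against the dynamic model.
record = GeneratedRecord(text="Refund processed", label="positive")
print(prompt)
```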

### TASKS

### TRAINING

## Development Journey

## Models Supported

- Text Classification
- Colbert training
- Force Alignment

# SETUP

## API specifications

There are two versions of the APIs, with the same core functionality across both.

### POST /v1/workflow/config

- REQUEST:

- RESPONSE:

### POST /v1/workflow/create

- REQUEST:

- RESPONSE:

### POST /v1/workflow/iterate/<UUID>

- REQUEST:

- RESPONSE:

### POST /v1/workflow/generate/<UUID>

- REQUEST:

- RESPONSE:

### POST /v1/workflow/status/<WORKFLOW_ID>

- REQUEST:

- RESPONSE:
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion 01old/tasks/train.py → old/tasks/train.py
2 changes: 1 addition & 1 deletion 01old/tasks/train.py → old/tasks/train.py
@@ -6,7 +6,7 @@
from datasets import load_dataset
from huggingface_hub import HfApi, login

from utils import CeleryProgressCallback, get_task_class
from old.utils import CeleryProgressCallback, get_task_class


def train_model(celery, req, api_key):
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -6,8 +6,8 @@

import pytest

from models import GenerationAndCommitRequest
from tasks.data_fetcher import DataFetcher
from old.models import GenerationAndCommitRequest
from old.tasks.data_fetcher import DataFetcher

from .fixtures import REDIS_DATA

@@ -78,7 +78,7 @@ async def test_initialization_from_redis():

# We only care about the data key here
mock_redis.hgetall = AsyncMock(return_value={"data": "[]"})
with patch("utils.get_data", mock_get_data):
with patch("old.utils.get_data", mock_get_data):
fetcher = DataFetcher(
GENERATION_AND_COMMIT_REQUEST, "openai_key", mock_redis, "task_id"
)
@@ -103,7 +103,7 @@ async def test_fetch_and_update():
mock_redis.hset = AsyncMock()
mock_redis.hgetall = AsyncMock(return_value={})

with patch("utils.get_data", mock_get_data):
with patch("old.utils.get_data", mock_get_data):
fetcher = DataFetcher(
GENERATION_AND_COMMIT_REQUEST, "openai_key", mock_redis, "task_id"
)
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
51 changes: 38 additions & 13 deletions workflow/align_tasks.py
@@ -10,27 +10,52 @@
from django_pandas.io import read_frame
from huggingface_hub import CommitOperationAdd, HfApi, login
from transformers import TrainerCallback
from celery.utils.log import get_task_logger

from workflow.models import Task
from workflow.force_alignment.alignment import ForceAligner

logger=get_task_logger(__name__)


@shared_task(bind=True, max_retries=settings.CELERY_MAX_RETRIES, retry_backoff=True)
def align_task(self,req_data):

logger.info('Starting align_task with request_data: %s', req_data)

task_id=self.request.id
task=Task.objects.get(id=task_id)
task.status="ALIGNING"
try:
task=Task.objects.get(id=task_id)
task.status="ALIGNING"
task.save()

logger.info('Task %s status set to ALIGNING', task_id)

dataset=req_data["dataset"]
if "time_duration" in req_data:
time_duration=req_data["time_duration"]
else:
time_duration=None

api_key=settings.HUGGING_FACE_TOKEN
alignment_object=ForceAligner()

logger.info('Aligning dataset')
alignment_object.align_dataset(dataset,alignment_duration=time_duration)

logger.info('Pushing aligned audios to hugging-face at path: %s',req_data["save_path"])

task.status="PUSHING"
task.save()
alignment_object.push_to_hub(req_data["save_path"],api_key)

logger.info('Task %s status set to PUSHING',task_id)

except Exception as e:
logger.error('An error occurred: %s', str(e))

task.status='COMPLETE'
task.save()
dataset=req_data["dataset"]
if "time_duration" in req_data:
time_duration=req_data["time_duration"]
else:
time_duration=None

api_key=settings.HUGGING_FACE_TOKEN
alignment_object=ForceAligner()
alignment_object.align_dataset(dataset,alignment_duration=time_duration)
alignment_object.push_to_hub(req_data["save_path"],api_key)
task.status="PUSHING"
task.save()


3 changes: 3 additions & 0 deletions workflow/models.py
@@ -20,6 +20,9 @@ def default_split():


LLM_MODELS = [
"gpt-4-turbo",
"gpt-4-turbo-preview",
"gpt-4o",
"gpt-4-0125-preview",
"gpt-4-1106-preview",
"gpt-4-vision-preview",
