
Contributing to Data Recipes AI

Hi! Thanks for your interest in contributing to Data Recipes AI; we're really excited to see you! This document summarizes everything you need to know to do a good job.

New contributor guide

To get an overview of the project, please read the README and our Code of Conduct to keep our community approachable and respectable.

Getting started

Creating Issues

If you spot a problem, first search whether an issue already exists. If a related issue doesn't exist, you can open a new one, selecting an appropriate issue type.

As a general rule, we don’t assign issues to anyone. If you find an issue to work on, you are welcome to open a PR with a fix.

Making Code changes

Setting up a Development Environment

To set up your local development environment for contributing, follow the steps in the paragraphs below.

The easiest way to develop is to run in the Docker environment; see the README for more details.

Resetting your environment

If running locally, you can reset your environment (removing all data in your databases, which means re-registration) by running ./cleanup.sh.

Code quality tests

The repo has been set up with black and flake8 pre-commit hooks. These are configured in the `.pre-commit-config.yaml` file and initialized with `pre-commit autoupdate`.

On a new repo, you must run pre-commit install to add pre-commit hooks.

To run code quality tests, you can run pre-commit run --all-files
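For reference, a minimal `.pre-commit-config.yaml` with black and flake8 hooks typically looks like the sketch below; the repo's actual file (and the pinned `rev` versions, which `pre-commit autoupdate` manages) may differ:

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2      # version pins are illustrative; autoupdate keeps them current
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
```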

A GitHub action runs the pre-commit tests to ensure code adheres to standards. See the .github/workflows folder for more details.

Tests

Unit tests

You should write tests for every feature you add or bug you solve in the code. Having automated tests for every line of our code lets us make big changes without worry: there will always be tests to verify whether the changes introduced bugs or broke existing features. Without tests we are blind, and every change comes with the fear of possibly breaking something.

For a better design of your code, we recommend using a technique called test-driven development, where you write your tests before writing the actual code that implements the desired feature.

You can use pytest to run your tests, no matter which type of test it is.
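As a minimal sketch of the kind of unit test pytest picks up (the function and file names here are illustrative, not from the repo):

```python
# test_slugify.py -- illustrative example; run with: pytest test_slugify.py

def slugify(name: str) -> str:
    """Normalize a recipe name into a lowercase, hyphen-separated slug."""
    return "-".join(name.lower().split())

def test_slugify_collapses_whitespace():
    assert slugify("Population  by Region") == "population-by-region"

def test_slugify_handles_single_word():
    assert slugify("Rainfall") == "rainfall"
```

pytest discovers any `test_*.py` file and runs every `test_*` function in it, so tests written this way need no extra registration.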

End-to-end tests (using Selenium and Promptflow)

End-to-end tests have been configured in GitHub Actions. They use promptflow to call a wrapper around the chainlit UI, in order to test when memories/recipes are used as well as when the assistant does some on-the-fly analysis. To do this, the chainlit class is patched heavily, and there are limitations in how cleanly this could be done, so it isn't an exact replica of the true application, but it does capture changes to the flow as well as testing the assistant directly. The main body of integration tests test the recipes server and the assistant independently.

Additionally, there were some limitations when implementing in GitHub Actions, where workarounds were put in place until a later date, namely: promptflow is run on the GitHub Actions host rather than in Docker, and the promptflow wrapper that calls chainlit has to run as a script and kill itself based on a STDOUT string. These should be fixed in future.

Code for e2e tests can be found in flows/chainlit-ui-evaluation, as run by .github/workflows/e2e_tests.yml.

The tests work using promptflow evaluation and a call to an LLM to gauge groundedness, because LLM assistants can produce slightly different results when not answering from memory/recipes. The promptflow evaluation test data can be found in flows/chainlit-ui-evaluation/data.jsonl.
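Conceptually, the groundedness check asks a scoring LLM to rate how well an answer is supported by reference context. A simplified sketch of building such a prompt (this is not the repo's actual promptflow node, just an illustration of the idea):

```python
def groundedness_prompt(context: str, answer: str) -> str:
    """Build the prompt sent to the scoring LLM (simplified sketch)."""
    return (
        "Rate from 1 to 5 how well the ANSWER is grounded in the CONTEXT.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"ANSWER:\n{answer}\n\n"
        "Score:"
    )

# Example: score an assistant answer against the reference data it drew on
prompt = groundedness_prompt(
    "Kenya's population is approximately 54 million.",
    "Kenya has about 54 million people.",
)
```

Scoring against context, rather than string-matching an expected answer, is what makes the test robust to the assistant phrasing things slightly differently on each run.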

See "Running Promptflow evaluation locally" below for how to run e2e tests locally.

Running Promptflow evaluation locally

First, you will need to build the environment to include Prompt Flow ...

docker compose -f docker-compose.yml -f docker-compose-dev.yml up -d --build

Then ...

  1. Install the DevContainers VSCode extension

  2. Build data recipes using the docker compose command mentioned above

  3. Open the command palette in VSCode (CMD + Shift + P on Mac; CTRL + Shift + P on Windows) and select

    Dev Containers: Attach to Running Container.

    Select the promptflow container. This opens a new VSCode window - use it for the next steps.

  4. It should happen automatically, but if not, install the Promptflow add-in

  5. Open folder /app

  6. Click on flow.dag.yaml

  7. Top left of main pane, click on 'Visual editor'

    • If you are taken to the promptflow 'Install dependencies' screen, change the Python runtime to /azureml-envs/prompt-flow/runtime/bin/python, then close and re-open flow.dag.yaml
  8. On the Groundedness node, select your new connection

  9. You can now run by clicking the play icon. See the Promptflow documentation for more details

Changing between Azure OpenAI <> OpenAI

As noted in the README, the repo supports assistants on OpenAI or Azure OpenAI. The README has instructions on how to change this in the .env file (remember to change ASSISTANT_ID as well as the API settings), but you will also have to change the connection in the promptflow Groundedness node accordingly.

GitHub Workflow

Like many other open source projects, we use the well-known gitflow to manage our branches.

Summary of our git branching model:

  • Get all the latest work from the upstream repository (git checkout main && git pull)
  • Create a new branch off main with a descriptive name (for example: feature/new-test-macro, bugfix/bug-when-uploading-results). You can do it with git checkout -b <branch name>
  • Make your changes and commit them locally (git add <changed files>, git commit -m "Add some change" <changed files>). Whenever you commit, the self-tests and code quality checks will kick in; fix anything that gets broken
  • Push to your branch on GitHub (with the same name as your local branch: git push origin <branch name>). This will output a URL for creating a Pull Request (PR)
  • Create a pull request by opening the URL in a browser. You can also create PRs in the GitHub interface, choosing your branch to merge into main
  • Wait for comments and respond as needed
  • Once PR review is complete, your code will be merged. Thanks!!

Tips

  • Write helpful commit messages
  • Anything in your branch must have no failing tests. You can check by looking at your PR online in GitHub
  • Never use git add .: it can add unwanted files;
  • Avoid using git commit -a unless you know what you're doing;
  • Check every change with git diff before adding them to the index (stage area) and with git diff --cached before committing;
  • If you have push access to the main repository, please do not commit directly to dev: your access should be used only to accept pull requests. If you want to make a new feature, use the same process as other developers so your code will be reviewed.

Code Guidelines

  • Use PEP8;
  • Write tests for your new features (please see the "Tests" topic above);
  • Always remember that commented code is dead code;
  • Name identifiers (variables, classes, functions, module names) with readable names (x is always wrong);
  • When manipulating strings, we prefer either f-string formatting (f'{a} = {b}') or new-style formatting ('{} = {}'.format(a, b)), instead of the old-style formatting ('%s = %s' % (a, b));
  • You will know if any test breaks when you commit, and the tests will be run again in the continuous integration pipeline (see below);
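To illustrate the string-formatting preference above, all three styles produce the same result, but the first two are the ones we want in new code:

```python
a, b = "x", 1

f_string = f"{a} = {b}"               # preferred: f-string formatting
new_style = "{} = {}".format(a, b)    # also fine: new-style formatting
old_style = "%s = %s" % (a, b)        # discouraged: old-style formatting

assert f_string == new_style == old_style == "x = 1"
```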

Demo Data

The quick start instructions and self-tests require demo data in the data db. This can be downloaded from Google Drive.

Uploading new demo data

To upload new demo data ...

  1. Run the ingestion (see main README)
  2. In the data directory, tar -cvf datadb-<DATE>.tar ./datadb then gzip datadb-<DATE>.tar
  3. Upload file to this folder
  4. Edit data/download_demo_data.py to use file URL

Downloading demo data

To download demo data ...

  1. docker compose stop datadb
  2. cd data && python3 download_demo_data.py && cd ..
  3. docker compose start datadb


Adding new Data sources

Open API (not OpenAI!) data sources

As mentioned in the main README, the assistant can be used with OpenAPI-standard APIs, such as the included HDX API. To add another, extend the configuration in ingestion/ingest.config. The ingestion script will process this data and import it into the Data Recipes AI database. This works for simple APIs with relatively low data volumes, and may need some adjustment depending on the complexity of the API.

API interaction without ingestion

Some APIs are too extensive to ingest. These can be defined as tools (functions) for the assistant, which can query the API on request to get data. See assistants/recipes_agents/create_update_assistant.py, which already has a couple of functions that integrate with APIs and which you could extend for new data sources.
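As a rough sketch of what such a tool function looks like (the endpoint URL, parameter names, and function names below are hypothetical; see assistants/recipes_agents/create_update_assistant.py for the repo's real examples):

```python
# Illustrative sketch of an assistant tool (function) that queries an API
# on demand instead of ingesting its data. The endpoint and parameters
# are invented for this example.
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, params: dict) -> str:
    """Build the request URL from a base endpoint and query parameters."""
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def get_population(country_code: str,
                   base_url: str = "https://example.org/api/population") -> dict:
    """Tool the assistant can call to fetch population data for a country."""
    url = build_query_url(base_url, {"country": country_code, "format": "json"})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode())
```

The function is then registered with the assistant (via its tool/function definitions) so the LLM can call it with the right arguments when a user asks a question the API can answer.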

Files for the assistant

As mentioned in the main README, the assistant can be provided your data in the form of data files (eg CSV, Excel) and documents (eg PDF and Word). These are available to the assistant for all interactions. Additionally, users can upload files during conversation. In both cases, analysis is done by the LLM assistant and should be treated with caution.

Deployment

We will add more details here soon; for now, here are some notes on Azure ...

Deploying to Azure

A deployment script ./deployment/deploy_azure.py is provided to deploy to an Azure Multicontainer web app you have set up with these instructions. The script is run from the top directory. Note: this is for demo purposes only, as Multicontainer web apps are still in Public Preview.

To run the deployment ...

python3 ./deployment/deploy_azure.py

One caveat on Azure deploys: sometimes the release doesn't get pushed to the web app until a user tries to access the web app's published URL. We don't know why, but if your release seems 'stuck', try this.

Note:

  • ./deployment/docker-compose-azure.yml is the configuration used in the deployment center screen on the web app
  • ./deployment/docker-compose-deploy.yml is the configuration used when building the deployment
  • docker-compose.yml is used for building locally

⚠️ This is very much a work in progress; deployment will be automated with fewer compose files soon.

You will need to set key environment variables, see your local .env for examples.