Skip to content

A Data-Centric library providing a unified interface for state-of-the-art methods for hardness characterisation of data points.

License

Notifications You must be signed in to change notification settings

vanderschaarlab/Datagnosis

Repository files navigation

Datagnosis

A Data-Centric AI library for measuring hardness categorization.

Features:

Please note: datagnosis does not handle missing data and so these values must be imputed first HyperImpute can be used to do this.

πŸš€ Installation

The library can be installed from PyPI using

$ pip install datagnosis

or from source, using

$ pip install .

Other library extensions:

  • Install the library with unit-testing support
 pip install datagnosis[testing]

πŸ’₯ Sample Usage

# Load iris dataset from sklearn and create DataHandler object
from sklearn.datasets import load_iris
from datagnosis.plugins.core.datahandler import DataHandler
X, y = load_iris(return_X_y=True, as_frame=True)
datahander = DataHandler(X, y, batch_size=32)

# Create model an parameters
from datagnosis.plugins.core.models.simple_mlp import SimpleMLP
import torch

model = SimpleMLP()

# creating our optimizer and loss function object
learning_rate = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)


# Get a plugin and fit it
hcm = Plugins().get(
    "vog",
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr=learning_rate,
    epochs=10,
    num_classes=3,
    logging_interval=1,
)
hcm.fit(
    datahandler=datahander,
    use_caches_if_exist=True,
)

# Plot the resulting scores
hcm.plot_scores(axis=1, plot_type="scatter")

πŸ”‘ Methods

Datagnosis builds on D-CAT which is a Hardness Characterization Method Benchmarking framework also from the van der Schaar lab.

For benchmarking of the below methods see https://github.com/seedatnabeel/D-CAT.

Generic methods

Method Type Description Score Reference
Area Under the Margin (AUM) Generic Characterizes data examples based on the margin of a classifier – i.e. the difference between the logit values of the correct class and the next class. Hard - low scores. AUM Paper
Confident Learning Generic Confident learning estimates the joint distribution of noisy and true labels β€” characterizing data as easy and hard for mislabeling. Hard - low scores Confident Learning Paper
Conf Agree Generic Agreement measures the agreement of predictions on the same example. Hard - low scores Conf Agree Paper
Data IQ Generic Data-IQ computes the aleatoric uncertainty and confidence to characterize the data into easy, ambiguous and hard examples. Hard - low confidence scores. High Aleatoric Uncertainty scores define ambiguous Data-IQ Paper
Data Maps Generic Data Maps focuses on measuring variability (epistemic uncertainty) and confidence to characterize the data into easy, ambiguous and hard examples. Hard - low confidence scores. High Epistemic Uncertainty scores define ambiguous Data-Maps Paper
Gradient Normed (GraNd) Generic GraNd measures the gradient norm to characterize data. Hard - high scores GraNd Paper
Error L2-Norm (EL2N) Generic EL2N calculates the L2 norm of error over training in order to characterize data for computational purposes. Hard - high scores EL2N Paper
Forgetting Generic Forgetting scores analyze example transitions through training. i.e., the time a sample correctly learned at one epoch is then forgotten. Hard - high scores Forgetting Paper
Large Loss Generic Large Loss characterizes data based on sample-level loss magnitudes. Hard - high scores Large Loss Paper
Prototypicalilty Generic Prototypicality calculates the latent space clustering distance of the sample to the class centroid as the metric to characterize data. Hard - high scores Prototypicalilty Paper
Variance of Gradients (VOG) Generic VoG (Variance of gradients) estimates the variance of gradients for each sample over training Hard - high scores VOG Paper
Active Learning Guided by Local Sensitivity and Hardness (ALLSH) Images ALLSH computes the KL divergence of softmax outputs between original and augmented samples to characterize data. Hard - high scores ALLSH Paper

Generic type plugins can be used for tabular or image data. Image type plugins only work for images.

πŸ”¨ Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vvvsx tests/ --durations=50

Contributing to datagnosis

We want to make contributing to datagnosis is as easy and transparent as possible. We hope to collaborate with as many people as we can.

Development installation

First create a new environment. It is recommended that you use conda. This can be done as follows:

conda create -n your-datagnosis-env python=3.11
conda activate your-datagnosis-env

Python versions , 3.8, 3.9, 3.10, 3.11 are all compatible, but it is best to use the most up to date version you can, as some models may not support older python versions.

To get the development installation with all the necessary dependencies for linting, testing, auto-formatting, and pre-commit etc. run the following:

git clone https://github.com/vanderschaarlab/datagnosis.git
cd datagnosis
pip install -e .[testing]

Please check that the pre-commit is properly installed for the repository, by running:

pre-commit run --all

This checks that you are set up properly to contribute, such that you will match the code style in the rest of the project. This is covered in more detail below.

⌨️ Our Development Process

πŸ‚ Code Style

We believe that having a consistent code style is incredibly important. Therefore datagnosis imposes certain rules on the code that is contributed and the automated tests will not pass, if the style is not adhered to. These tests passing is a requirement for a contribution being merged. However, we make adhering to this code style as simple as possible. First, all the libraries required to produce code that is compatible with datagnosis's Code Style are installed in the step above when you set up the development environment. Secondly, these libraries are all triggered by pre-commit, so once you are set-up, you don't need to do anything. When you run git commit, any simple changes to enforce the style will run automatically and other required changes are explained in the stdout for you to go through and fix.

datagnosis uses the black and flake8 code formatter to enforce a common code style across the code base. No additional configuration should be needed (see the black documentation for advanced usage).

Also, datagnosis uses isort to sort imports alphabetically and separate into sections.

❕Type Hints

datagnosis is fully typed using python 3.7+ type hints. This is enforced for contributions by mypy, which is a static type-checker.

↩️ Pull Requests

We actively welcome your pull requests.

  1. Fork the repo and create your branch from main.
  2. If you have added code that should be tested, add tests in the same style as those already present in the repo.
  3. If you have changed APIs, document the API change in the PR.
  4. Ensure the test suite passes.
  5. Make sure your code passes the pre-commit, this will be required in order to commit and push, if you have properly installed pre-commit, which is included in the testing extra.

πŸ”Ά Issues

We use GitHub issues to track public bugs. Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue.

πŸ“œ License

By contributing to datagnosis, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree. You should therefore, make sure that if you have introduced any dependencies that they also are covered by a license that allows the code to be used by the project and is compatible with the license in the root directory of this project.

About

A Data-Centric library providing a unified interface for state-of-the-art methods for hardness characterisation of data points.

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published