Simplifying cheque processing for banks using Transformers

shivalikasingh95/cheque-easy

ChequeEasy: Banking with Transformers

ChequeEasy is a project that aims to simplify the process of approval of bank cheques and make it easier and quicker for both bank officials and customers.

The core of the project is Donut (Document Understanding Transformer), which was proposed in the paper OCR-free Document Understanding Transformer and is used for parsing the required data from cheques. Donut is based on a very simple transformer encoder and decoder architecture. Its main USP is that it is an OCR-free approach to Visual Document Understanding (VDU). OCR-based techniques come with several limitations, such as requiring additional downstream models, lacking an understanding of document structure, and depending on hand-crafted rules. Donut helps you get rid of all of these OCR-specific limitations.
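To make the OCR-free idea a bit more concrete: Donut's decoder emits a tagged token sequence (for example `<s_field>value</s_field>`) that is then converted into structured data. The sketch below is a simplified, standard-library illustration of that conversion for flat fields only. It is not the parser used by this project (the Hugging Face DonutProcessor ships its own `token2json` method), and the field names `payee_name` and `amt_in_figures` are hypothetical.

```python
import re


def donut_tokens_to_dict(sequence: str) -> dict:
    """Convert a flat Donut-style tag sequence into a dict.

    Simplified illustration: handles only non-nested <s_key>value</s_key> pairs.
    """
    fields = {}
    # \1 forces the closing tag to match the key captured in the opening tag.
    for match in re.finditer(r"<s_(\w+)>(.*?)</s_\1>", sequence, flags=re.DOTALL):
        fields[match.group(1)] = match.group(2).strip()
    return fields


# Hypothetical model output for a cheque image (field names are assumptions).
example = "<s_payee_name>Jane Doe</s_payee_name><s_amt_in_figures>1500</s_amt_in_figures>"
parsed = donut_tokens_to_dict(example)
print(parsed)
```

Because the model produces the structured fields directly from the image, no separate OCR engine or rule-based post-processing is needed to locate them.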

This project is not limited to fine-tuning Donut but is an end-to-end solution from labeling to inference. Pipelines have been created to cover different aspects of the MLOps lifecycle of this project, i.e. data processing including annotation, model training, deployment and inference. Both the pipelines and the underlying infrastructure (stacks) to run those pipelines have been set up using the ZenML MLOps framework!

The project leverages Label Studio for annotation and MLflow for experiment tracking as well as model deployment! The model has been fine-tuned with the help of the Hugging Face transformers and datasets libraries!

The model for the project has been trained using a subset of this Kaggle dataset. The original dataset contains images of cheques from 10 different banks. A filtered version of this dataset containing images of cheques from 4 banks that are more commonly found in the Indian Banking Sector was created along with corresponding ground truth. This dataset is available on the Hugging Face Hub for download.

Check out this blog for more details about the project. You can even try out the demo of the project available on the Hugging Face Hub.

Note: This project was developed as a submission for ZenML's Month of MLOps Competition

Project Structure:

  1. The entry points for running the project are the run_label_process_data.py and run_train_deploy.py files.
  2. app.py and predict_cheque_parser.py correspond to the Gradio app set up for this project.
  3. materializers folder contains custom materializers implemented for VisionEncoderDecoderConfig and DonutProcessor. ZenML uses materializers to read & write artifacts from an artifact store associated with a stack.
  4. pipelines folder includes all the pipelines implemented as part of this project i.e. labelling, data_processing, train_and_deploy and inference.
  5. Similarly, steps folder includes all the steps corresponding to the pipelines declared under the pipelines dir.
  6. utils contains some utility files for dataset preparation, model training, etc.
  7. zenml_stacks contains shell scripts with zenml CLI commands that can be used to deploy the stacks that ZenML will use to run the pipelines of this project. Note: The labelling pipeline must be run on the stack generated by the file zenml_stacks/label_data_process_stack.sh. The training and inference pipelines must be run on the stack generated by the file zenml_stacks/train_inference_stack.sh. The data_process pipeline can be run as part of either stack.
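The pipeline-to-stack rules in the note above can be summarised as a small lookup. This is only an illustrative sketch: the helper `stack_scripts_for` and the short pipeline names used as keys are hypothetical and are not part of the repository.

```python
# Which stack setup script(s) produce a stack that each pipeline can run on,
# as described in the note above. Keys are illustrative pipeline names.
STACK_SCRIPTS = {
    "label": ["zenml_stacks/label_data_process_stack.sh"],
    "train": ["zenml_stacks/train_inference_stack.sh"],
    "inference": ["zenml_stacks/train_inference_stack.sh"],
    # data_process can run on either stack.
    "data_process": [
        "zenml_stacks/label_data_process_stack.sh",
        "zenml_stacks/train_inference_stack.sh",
    ],
}


def stack_scripts_for(pipeline_type: str) -> list:
    """Return the stack setup scripts that are valid for a pipeline type."""
    try:
        return STACK_SCRIPTS[pipeline_type]
    except KeyError:
        raise ValueError(f"unknown pipeline type: {pipeline_type}") from None


print(stack_scripts_for("train"))
```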

Prerequisites:

  1. Installing dependencies: Create your Python virtual environment and install zenml and zenml[server]. Note that ZenML is compatible with Python 3.7, 3.8, and 3.9. This project uses some custom changes (support for generating an OCR labelling config for Label Studio) that are not yet available in an official zenml release, so please install zenml as shown below.
    pip install -q git+https://github.com/shivalikasingh95/zenml.git@label_studio_ocr_config

Now, install ZenML server:

    pip install "zenml[server]"

All the dependencies of this project are mentioned in the requirements.txt file. However, I would recommend installing all integrations of zenml using the zenml integration install command to ensure full compatibility with zenml.

    zenml integration install label_studio azure mlflow torch huggingface pytorch-lightning pillow

However, this project has a few additional dependencies such as mysqlclient, nltk and donut-python which would have to be installed separately as these are not covered by the zenml integration command.

Also, transformers must be installed from this git branch, as it contains some minor fixes which are not yet available in the official transformers library:

    pip install -q git+https://github.com/shivalikasingh95/transformers.git@image_utils_fix

The following dependencies must be installed if you want to run the Gradio demo app for the project: word2number, gradio, symspellpy

  1. Install all system level dependencies:

This is needed to connect to the MySQL server, which will be used by MLflow as a backend store to keep track of all experiment runs, metadata, etc.

    sudo apt-get update
    sudo apt-get install python3-dev default-libmysqlclient-dev build-essential

If you don't want to run your MLflow server with a MySQL backend, you can skip this step.

  1. Cloud resources: At the moment, ZenML supports only cloud-based artifact stores for use with Label Studio as the annotator component. If you wish to use the annotation component of this project, you need an AWS/GCP/Azure account, which will be used for storing the artifacts generated by pipelines run using the annotator stack. The setup below is described for Azure, but a similar setup can be done for AWS/GCP. To see how to set up Label Studio with ZenML using AWS/GCP, refer to this link.

    To use Label Studio with Azure, make sure you have an Azure storage account and an Azure Key Vault. You can leverage ZenML's MLOps stack recipes to create these in case you don't have them. For Azure, you can take a look at the azure-minimal stack. Note that this recipe creates a few additional resources apart from Azure Blob Storage and Key Vault, so you might want to modify the stack according to your needs before deploying.

  2. Downloading the dataset: This project is built using a filtered version of this Kaggle dataset. The original dataset contains images of cheques and corresponding labels for 10 different banks. For the sake of simplicity, I have created a filtered version of this dataset containing data for only 4 banks which are more popular in the Indian banking sector, i.e. Axis, Canara, HSBC and ICICI. This filtered version of the dataset is now available on the Hugging Face Hub.

To download the original Kaggle dataset, follow the steps below.

  • You must have a Kaggle account. Go to your account settings, scroll to the API section and click Expire API Token to remove previous tokens. Then click Create New API Token; it will download kaggle.json for you to use.

  • Now create a directory to keep your downloaded access token which will be used by Kaggle for authentication.

    mkdir ~/.kaggle
  • Copy the downloaded JSON file to the directory created in the previous step.
    cp kaggle.json ~/.kaggle/
  • Change the permissions of the file:
    chmod 600 ~/.kaggle/kaggle.json
  • Install the kaggle library:
    pip install -q kaggle
  • You can now run the below command in your console to verify that your setup is working correctly.
    kaggle datasets list
  • If it is running fine, run the command below to download the cheque-images dataset:
    kaggle datasets download -d medali1992/cheque-images
  • Now unzip the downloaded dataset and restructure the folders as shown below:
    cheques_dataset
    │   cheques_label_file.csv
    │
    └───cheque_images
        │   1.jpg
        │   2.jpg
        │   ....
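Before running the data processing pipeline, it can help to confirm that the folder has the layout shown above. The following is a minimal sketch using only the standard library; the function name `check_dataset_layout` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def check_dataset_layout(root: str) -> list:
    """Return a list of problems found in the dataset folder; empty means OK."""
    root_path = Path(root)
    problems = []
    # The label file sits directly under the dataset root.
    if not (root_path / "cheques_label_file.csv").is_file():
        problems.append("missing cheques_label_file.csv")
    # Cheque images live in the cheque_images sub-folder.
    images_dir = root_path / "cheque_images"
    if not images_dir.is_dir():
        problems.append("missing cheque_images directory")
    elif not any(images_dir.glob("*.jpg")):
        problems.append("cheque_images contains no .jpg files")
    return problems


if __name__ == "__main__":
    for problem in check_dataset_layout("cheques_dataset"):
        print(problem)
```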

  1. Hardware Requirements: It is recommended to run the training pipeline on an instance with GPU support. You can leverage Google Colab for this purpose. The other pipelines can be run on CPU.

Setting up your ZenML stacks:

  1. Initialise this repository as a ZenML repository by running:
    zenml init
  1. Start the ZenML server by running the command below:
    zenml up
  1. If you want to set up the stack for labelling using Label Studio, you need to set the following environment variables:

    • ANNOT_STACK_NAME: Name to assign to the zenml stack that will be used for labelling.
    • AZURE_KEY_VAULT: Name of the key vault that will be used as secrets-manager for your stack.
    • STORAGE_ACCOUNT: This is the name of the Azure storage account which contains the bucket that can be used by ZenML as an artifact store.
    • BUCKET_NAME: The path of the Azure Blob storage that will be used as an artifact store for this stack. It would be something like az://<storage_bucket_or_container_name>
    • STORAGE_ACCOUNT_KEY: This refers to the access token value for the Azure storage account.
    • LABEL_STUDIO_API_KEY: This refers to the Access Token of your Label Studio instance. You'll first have to start your Label Studio instance using the command label-studio start -p 8094 and go to the Account page to retrieve your Access Token value to set this environment variable.
    • LABEL_DATA_STORAGE_BUCKET_NAME: Path of the folder in Azure Blob Storage (or any cloud storage you wish to connect) that will have the dataset that needs to be loaded into Label Studio for annotation.

    Once you have set up the required variables, you can run the zenml_stacks/label_data_process_stack.sh script to set up your ZenML stack for running the labelling pipeline.

    Once this script finishes running, it will tell you on which port ZenML has launched Label Studio for annotation. Make a note of this.

  2. If you want to set up the stack for training and inference, you need to set the following environment variables:

    • TRAIN_STACK_NAME: Name to assign to the zenml stack that will be used for training and inference.
    • MLFLOW_TRACKING_URI: URI of the MLflow tracking server that will be used for experiment tracking.
    • MLFLOW_USERNAME: Username for authenticating with the MLflow tracking server.
    • MLFLOW_PASSWORD: Password for authenticating with the MLflow tracking server.

    Once you have set up the required variables, you can run the zenml_stacks/train_inference_stack.sh script to set up your ZenML stack for running the training and inference pipelines.
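Since both stack scripts depend on the environment variables listed above, a quick check before running them can save a failed setup. The sketch below uses the variable names from this section; the helper `missing_vars` is hypothetical and is not part of the repository.

```python
import os

# Variables expected by zenml_stacks/label_data_process_stack.sh
LABEL_STACK_VARS = [
    "ANNOT_STACK_NAME",
    "AZURE_KEY_VAULT",
    "STORAGE_ACCOUNT",
    "BUCKET_NAME",
    "STORAGE_ACCOUNT_KEY",
    "LABEL_STUDIO_API_KEY",
    "LABEL_DATA_STORAGE_BUCKET_NAME",
]

# Variables expected by zenml_stacks/train_inference_stack.sh
TRAIN_STACK_VARS = [
    "TRAIN_STACK_NAME",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_USERNAME",
    "MLFLOW_PASSWORD",
]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(TRAIN_STACK_VARS)
    if missing:
        print("Please set:", ", ".join(missing))
```

Passing `LABEL_STACK_VARS` instead performs the same check for the labelling stack.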

How to run pipelines:

  1. If you don't wish to run the labelling pipeline, you can skip the creation and setup of the labelling stack, set up only the training stack, and proceed with the pre-labelled Kaggle dataset to run the training.

To run the data processing pipeline, run the command below. This pipeline takes the Kaggle dataset that we downloaded earlier and uses it to prepare a dataset compatible with the Hugging Face datasets library. At the end, the pipeline pushes the prepared dataset to the HF Hub.

    python run_train_deploy.py --pipeline_type=data_process
  1. To run the training pipeline, run the command below. This pipeline loads the dataset prepared by the data_process pipeline from the HF Hub and uses it to fine-tune the Donut model. The trained model is logged to the MLflow registry at the end. The pipeline also includes a model evaluation step where the trained model is evaluated on the test set; if the obtained accuracy satisfies the minimum accuracy bar, a deployment of the model is triggered and performed.
    python run_train_deploy.py --pipeline_type=train
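The evaluate-then-deploy gate described in the training step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual step code: the field-level exact-match metric, the function names, the example field names and the 0.8 threshold are all hypothetical.

```python
def field_accuracy(predictions, ground_truths):
    """Fraction of ground-truth fields that were predicted exactly (assumed metric)."""
    total = 0
    correct = 0
    for pred, truth in zip(predictions, ground_truths):
        for key, expected in truth.items():
            total += 1
            if pred.get(key) == expected:
                correct += 1
    return correct / total if total else 0.0


def should_deploy(accuracy: float, min_accuracy: float = 0.8) -> bool:
    """Trigger deployment only when accuracy meets the minimum bar (threshold is an assumption)."""
    return accuracy >= min_accuracy


# Hypothetical predictions and labels for a single cheque.
preds = [{"payee_name": "Jane Doe", "amt_in_figures": "1500"}]
truths = [{"payee_name": "Jane Doe", "amt_in_figures": "1800"}]
accuracy = field_accuracy(preds, truths)
print(accuracy, should_deploy(accuracy))
```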
  1. To run the inference pipeline, run the command below. This pipeline loads an input and the deployed model prediction service, and sends a request to the model endpoint to retrieve the prediction corresponding to the input data.
    python run_train_deploy.py --pipeline_type=inference
  1. In case you wish to run the labelling pipeline, first ensure that your current stack is set to $ANNOT_STACK_NAME. You can use the command zenml stack describe to check your currently active stack. You can run the command zenml stack set $ANNOT_STACK_NAME to switch to the previously created stack with the name $ANNOT_STACK_NAME.

  2. To start the labelling process, an annotation project (or dataset) must first be created in Label Studio. For this purpose, run the pipeline below:

    python run_label_process_data.py --pipeline_type=label

The above command will set up a dataset (or project) in Label Studio. To check whether the pipeline succeeded in creating a dataset, run the command zenml annotator dataset list. To start labelling, run zenml annotator dataset annotate <dataset_name>.

Once you have finished labelling, you can run the pipeline below to retrieve the annotations from Label Studio and convert them into a format similar to the label file available as part of the Kaggle dataset.

    python run_label_process_data.py --pipeline_type=get_labelled_data

Once you have the converted label file, you can run the data_process pipeline to use the labelled dataset and the corresponding label file to produce a dataset compatible with Hugging Face datasets. At present, this data_process logic is written with the Kaggle dataset in mind. If you're using a different dataset, you might have to modify this logic accordingly.

    python run_label_process_data.py --pipeline_type=data_process
