Table of Contents
  1. Folder structure
  2. What?
  3. Why?
  4. Application
    1. Starting application
    2. Accessing application
    3. Stopping application
    4. Environment variables
  5. Which endpoint to choose?
  6. How does the prefetch endpoint work?
  7. How to check failed background tasks?

Folder structure

.
├── Dockerfile                <- Dockerfile for production environment.
├── Dockerfile.dev            <- Dockerfile for development environment.
├── README.md                 <- Project documentation and information.
├── app
│   ├── __init__.py           <- Initialization for the 'app' package.
│   ├── api                   <- API route definitions and handlers.
│   ├── core                  <- Core application components and utilities.
│   ├── db                    <- Database configuration and management.
│   ├── main.py               <- Main FastAPI application entry point.
│   ├── models                <- Data models and database schemas.
│   ├── tests                 <- Unit and integration tests for the application.
│   ├── utils                 <- Utility functions and helper modules.
│   └── worker.py             <- Background task workers (if applicable).
├── docker-compose-dev.yml    <- Docker Compose configuration for development.
├── docker-compose.yml        <- Docker Compose configuration for production.
├── logs                      <- Log files generated by the application.
├── poetry.lock               <- Lock file for Python dependency management.
├── pyproject.toml            <- Project configuration and dependencies (Poetry).
└── scripts
    ├── codespaces-init.sh    <- Initialization script for Codespaces (if using VS Code Codespaces).
    └── gitpod-init.sh        <- Initialization script for Gitpod (if using Gitpod).


What

The Hunting service facilitates the creation of dataset profiles, i.e. details of dataset variables, correlations, numeric and categorical variables, missing values, etc., exposed as a FastAPI service.

It uses ydata-profiling: https://docs.profiling.ydata.ai/
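
For context, a minimal sketch of how ydata-profiling produces a profile for a single dataset (the file path and variable names here are illustrative, not part of Hunting's code):

```python
# Illustrative only: the ydata-profiling calls that Hunting builds on.
import pandas as pd
from ydata_profiling import ProfileReport

# Load a dataset from any pandas-readable source (hypothetical path).
df = pd.read_csv("titanic.csv")

# minimal=True skips expensive computations such as full correlation matrices.
profile = ProfileReport(df, title="Dataset profile", minimal=True)

profile.to_file("report.html")    # full HTML report
profile_json = profile.to_json()  # the same profile as a JSON string
```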


Why

Hunting serves the following purposes:

  • Automate the process of dataset profiling, e.g. to generate the dataset description for the Dataful datasets page. Hunting provides the following attributes that are used for the dataset description page:
    • Number of Rows
    • Number of Columns
    • Data Preview
    • Metadata (column name, number of distinct values, number of unique values, type, count)

Application

Start Application

  • Execute the following docker-compose command to start the entire Hunting application and the dependent services:

      docker-compose up
    
  • When the application is started using docker-compose, a directory named volumes will be created in the project directory; all the Docker data for the services will be persisted there.

Access Application

Once the application is up and running, you should be able to access it using the following URLs:

| Service | URL | Credentials |
| --- | --- | --- |
| Server API Root | http://0.0.0.0:8000/api/v1 | |
| Swagger | http://0.0.0.0:8000/api/docs | |
| Redoc | http://0.0.0.0:8000/redoc | |
| MongoDB | http://localhost:27017 | Username: root, Password: example |
| Redis | http://localhost:6379 | Password: password |
| Flower Dashboard | http://localhost:5555 | |

Stopping Application

  • Execute the following docker-compose command to stop Hunting and all the components:

      docker-compose stop
    

Or use the following command to stop the application, and remove all the containers and networks that were created:

  docker-compose down

Environment Variables

  • Create a .env file in the root directory based on .env.example.
    • The values in .env.example are pre-configured for running the application using the default docker-compose.yml.
  • If no .env file is found in the root directory, the default values provided in /app/core/config.py will be used for the environment variables.
    • The values in /app/core/config.py are pre-configured for running the application using the default docker-compose.yml.
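
For orientation, a hedged sketch of what such defaults typically look like in a pydantic settings class; the field names and values below are assumptions, not the actual contents of /app/core/config.py:

```python
# Hypothetical sketch of default settings; see /app/core/config.py for the real values.
from pydantic import BaseSettings  # pydantic v1-style settings


class Settings(BaseSettings):
    # Connection strings matching the default docker-compose.yml services (assumed field names).
    MONGODB_URL: str = "mongodb://root:example@localhost:27017"
    REDIS_URL: str = "redis://:password@localhost:6379/0"
    # Prefetch routes are only enabled when this is true (see the prefetch section below).
    ENABLE_PREFETCH: bool = False

    class Config:
        # Values from a .env file in the root directory override these defaults.
        env_file = ".env"


settings = Settings()
```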

Which Endpoint to Choose?

There are two types of endpoints that can be used for dataset profiling:

```mermaid
flowchart TD
    A[Hunting Service] --> B{background task?}
    B --> |Yes| C[prefetch]
    B --> |No| D[description & other specific endpoint]
```

The table below provides a detailed explanation of the above classification:

| prefetch | description |
| --- | --- |
| POST request that takes a list of S3 file paths for datasets | GET request that takes one file path of a dataset |
| Puts the datasets into the Celery queue and processes them as a background job | Processes the dataset on the fly |
| Output is saved into MongoDB for each file path | Pandas profiling output is returned as JSON |
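
As a rough illustration of the on-the-fly side, a client call to the description endpoint might look like the sketch below; the exact path and query parameter name are assumptions, so check Swagger at http://0.0.0.0:8000/api/docs for the real signature:

```python
# Hypothetical client call to the on-the-fly "description" endpoint.
# The path and query parameter name are assumptions.
import requests

response = requests.get(
    "http://0.0.0.0:8000/api/v1/description",
    params={"source": "s3://roapitest/titanic.csv"},
)
response.raise_for_status()

profile = response.json()  # pandas-profiling output returned as JSON
print(profile.keys())
```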

How does the Prefetch Endpoint work?

The following process happens in sequence for the prefetch endpoint:

```mermaid
timeline
    1 : Pass the request body with the list of file paths that are to be processed
    2 : Response output with background job task
      : All the file paths will be put into the celery queue for processing
    3 : Celery workers will pick up one path at a time
    4 : Output will be saved into mongoDB
```

NOTE:

  • The prefetch group of routes will only be enabled if ENABLE_PREFETCH is true.
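
To make the sequence above concrete, here is a minimal sketch of what a worker-side task could look like; the task name, database, and collection are illustrative, and the real logic lives in app/worker.py and may differ:

```python
# Illustrative worker task; the actual implementation in app/worker.py may differ.
import json

import pandas as pd
from celery import Celery
from pymongo import MongoClient
from ydata_profiling import ProfileReport

# Broker/DB URLs match the default docker-compose services (credentials from the table above).
celery_app = Celery("hunting", broker="redis://:password@localhost:6379/0")
mongo = MongoClient("mongodb://root:example@localhost:27017")


@celery_app.task
def profile_dataset(url: str, minimal: bool, trigger_id: str) -> None:
    """Profile a single file path pulled from the queue and persist the result."""
    df = pd.read_csv(url)  # reading s3:// paths requires s3fs to be installed
    report = ProfileReport(df, minimal=minimal)
    # One document per file path, tagged with the trigger_id for later lookup.
    mongo["hunting"]["profiles"].insert_one(
        {"url": url, "trigger_id": trigger_id, "profile": json.loads(report.to_json())}
    )
```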

How to check Failed background tasks?

Use Flower to check the status of the background tasks. A trigger_id is passed with every prefetch request; it is helpful for checking the status of the background task.

  • Example of a request made to prefetch:

```json
{
    "urls": [
        "s3://roapitest/titanic.csv"
    ],
    "minimal": true,
    "samples_to_fetch": 10,
    "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
  • Example of a response from prefetch:

```json
{
  "task_id": "ff8dfbc7-b383-43cd-a0e9-6af7d1226205",
  "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
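
Putting the two together, a prefetch call from a client might look like this sketch; the endpoint path is an assumption, while the request body mirrors the example above:

```python
# Hypothetical client call to the prefetch endpoint; the path is an assumption.
import requests

payload = {
    "urls": ["s3://roapitest/titanic.csv"],
    "minimal": True,
    "samples_to_fetch": 10,
    "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff",
}

response = requests.post("http://0.0.0.0:8000/api/v1/prefetch", json=payload)
response.raise_for_status()

data = response.json()
print(data["task_id"], data["trigger_id"])  # use the trigger_id to filter tasks in Flower
```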

Open the Flower Tasks page and enter the trigger_id in the search bar. It will show only the tasks that are related to that trigger_id. Click on the State column to sort the tasks for each file by status.