Table of Contents
  1. Folder structure
  2. What?
  3. Why?
  4. Application
    1. Starting application
    2. Accessing application
    3. Stopping application
    4. Environment variables
  5. Which endpoint to choose?
  6. How does the prefetch endpoint work?
  7. How to check failed background tasks?

Folder structure

.
├── Dockerfile                <- Dockerfile for production environment.
├── Dockerfile.dev            <- Dockerfile for development environment.
├── README.md                 <- Project documentation and information.
├── app
│   ├── __init__.py           <- Initialization for the 'app' package.
│   ├── api                   <- API route definitions and handlers.
│   ├── core                  <- Core application components and utilities.
│   ├── db                    <- Database configuration and management.
│   ├── main.py               <- Main FastAPI application entry point.
│   ├── models                <- Data models and database schemas.
│   ├── tests                 <- Unit and integration tests for the application.
│   ├── utils                 <- Utility functions and helper modules.
│   └── worker.py             <- Background task workers (if applicable).
├── docker-compose-dev.yml    <- Docker Compose configuration for development.
├── docker-compose.yml        <- Docker Compose configuration for production.
├── logs                      <- Log files generated by the application.
├── poetry.lock               <- Lock file for Python dependency management.
├── pyproject.toml            <- Project configuration and dependencies (Poetry).
└── scripts
    ├── codespaces-init.sh    <- Initialization script for Codespaces (if using VS Code Codespaces).
    └── gitpod-init.sh        <- Initialization script for Gitpod (if using Gitpod).


What

The Hunting service facilitates the creation of dataset profiles, i.e. details of dataset variables, correlations, numeric and categorical variables, missing values, etc., exposed as a FastAPI service.

It uses ydata-profiling: https://docs.profiling.ydata.ai/
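
For context, a minimal sketch of how ydata-profiling produces a profile for a single dataset (the file path and variable names here are illustrative, not part of Hunting's code):

```python
# Illustrative only: the ydata-profiling calls that Hunting builds on.
import pandas as pd
from ydata_profiling import ProfileReport

# Load a dataset from any pandas-readable source (hypothetical path).
df = pd.read_csv("titanic.csv")

# minimal=True skips expensive computations such as full correlation matrices.
profile = ProfileReport(df, title="Dataset profile", minimal=True)

profile.to_file("report.html")    # full HTML report
profile_json = profile.to_json()  # the same profile as a JSON string
```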


Why

Hunting serves the following purposes:

  • Automate the process of dataset profiling, e.g. to generate the dataset description for the Dataful datasets page. Hunting provides the following attributes that are used for the dataset description page:
    • Number of Rows
    • Number of Columns
    • Data Preview
    • Metadata (column name, number of distinct values, number of unique values, type, count)

Application

Start Application

  • Execute the following docker-compose command to start the entire Hunting application and the dependent services:

      docker-compose up
    
  • When the application is started using docker-compose, a directory named volumes will be created in the project directory; all the Docker data for the services will be persisted there.

Access Application

Once the application is up and running, you should be able to access it using the following URLs:

| Service | URL | Credentials |
| --- | --- | --- |
| Server API Root | http://0.0.0.0:8000/api/v1 | |
| Swagger | http://0.0.0.0:8000/api/docs | |
| Redoc | http://0.0.0.0:8000/redoc | |
| MongoDB | http://localhost:27017 | Username: root, Password: example |
| Redis | http://localhost:6379 | Password: password |
| Flower Dashboard | http://localhost:5555 | |

Stopping Application

  • Execute the following docker-compose command to stop Hunting and all the components:

      docker-compose stop
    

Or use the following command to stop the application, and remove all the containers and networks that were created:

  docker-compose down

Environment Variables

  • Create a .env file in the root directory based on .env.example.
    • The values in .env.example are pre-configured for running the application using the default docker-compose.yml.
  • If no .env file is found in the root directory, the default values provided in /app/core/config.py will be used for the environment variables.
    • The values in /app/core/config.py are pre-configured for running the application using the default docker-compose.yml.
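
For orientation, a hedged sketch of what such defaults typically look like in a pydantic settings class; the field names and values below are assumptions, not the actual contents of /app/core/config.py:

```python
# Hypothetical sketch of default settings; see /app/core/config.py for the real values.
from pydantic import BaseSettings  # pydantic v1-style settings


class Settings(BaseSettings):
    # Connection strings matching the default docker-compose.yml services (assumed field names).
    MONGODB_URL: str = "mongodb://root:example@localhost:27017"
    REDIS_URL: str = "redis://:password@localhost:6379/0"
    # Prefetch routes are only enabled when this is true (see the prefetch section below).
    ENABLE_PREFETCH: bool = False

    class Config:
        # Values from a .env file in the root directory override these defaults.
        env_file = ".env"


settings = Settings()
```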

Which Endpoint to Choose?

There are two types of endpoints that can be used for dataset profiling:

```mermaid
flowchart TD
    A[Hunting Service] --> B{background task?}
    B --> |Yes| C[prefetch]
    B --> |No| D[description & other specific endpoint]
```

The table below provides a detailed explanation of the above classification:

| prefetch | description |
| --- | --- |
| POST request that takes a list of S3 file paths for datasets | GET request that takes one file path of a dataset |
| Puts the datasets into the Celery queue and processes them as a background job | Processes the dataset on the fly |
| Output is saved into MongoDB for each file path | Pandas profiling output is returned as JSON |
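
As a rough illustration of the on-the-fly side, a client call to the description endpoint might look like the sketch below; the exact path and query parameter name are assumptions, so check Swagger at http://0.0.0.0:8000/api/docs for the real signature:

```python
# Hypothetical client call to the on-the-fly "description" endpoint.
# The path and query parameter name are assumptions.
import requests

response = requests.get(
    "http://0.0.0.0:8000/api/v1/description",
    params={"source": "s3://roapitest/titanic.csv"},
)
response.raise_for_status()

profile = response.json()  # pandas-profiling output returned as JSON
print(profile.keys())
```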

How does the Prefetch Endpoint work?

The following process happens in sequence for the prefetch endpoint:

```mermaid
timeline
    1 : Pass the request body with the list of file paths that are to be processed
    2 : Response output with background job task
      : All the file paths will be put into the celery queue for processing
    3 : Celery workers will pick up one path at a time
    4 : Output will be saved into mongoDB
```

NOTE:

  • The prefetch group of routes will only be enabled if ENABLE_PREFETCH is true.
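
To make the sequence above concrete, here is a minimal sketch of what a worker-side task could look like; the task name, database, and collection are illustrative, and the real logic lives in app/worker.py and may differ:

```python
# Illustrative worker task; the actual implementation in app/worker.py may differ.
import json

import pandas as pd
from celery import Celery
from pymongo import MongoClient
from ydata_profiling import ProfileReport

# Broker/DB URLs match the default docker-compose services (credentials from the table above).
celery_app = Celery("hunting", broker="redis://:password@localhost:6379/0")
mongo = MongoClient("mongodb://root:example@localhost:27017")


@celery_app.task
def profile_dataset(url: str, minimal: bool, trigger_id: str) -> None:
    """Profile a single file path pulled from the queue and persist the result."""
    df = pd.read_csv(url)  # reading s3:// paths requires s3fs to be installed
    report = ProfileReport(df, minimal=minimal)
    # One document per file path, tagged with the trigger_id for later lookup.
    mongo["hunting"]["profiles"].insert_one(
        {"url": url, "trigger_id": trigger_id, "profile": json.loads(report.to_json())}
    )
```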

How to check Failed background tasks?

Use Flower to check the status of the background tasks. A trigger_id is passed with every prefetch request; it is helpful for checking the status of the background task.

  • Example of a request made to prefetch:

```json
{
    "urls": [
        "s3://roapitest/titanic.csv"
    ],
    "minimal": true,
    "samples_to_fetch": 10,
    "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
  • Example of a response from prefetch:

```json
{
  "task_id": "ff8dfbc7-b383-43cd-a0e9-6af7d1226205",
  "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
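
Putting the two together, a prefetch call from a client might look like this sketch; the endpoint path is an assumption, while the request body mirrors the example above:

```python
# Hypothetical client call to the prefetch endpoint; the path is an assumption.
import requests

payload = {
    "urls": ["s3://roapitest/titanic.csv"],
    "minimal": True,
    "samples_to_fetch": 10,
    "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff",
}

response = requests.post("http://0.0.0.0:8000/api/v1/prefetch", json=payload)
response.raise_for_status()

data = response.json()
print(data["task_id"], data["trigger_id"])  # use the trigger_id to filter tasks in Flower
```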

Open the Flower Tasks page and enter the trigger_id in the search bar. It will show only the tasks that are related to that trigger_id. Click on the State column to sort the tasks for each file by status.