Table of Contents
```
.
├── Dockerfile              <- Dockerfile for production environment.
├── Dockerfile.dev          <- Dockerfile for development environment.
├── README.md               <- Project documentation and information.
├── app
│   ├── __init__.py         <- Initialization for the 'app' package.
│   ├── api                 <- API route definitions and handlers.
│   ├── core                <- Core application components and utilities.
│   ├── db                  <- Database configuration and management.
│   ├── main.py             <- Main FastAPI application entry point.
│   ├── models              <- Data models and database schemas.
│   ├── tests               <- Unit and integration tests for the application.
│   ├── utils               <- Utility functions and helper modules.
│   └── worker.py           <- Background task workers (if applicable).
├── docker-compose-dev.yml  <- Docker Compose configuration for development.
├── docker-compose.yml      <- Docker Compose configuration for production.
├── logs                    <- Log files generated by the application.
├── poetry.lock             <- Lock file for Python dependency management.
├── pyproject.toml          <- Project configuration and dependencies (Poetry).
└── scripts
    ├── codespaces-init.sh  <- Initialization script for Codespaces (if using VS Code Codespaces).
    └── gitpod-init.sh      <- Initialization script for Gitpod (if using Gitpod).
```
Hunting is a FastAPI service that facilitates the creation of dataset profiles, i.e. details of dataset variables, correlations, numeric and categorical variables, missing values, etc. It uses ydata-profiling: https://docs.profiling.ydata.ai/
Hunting serves the following purposes:
- Automate the process of dataset profiling, for example to generate dataset descriptions for the Dataful datasets page. Hunting provides the following attributes that are used for the dataset description page (see the sketch after this list):
  - Number of Rows
  - Number of Columns
  - Data Preview
  - Metadata (column name, number of distinct values, number of unique values, type, count)
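For illustration, here is a rough sketch, not Hunting's actual implementation, of how these attributes can be derived with pandas and ydata-profiling; the local file name is a hypothetical stand-in for a dataset:

```python
# Rough sketch (not Hunting's code) of deriving the description attributes.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("titanic.csv")  # hypothetical local copy of a dataset

n_rows, n_columns = df.shape     # Number of Rows / Number of Columns
preview = df.head(10)            # Data Preview

# Metadata per column: distinct values, unique values, type, count
metadata = [
    {
        "column": col,
        "distinct": int(df[col].nunique()),
        "unique": int((df[col].value_counts() == 1).sum()),
        "type": str(df[col].dtype),
        "count": int(df[col].count()),
    }
    for col in df.columns
]

# The full profile (correlations, missing values, etc.) comes from ydata-profiling
profile = ProfileReport(df, minimal=True)
profile_json = profile.to_json()
```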
- Execute the following `docker-compose` command to start the entire Hunting application and the dependent services:
  ```
  docker-compose up
  ```
- When the application is started using `docker-compose`, a directory named `volumes` will be created, in which all the Docker data for the services will be persisted.
- Once the application is up and running, you should be able to access it using the following URLs:
Service | URL |
---|---|
Server | API Root: http://0.0.0.0:8000/api/v1 <br> Swagger: http://0.0.0.0:8000/api/docs <br> Redoc: http://0.0.0.0:8000/redoc |
MongoDB | http://localhost:27017 <br> Username: `root` <br> Password: `example` |
Redis | http://localhost:6379 <br> Password: `password` |
Flower | Dashboard: http://localhost:5555 |
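As a quick sanity check, the snippet below pings each of the services above with the default credentials from the table; it assumes `requests`, `pymongo`, and `redis` are installed and is not part of the Hunting codebase:

```python
# Minimal reachability check for the services listed above.
import requests
from pymongo import MongoClient
from redis import Redis

# FastAPI server: Swagger UI should respond with HTTP 200
# (use localhost if 0.0.0.0 does not resolve on your system)
print(requests.get("http://0.0.0.0:8000/api/docs").status_code)

# MongoDB with the default root/example credentials
mongo = MongoClient("mongodb://root:example@localhost:27017")
print(mongo.admin.command("ping"))

# Redis with the default password
redis = Redis(host="localhost", port=6379, password="password")
print(redis.ping())
```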
- Execute the following `docker-compose` command to stop the application and all of its components:
  ```
  docker-compose stop
  ```
  Or use the following command to stop the application and remove all the containers and networks that were created:
  ```
  docker-compose down
  ```
- Create a `.env` file in the root directory based on `.env.example`.
  - The values in `.env.example` are pre-configured for running the application using the default `docker-compose.yml`.
- If no `.env` file is found in the root directory, the default values provided in `/app/core/config.py` will be used for the environment variables.
  - The values in `/app/core/config.py` are pre-configured for running the application using the default `docker-compose.yml`.
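For orientation, a simplified sketch of the kind of defaults `/app/core/config.py` can provide is shown below; the field names and default values are illustrative assumptions (only `ENABLE_PREFETCH` and the MongoDB/Redis defaults appear elsewhere in this README), not the actual contents of the file:

```python
# Simplified sketch of settings with docker-compose.yml-friendly defaults.
from pydantic import BaseSettings  # pydantic v1-style settings


class Settings(BaseSettings):
    # Field names and defaults are assumptions for illustration only.
    MONGODB_URI: str = "mongodb://root:example@localhost:27017"
    REDIS_URL: str = "redis://:password@localhost:6379/0"
    ENABLE_PREFETCH: bool = False

    class Config:
        env_file = ".env"  # values from a .env file override these defaults


settings = Settings()
```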
There are two types of endpoints that can be used for dataset profiling:

```mermaid
flowchart TD
    A[Hunting Service] --> B{background task?}
    B --> |Yes| C[prefetch]
    B --> |No| D[description & other specific endpoints]
```
The table below provides a detailed explanation of the above classification:

prefetch | description |
---|---|
POST request that takes a list of S3 file paths for datasets | GET request that takes a single file path of a dataset |
Puts the datasets into the Celery queue and processes them as a background job | Processes the dataset on the fly |
Output is saved into MongoDB for each file path | The profiling output is returned as JSON |
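As an illustration of the second column, the snippet below calls the on-the-fly profiling endpoint; the exact route path and query parameter names are assumptions, so check the Swagger UI at http://0.0.0.0:8000/api/docs for the real signature:

```python
# Hedged example of an on-the-fly profiling request (route/params assumed).
import requests

BASE_URL = "http://0.0.0.0:8000/api/v1"

response = requests.get(
    f"{BASE_URL}/description",  # assumed route name
    params={"source": "s3://roapitest/titanic.csv", "minimal": True},
)
response.raise_for_status()
profile = response.json()       # profiling output returned as JSON
print(profile.keys())
```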
The following process happens in sequence for the prefetch endpoint:

```mermaid
timeline
    1 : Pass the request body with the list of file paths that are to be processed
    2 : Response output with the background job task
      : All the file paths will be put into the Celery queue for processing
    3 : Celery workers will pick up one path at a time
    4 : Output will be saved into MongoDB
```
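The sketch below condenses that sequence into code (queue, worker, MongoDB write); the task, database, and collection names are assumptions, and this is not the actual contents of `app/worker.py`:

```python
# Condensed sketch of the prefetch flow: Celery queue -> worker -> MongoDB.
import json

import pandas as pd
from celery import Celery
from pymongo import MongoClient
from ydata_profiling import ProfileReport

celery_app = Celery("hunting", broker="redis://:password@localhost:6379/0")
mongo = MongoClient("mongodb://root:example@localhost:27017")


@celery_app.task
def profile_dataset(url: str, trigger_id: str, minimal: bool = True) -> None:
    # A Celery worker picks up one file path at a time
    df = pd.read_csv(url)  # pandas can read s3:// paths when s3fs is installed
    report = ProfileReport(df, minimal=minimal)
    # Output is saved into MongoDB for each file path
    mongo["hunting"]["profiles"].insert_one(
        {"url": url, "trigger_id": trigger_id, "profile": json.loads(report.to_json())}
    )
```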
NOTE:
- The prefetch group of routes will only be enabled if `ENABLE_PREFETCH` is set to `true`.
- Use Flower to check the status of the background tasks. With every prefetch request, a `trigger_id` is passed, which is helpful for checking the status of the background task.
- Example of a request made to `prefetch`:
```json
{
  "urls": [
    "s3://roapitest/titanic.csv"
  ],
  "minimal": true,
  "samples_to_fetch": 10,
  "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
- Example of a response from `prefetch`:
```json
{
  "task_id": "ff8dfbc7-b383-43cd-a0e9-6af7d1226205",
  "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff"
}
```
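A hedged sketch of issuing the prefetch request and reading the response shown above; the route path is an assumption, so verify it against the Swagger UI:

```python
# Send the example prefetch request shown above (route path assumed).
import requests

payload = {
    "urls": ["s3://roapitest/titanic.csv"],
    "minimal": True,
    "samples_to_fetch": 10,
    "trigger_id": "dd6d2667-366c-4ca5-a403-b227cc2148ff",
}

response = requests.post("http://0.0.0.0:8000/api/v1/prefetch", json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"task_id": "...", "trigger_id": "..."}
```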
Open the Flower tasks page and enter the `trigger_id` in the search bar; it will show only the tasks related to that `trigger_id`. Click on `state` to sort the tasks by status for each file.
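Task state can also be inspected programmatically instead of through the Flower UI; the sketch below assumes Flower's HTTP API is enabled and that the shape of its `/api/tasks` response matches recent Flower versions:

```python
# Rough sketch: filter Flower tasks by trigger_id, mirroring the UI search.
import requests

TRIGGER_ID = "dd6d2667-366c-4ca5-a403-b227cc2148ff"

tasks = requests.get("http://localhost:5555/api/tasks", params={"limit": 200}).json()
for task_id, info in tasks.items():
    # Match tasks whose arguments mention the trigger_id
    if TRIGGER_ID in str(info.get("args", "")) + str(info.get("kwargs", "")):
        print(task_id, info.get("state"))
```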