- Table of Contents
- About The Project
- Getting Started
- Requirements
- Local development
- Usage
- Environment Variables
- Maintainers
- License
IRyS (Intelligent Repository System) is a digital repository system for storing documents and searching over them. When a document is stored, it is processed to extract important information such as metadata and entities. Searches over documents combine semantic similarity with matching on the extracted metadata and entities. Other features include authentication, notifications, repository management, and access management.
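As an illustration of the search model described above, here is a minimal sketch (not the actual IRyS implementation) of ranking documents by embedding similarity while filtering on extracted metadata:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search(query_vec, query_meta, documents):
    """Keep only documents whose metadata matches the query's filters,
    then rank them by embedding similarity to the query (sketch)."""
    matches = [
        doc for doc in documents
        if all(doc["metadata"].get(k) == v for k, v in query_meta.items())
    ]
    return sorted(
        matches,
        key=lambda doc: cosine_similarity(query_vec, doc["vector"]),
        reverse=True,
    )
```

In the real system the embeddings come from the BERT serving container described below, and filtering/ranking happens inside Elasticsearch rather than in Python.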
To get a local copy up and running follow these simple steps.
- Pyenv (recommended) for Python version management
- Python ^3.10.x
  - To install using pyenv:
    pyenv install 3.10.x
- Poetry for Python package and environment management
- Postgres
- Redis
- Elasticsearch cloud service
  - For instructions on how to set up the Elasticsearch cloud service, refer to the Elasticsearch section.
You can run the following commands to download the BERT model:
cd bertserving
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
unzip cased_L-12_H-768_A-12.zip
List of released pretrained BERT models:
Model | Architecture |
---|---|
BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters |
To set up the Elasticsearch cloud service, you can follow the steps below:
- Create an account in Elastic Cloud.
- Create a new deployment.
- The deployment will be created in a few minutes. After that, Elasticsearch will give you a password for the default user `elastic`. Fill the value of `ELASTICSEARCH_USER` with `elastic` and `ELASTICSEARCH_PASSWORD` with the password given by Elasticsearch.
- Go to the manage deployment page, copy the cloud deployment ID, and paste it into the `ELASTICSEARCH_CLOUD_ID` environment variable.
- Create a new API key, copy it, and paste it into the `ELASTICSEARCH_API_KEY` environment variable.
- Set the `ELASTICSEARCH_CLOUD` environment variable to `True`.
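The variables set in these steps would typically be consumed when constructing an `elasticsearch.Elasticsearch(...)` client. Here is a hedged sketch (the actual IRyS configuration loading may differ) of building the client keyword arguments for a cloud deployment:

```python
import os


def cloud_es_kwargs(env=None):
    """Keyword arguments for an Elasticsearch client pointed at
    Elastic Cloud, from the variables set in the steps above.
    Illustrative only; IRyS's own config code may differ."""
    env = env if env is not None else os.environ
    if env.get("ELASTICSEARCH_CLOUD") != "True":
        raise ValueError("ELASTICSEARCH_CLOUD must be True for the cloud setup")
    return {
        "cloud_id": env["ELASTICSEARCH_CLOUD_ID"],
        "api_key": env["ELASTICSEARCH_API_KEY"],
    }
```

The resulting dict can be splatted into the client constructor, e.g. `Elasticsearch(**cloud_es_kwargs())`.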
To set up Elasticsearch locally, you can follow the steps below:
- Install Elasticsearch using Docker by following this link.
- Change the `ELASTICSEARCH_CLOUD` environment variable to `False`.
- Change the `ELASTICSEARCH_HOST` environment variable to `localhost` and `ELASTICSEARCH_SCHEME` to `http`.
- Change `ELASTICSEARCH_PORT` to the port assigned during installation.
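For a local install, the scheme, host, and port variables above are combined into the node URL the client connects to. A minimal sketch (illustrative; not the actual IRyS config code):

```python
import os


def local_es_url(env=None):
    """Node URL for a local Elasticsearch install, built from the
    variables changed in the steps above (illustrative sketch)."""
    env = env if env is not None else os.environ
    scheme = env.get("ELASTICSEARCH_SCHEME", "http")
    host = env.get("ELASTICSEARCH_HOST", "localhost")
    port = env.get("ELASTICSEARCH_PORT", "9200")
    return f"{scheme}://{host}:{port}"
```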
- Install each dependency from the requirements section above.
- Install Python dependencies by running:
  poetry install
  NOTE: If you get an error while installing the psycopg2-binary package, try running $ poetry run pip install psycopg2-binary first, then re-run $ poetry install
- Run poetry shell to open a Poetry shell.
- Train the machine learning model for document classification by running this command:
  python3 app/classification/mlutil/classifier_training.py
- Install the pre-commit git hook (for auto-formatting purposes):
  pre-commit install
- Find all files below.
- Duplicate those files and rename the duplicates from the [prefix_name].example pattern to [prefix_name]
- Open the newly created files and adjust their content to your environment. For an explanation of each environment variable, see the environment variables section.
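The duplicate-and-rename step can be scripted. A minimal sketch, assuming the example files follow the `*.example` naming pattern described above (the exact file locations depend on your checkout):

```python
import shutil
from pathlib import Path


def materialize_env_files(root="."):
    """Copy every *.example file under `root` to the same name with
    the trailing .example stripped, skipping files that already exist."""
    created = []
    for example in Path(root).rglob("*.example"):
        target = example.with_suffix("")  # drops the trailing .example
        if not target.exists():
            shutil.copy(example, target)
            created.append(target)
    return created
```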
If you want to migrate the database, you can run the following command.
alembic upgrade head
If you want to fully roll back the database, you can run the following command.
alembic downgrade base
If you want to roll back to a specific version, you can run the following command.
alembic downgrade [version]
To see the list of available versions, you can run the following command.
alembic history
If you want to add new migration, you can run the following command to generate new migration file.
alembic revision --autogenerate -m "migration message"
Don't forget to add the model import to the migrations/env.py file (if it is not already there).
from app.<folder>.models import *
Run docker compose by running
docker-compose -f docker-compose-local.yml up
Below are services that are running:
- bert-serving: Used for sentence embedding using BERT
- redis: Used for celery result backend and message broker
- celery_worker: Used for running Celery tasks
- celery_beat: Used for running celery beat (cron jobs scheduler)
- flower: Used for monitoring celery tasks, located at http://localhost:5557
Below are some useful commands for docker:
- To rebuild docker containers, run
docker-compose -f docker-compose-local.yml up --build
- To remove unused docker containers, run
docker container prune
- To remove unused docker images, run
docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
- To exec into a docker container, run
docker exec -it <container_name> bash
- Run poetry shell to open a Poetry shell.
- Lastly, run the app using this command:
  ENV=local|development|production python3 main.py
- To access the documentation, you can go to localhost:8000/docs on your web browser.
Name | Description | Example Value |
---|---|---|
DEV_DB_HOST | Database host address | localhost |
DEV_DB_USER | Database user's username | postgres |
DEV_DB_PASSWORD | Database user's password | postgres |
DEV_DB_NAME | Database name used for the application | IRyS_v1 |
PROD_DB_HOST | Database host address | localhost |
PROD_DB_USER | Database user's username | postgres |
PROD_DB_PASSWORD | Database user's password | postgres |
PROD_DB_NAME | Database name used for the application | IRyS_v1 |
ELASTICSEARCH_CLOUD | Whether to use Elasticsearch Cloud or not | True |
ELASTICSEARCH_CLOUD_ID | Elasticsearch Cloud deployment ID | fcggg111hgg2jjh2:jhhhllk |
ELASTICSEARCH_USER | Elasticsearch username (cloud or local) | elastic |
ELASTICSEARCH_PASSWORD | Elasticsearch password (cloud or local) | password |
ELASTICSEARCH_API_KEY | Elasticsearch API key (when using Elasticsearch Cloud) | 1234567890 |
ELASTICSEARCH_SCHEME | Elasticsearch scheme (when using local Elasticsearch) | http |
ELASTICSEARCH_HOST | Elasticsearch host address (when using local Elasticsearch) | localhost |
ELASTICSEARCH_PORT | Elasticsearch port (when using local Elasticsearch) | 9200 |
MAIL_USERNAME | Email username | username |
MAIL_PASSWORD | Email password | password |
MAIL_FROM | Email sender address | |
MAIL_PORT | Email port | 587 |
MAIL_SERVER | Email server | smtp.gmail.com |
CELERY_BROKER_URL | Celery broker URL | redis://localhost:6379/0 |
CELERY_RESULT_BACKEND | Celery result backend URL | redis://localhost:6379/0 |
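As an illustration of how the DEV_/PROD_ prefixes above might be selected, here is a sketch keyed off the ENV value used to start the app (the actual IRyS configuration code may differ):

```python
import os


def db_config(env_name, env=None):
    """Pick the DEV_* or PROD_* database variables based on the ENV
    value: production uses PROD_*, anything else uses DEV_*.
    Illustrative only; defaults mirror the example values above."""
    env = env if env is not None else os.environ
    prefix = "PROD_" if env_name == "production" else "DEV_"
    return {
        "host": env.get(prefix + "DB_HOST", "localhost"),
        "user": env.get(prefix + "DB_USER", "postgres"),
        "password": env.get(prefix + "DB_PASSWORD", "postgres"),
        "name": env.get(prefix + "DB_NAME", "IRyS_v1"),
    }
```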
Note:
- For more on Elasticsearch, see the Elasticsearch section.
Variable | Description | Default |
---|---|---|
CELERY_BROKER_URL | The URL of the broker to use. | redis://redis:6379/0 |
CELERY_RESULT_BACKEND | The URL of the result backend to use. | redis://redis:6379/0 |

Note:
- The values of CELERY_BROKER_URL and CELERY_RESULT_BACKEND should match the Redis configuration in the docker-compose.yml file.
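These two variables would typically be read when configuring the Celery app. A minimal sketch with defaults matching the table above (the real app may configure Celery differently):

```python
import os


def celery_settings(env=None):
    """Broker and result-backend URLs for Celery, defaulting to the
    values in the table above (illustrative sketch)."""
    env = env if env is not None else os.environ
    return {
        "broker_url": env.get("CELERY_BROKER_URL", "redis://redis:6379/0"),
        "result_backend": env.get("CELERY_RESULT_BACKEND", "redis://redis:6379/0"),
    }
```

A Celery app could then be built as `Celery(broker=settings["broker_url"], backend=settings["result_backend"])`.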
Variable | Description | Default |
---|---|---|
MODEL_DOWNLOAD_URL | The URL of the BERT model to download. | https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip |
MODEL_NAME | The name of the BERT model to download. | uncased_L-12_H-768_A-12 |

Note:
- For more on BERT, see the BERT model section.
- The value of MODEL_NAME will be used as the name of the folder that contains the BERT model.
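Putting MODEL_DOWNLOAD_URL and MODEL_NAME together, the download step could be scripted like this (a sketch mirroring the wget/unzip commands from the BERT section; not the bert-serving container's actual code):

```python
import os
import urllib.request
import zipfile


def download_bert_model(url, model_name, dest="bertserving"):
    """Download the BERT checkpoint archive from `url` and unpack it
    under `dest`; the archive is expected to extract into a folder
    named `model_name` (illustrative sketch)."""
    os.makedirs(dest, exist_ok=True)
    archive = os.path.join(dest, model_name + ".zip")
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return os.path.join(dest, model_name)
```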
List of Maintainers
- Gde Anantha Priharsena
- Reihan Andhika Putra
- Shifa Salsabiila
- Reyhan Emyr Arrosyid
- Andres Jerriel Sinabutar
Copyright (c) 2023, IRyS-Team.