Merge pull request #12 from ai-cfia/issue8-missing-chunks
issue #8: install doc and missing_chunk_queries
rngadam authored Sep 25, 2023
2 parents 92ef1ea + 08b0045 commit 5ea67c5
Showing 55 changed files with 1,148 additions and 548 deletions.
10 changes: 9 additions & 1 deletion .devcontainer/devcontainer.json
@@ -18,7 +18,15 @@
],

// Configure tool-specific properties.
// "customizations": {},
"customizations": {
"vscode":{
"extensions": [
"timonwong.shellcheck",
"GitHub.vscode-pull-request-github",
"charliermarsh.ruff"
]
}
},

// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
// "remoteUser": "root"
12 changes: 12 additions & 0 deletions .env.template
@@ -0,0 +1,12 @@
LOUIS_DSN=
PGBASE=
PGUSER=
USER=
PGHOST=
POSTGRES_PASSWORD=
PGPASSWORD=
OPENAI_API_KEY=
AZURE_OPENAI_SERVICE=
LOUIS_SCHEMA=
DB_SERVER_CONTAINER_NAME=
PGDATA=
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
.env
.env**
.pgpassfile
dumps/**
reports/**
53 changes: 53 additions & 0 deletions DEVELOPER.md
@@ -0,0 +1,53 @@
# Development guidelines for louis-db

## Making changes to the database schema

### Run latest schema locally

* Set up the .env environment variables:
  * LOUIS_DSN: the Data Source Name (DSN) used to configure the Louis database connection.
  * PGBASE: the name of the PostgreSQL database to connect to (the scripts in bin/ pass it to `psql` and `pg_dump` as `-d $PGBASE`).
  * PGUSER: the username or role used to authenticate and access the PostgreSQL database.
  * USER: the username used for validation and access.
  * PGHOST: the hostname or IP address of the server hosting the PostgreSQL database.
  * PGPASSWORD: the password used to authenticate the user when connecting to the PostgreSQL database.
  * POSTGRES_PASSWORD: the password for the database, used for authentication when connecting to PostgreSQL.
  * PGDATA: the path to the directory where PostgreSQL data files are stored.
  * OPENAI_API_KEY: the API key used to authenticate requests to the OpenAI API.
  * AZURE_OPENAI_SERVICE: information related to the Azure-based OpenAI service.
  * LOUIS_SCHEMA: the Louis schema within the database.
  * DB_SERVER_CONTAINER_NAME: the name of your database server container.
* Run the database locally (see bin/postgres.sh)
* Restore the latest schema dump
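
Before starting the database, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of louis-db (which has its own check in bin/lib.sh): the function name `report_missing_vars` is hypothetical, and it assumes every argument is a valid shell variable name.

```shell
# Hypothetical helper: print the name of every listed variable that is
# unset or empty. Returns 0 when all are set, 1 otherwise.
report_missing_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in any POSIX shell:
    # copy the value of the variable named by $name into $value
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is to source the file with `set -a; . ./.env; set +a` and then call `report_missing_vars LOUIS_DSN PGBASE PGUSER PGHOST PGPASSWORD LOUIS_SCHEMA`, stopping if it returns non-zero.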

### Before every change

* dump the schema with pg_dump using ```bin/backup-db-docker.sh```

### Create change

* make sure to create a GitHub issue (issue #X) first, describing the work to be done
* create a branch ```issueX-descriptive-name```
* add a new SQL file YYYY-mm-dd-issueX-descriptive-name
  * explain in the top header comment the changes to be made
  * provide the original DDL of the files to be modified
* create a test case in tests/test_db.py
  * load your new SQL file within a transaction (that will be rolled back)
  * ensure you have an assert to test for
* once your test passes, commit the change to the database by running your script with bin/psql.sh
* you should now be able to remove the SQL file loading step from the test and still run it successfully
* re-run the test suite and fix any exposed database functions that fail because of your changes
* dump the new schema as louis_v00X, with X incremented by 1
* test the new schema with your client apps.
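
The naming conventions in the steps above can be generated rather than typed by hand. This is a small sketch under stated assumptions: the helper names `slugify`, `branch_name` and `sql_file_name` are hypothetical and do not exist in louis-db, and no file extension is added because the guideline does not specify one.

```shell
# Hypothetical helpers that build names following the conventions above.

slugify() {
  # lower-case, then turn every run of non-alphanumeric characters into one dash
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$//'
}

branch_name() {
  # $1: issue number, $2: short description
  printf 'issue%s-%s\n' "$1" "$(slugify "$2")"
}

sql_file_name() {
  # $1: issue number, $2: short description,
  # $3 (optional): date as YYYY-mm-dd, defaults to today
  printf '%s-%s\n' "${3:-$(date +%Y-%m-%d)}" "$(branch_name "$1" "$2")"
}
```

For example, `git checkout -b "$(branch_name 8 "missing chunks")"` would create a branch named like the one merged in this commit.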
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,7 +1,7 @@
# syntax=docker/dockerfile:1
FROM alpine
RUN apk add && apk add postgresql-client
COPY docker-entrypoint.sh /entrypoint.sh
COPY bin/docker-entrypoint.sh /entrypoint.sh
ENV LOUIS_DSN=
ENV LOUIS_SCHEMA=
ENV LOAD_DATA_ONLY=
18 changes: 16 additions & 2 deletions README.md
@@ -2,6 +2,20 @@

## Installing python package

If you need to interface with the database, use this to install:

```
pip install git+https://github.com/ai-cfia/[email protected]
```

You'll often want to add, move or modify existing database layer functions found in louis-db from a client repository.

To edit them, you can install the package in editable mode, for example:

```
pip install -e git+https://github.com/ai-cfia/louis-db#egg=louis_db
```

This checks out the latest source into a local git clone in src/louis-db, so edits in that directory are immediately available to louis-crawler.

Don't forget to create a PR with your changes once you're done!
35 changes: 0 additions & 35 deletions backup-db-docker.sh

This file was deleted.

15 changes: 0 additions & 15 deletions backup-db.sh

This file was deleted.

64 changes: 64 additions & 0 deletions bin/README.md
@@ -0,0 +1,64 @@
# Creating a new schema

## Environment

This assumes:

* you are running WSL
* you are running a dockerized version of PostgreSQL 15 under WSL
* you are running louis-db in a DevContainer under Visual Studio Code
* your source is on WSL under ~/src

## Configuration

Database connection parameters are set in the .env file.

You can create multiple .env.NAME files and symlink as needed.

Working on local source:

```
ln -sf .env.louis_v004_local .env
```

Switching to the target:

```
ln -sf .env.louis_v005_azure .env
```
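
Because `ln -sf` silently creates a dangling link when the target is misspelled, a guarded version of the switch may be useful. A minimal sketch, assuming the `.env.NAME` layout described above; the function name `use_env` is hypothetical and not part of louis-db.

```shell
# Hypothetical helper: point .env at .env.NAME, refusing if the target is missing.
# $1: NAME suffix, $2 (optional): project directory, defaults to the current one
use_env() {
  dir=${2:-.}
  if [ ! -f "$dir/.env.$1" ]; then
    echo "no such file: $dir/.env.$1" >&2
    return 1
  fi
  # relative link target, so the link stays valid if the project directory moves
  ln -sf ".env.$1" "$dir/.env"
}
```

With this in place, `use_env louis_v005_azure` performs the same switch as the command above, but fails loudly if `.env.louis_v005_azure` does not exist.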

## Running the database server locally

* use Dockerfile in postgres directory
* use ```bin/postgres.sh``` script as your startup script (symlink)

## Editing

* Create ad hoc modifications as scripts in sql/ with the proper YYYY-mm-dd prefix
* Create tests that apply these SQL scripts in a transaction and test them
* Once satisfied, commit the changes to the database

## Backing up schema and data

in this example, the modified louis_v004 becomes the louis_v005 schema:

```
./bin/dump-versioned-schema.sh louis_v004 louis_v005
./bin/dump-versioned-data.sh louis_v004 louis_v005
```
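
The target schema name in the example is simply the source name with its version number incremented. The sketch below derives it automatically; `next_schema` is a hypothetical helper, and it assumes schema names end in a zero-padded number, as `louis_v004` does.

```shell
# Hypothetical helper: louis_v004 -> louis_v005, keeping the zero padding.
next_schema() {
  prefix=${1%%[0-9]*}      # everything before the first digit, e.g. louis_v
  number=${1#"$prefix"}    # the numeric suffix, e.g. 004
  width=${#number}         # number of digits to preserve
  # strip leading zeros so the shell does not read the number as octal
  stripped=$(printf '%s' "$number" | sed 's/^0*//')
  printf "%s%0${width}d\n" "$prefix" "$(( ${stripped:-0} + 1 ))"
}
```

It could then be combined with the scripts above, for instance `./bin/dump-versioned-schema.sh louis_v004 "$(next_schema louis_v004)"`.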

## Loading schema

Change your .env to link to your target database first:

```
./bin/load-versioned-schema.sh louis_v005
```

Manually validate that the schema is as expected at this point (for example with a DBeaver ERD diagram) before loading the data:

```
./bin/load-versioned-data.sh louis_v005
```
7 changes: 7 additions & 0 deletions bin/backup-db-docker.sh
@@ -0,0 +1,7 @@
#!/bin/bash
DIRNAME=`dirname $0`
. $DIRNAME/lib.sh

docker cp $DIRNAME/backup-db.sh louis-db-server:backup-db.sh
docker cp $DIRNAME/lib.sh louis-db-server:lib.sh
docker exec -it -e PGDUMP_FILENAME=/dev/stdout --env-file $ENV_FILE louis-db-server ./backup-db.sh > $PGDUMP_FILENAME
19 changes: 19 additions & 0 deletions bin/backup-db.sh
@@ -0,0 +1,19 @@
#!/bin/bash
DIRNAME=`dirname $0`
. $DIRNAME/lib.sh

if [ ! -f "$PGDUMP_FILENAME" ]; then
echo "preparing to dump $PGBASE.$LOUIS_SCHEMA to $PGDUMP_FILENAME"
# pg_dump reads PGDATABASE, not the project-specific PGBASE, so pass the database explicitly
pg_dump -d $PGBASE --schema=$LOUIS_SCHEMA --no-owner --no-privileges --file $PGDUMP_FILENAME
else
echo "File $PGDUMP_FILENAME already exists"
fi

if [ -f "$PGDUMP_FILENAME" ]; then
if [ ! -f "$PGDUMP_FILENAME.zip" ]; then
zip $PGDUMP_FILENAME.zip $PGDUMP_FILENAME
else
echo "File $PGDUMP_FILENAME.zip already exists"
fi
fi
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
10 changes: 7 additions & 3 deletions dump-versioned-data.sh → bin/dump-versioned-data.sh
@@ -8,7 +8,7 @@ INPUT_SCHEMA=$1
OUTPUT_SCHEMA=$2

if [ -z "$PGHOST" -o "$PGHOST" == "localhost" ]; then
RELPATH=dumps/$OUTPUT_SCHEMA
RELPATH=$PROJECT_DIR/dumps/$OUTPUT_SCHEMA
OUTPUT_DIR=`realpath $RELPATH`
if [ -d "$OUTPUT_DIR" ]; then
echo "Warning: Directory exist: $OUTPUT_DIR"
@@ -18,7 +18,11 @@ else
OUTPUT_DIR=/var/lib/postgresql/data
fi

$PSQL_ADMIN < $DIRNAME/sql/schema_to_csv.sql
$PSQL_ADMIN -f $PROJECT_DIR/sql/schema_to_csv.sql
if [ $? -ne 0 ]; then
echo "Failed to install schema_to_csv function"
exit 3
fi

echo "Outputting all tables from schema $INPUT_SCHEMA as csv to $OUTPUT_DIR on the database server"
$PSQL_ADMIN -c "select * from schema_to_csv('$INPUT_SCHEMA', '$OUTPUT_DIR')"
$PSQL_ADMIN -c "select * from public.schema_to_csv('$INPUT_SCHEMA'::text, '$OUTPUT_DIR'::text)"
9 changes: 5 additions & 4 deletions dump-versioned-schema.sh → bin/dump-versioned-schema.sh
@@ -1,23 +1,24 @@
#!/bin/bash
DIRNAME=`dirname $0`
. $DIRNAME/lib.sh
TODAY=`date +%Y-%m-%d`

if [ -z $2 ]; then
echo "usage: $0 source_schema output_schema"
echo "example: $0 louis_v005 louis_v006"
exit 1
fi

SOURCE_SCHEMA=$1
TARGET_SCHEMA=$2

SCHEMA_OUTPUT_DIR=$DIRNAME/dumps/$TARGET_SCHEMA
SCHEMA_OUTPUT_DIR=$PROJECT_DIR/dumps/$TARGET_SCHEMA
mkdir -p $SCHEMA_OUTPUT_DIR
SCHEMA_OUTPUT_FILENAME=$SCHEMA_OUTPUT_DIR/schema.sql
if [ -f "$SCHEMA_OUTPUT_FILENAME" ]; then
echo "File $SCHEMA_OUTPUT_FILENAME already exists"
#exit 2
echo "File $SCHEMA_OUTPUT_FILENAME already exists, exiting"
exit 2
fi
echo "dumping schema to $SCHEMA_OUTPUT_FILENAME"
pg_dump -n $SOURCE_SCHEMA -d $PGBASE \
--no-owner --no-privileges --no-security-labels \
--no-table-access-method --no-tablespaces --schema-only \
4 changes: 4 additions & 0 deletions bin/install-postgresl-client-15.sh
@@ -0,0 +1,4 @@
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt update
sudo apt install postgresql-client-15
52 changes: 52 additions & 0 deletions bin/lib.sh
@@ -0,0 +1,52 @@
#!/bin/bash
DIRNAME=$(dirname $(realpath $0))
PARENT_DIR=$DIRNAME/..
PROJECT_DIR=$(realpath $PARENT_DIR)
ENV_FILE=$PROJECT_DIR/.env

if [ -f "$ENV_FILE" ]; then
# shellcheck source=lib.sh
. "$ENV_FILE"
else
echo "WARNING: File $ENV_FILE does not exist, relying on environment variables"
fi

check_environment_variables_defined () {
variable_not_set=0
for VARIABLE in "$@"; do
if [ -z "${!VARIABLE}" ]; then
echo "Environment variable $VARIABLE is not set"
variable_not_set=1
fi
done

if [ $variable_not_set -eq 1 ]; then
echo "One or more variables are not defined, the program cannot continue"
exit 1
fi
}

export PGOPTIONS="--search_path=$LOUIS_SCHEMA"
export PGBASE
export PGDATABASE
export PGHOST
export PGUSER
export PGPORT
export PGPASSFILE
export PGPASSWORD

# match the major version only, e.g. "psql (PostgreSQL) 15.4" but not 14.15
VERSION15=$(psql --version | grep -E ' 15\.')

if [ -z "$VERSION15" ]; then
echo "postgresql-client-15 required"
exit 1
fi

TODAY=$(date +%Y-%m-%d)

if [ -z "$PGDUMP_FILENAME" ]; then
PGDUMP_FILENAME=$PROJECT_DIR/dumps/$TODAY.$PGBASE.pg_dump
fi

export PSQL_ADMIN="psql -v ON_ERROR_STOP=1 --single-transaction -d $PGBASE"
File renamed without changes.