Release/1.0.5 #84

Merged: 39 commits, Mar 27, 2023
Commits
6ef4906
Added instructions for using airflow with remote dot and data databases
dividor Dec 9, 2022
0bfc56e
Bump numpy from 1.20.3 to 1.22.0 in /dot
dependabot[bot] Dec 9, 2022
f93a1b1
add test for expression is true
lrnzcig Dec 10, 2022
3c4403b
make lint checks fail if any of the files fails
lrnzcig Dec 10, 2022
72c55d8
improve pylint scores
lrnzcig Dec 10, 2022
970f04d
Seems sometimes column can be null, resulting in schema tests going i…
dividor Dec 5, 2022
ed4937b
Updates to the UI
dividor Dec 5, 2022
1289ba1
Removing hard-coded uuid on accepted_values.
dividor Dec 5, 2022
8794889
Handle cases when there are no test results
dividor Dec 5, 2022
4bd1a25
Commit only the file without conflicts
Dec 16, 2022
cc832b5
DKW-716 - Remove unneeded packages from DOT requirements (pyreqs and …
JanPeterDatakind Dec 21, 2022
fb350ea
DKW-716 Remove unneeded packages from DOT requirements: Had to add on…
JanPeterDatakind Dec 21, 2022
d4ba548
Bump oauthlib from 3.2.1 to 3.2.2 in /dot
dependabot[bot] Feb 6, 2023
4f2e437
Bump ipython from 7.31.1 to 8.10.0 in /dot
dependabot[bot] Feb 11, 2023
ed9db76
Merge branch 'develop' into minor-fixes_cc45e08
dividor Mar 14, 2023
c21a5d3
Merge pull request #68 from datakind/minor-fixes_cc45e08
dividor Mar 14, 2023
60c380c
Merge pull request #69 from datakind/minor-fixes_aed9937
JanPeterDatakind Mar 14, 2023
6fc2dc6
Merge pull request #67 from datakind/minor-fixes_bd027fb
JanPeterDatakind Mar 14, 2023
f9fa659
Merge pull request #70 from datakind/minor-fixes_4e883da
JanPeterDatakind Mar 14, 2023
55f98d3
Merge branch 'develop' into minor-fixes_42ad437
JanPeterDatakind Mar 14, 2023
41447bf
Merge pull request #74 from datakind/minor-fixes_42ad437
JanPeterDatakind Mar 14, 2023
9effa00
Merge pull request #63 from datakind/airflow-doc-updates
JanPeterDatakind Mar 14, 2023
4c64341
Merge pull request #76 from datakind/package_clean
JanPeterDatakind Mar 14, 2023
169e752
Merge branch 'develop' into test_type_expression_is_true
JanPeterDatakind Mar 15, 2023
1fadcb4
Merge pull request #65 from datakind/test_type_expression_is_true
JanPeterDatakind Mar 15, 2023
03e8be7
Merge branch 'develop' into dependabot/pip/dot/numpy-1.22.0
JanPeterDatakind Mar 15, 2023
78f730b
Merge pull request #64 from datakind/dependabot/pip/dot/numpy-1.22.0
JanPeterDatakind Mar 15, 2023
90cd361
Merge branch 'develop' into dependabot/pip/dot/ipython-8.10.0
JanPeterDatakind Mar 15, 2023
132a2d5
Merge pull request #80 from datakind/dependabot/pip/dot/ipython-8.10.0
JanPeterDatakind Mar 15, 2023
a5205bf
Merge branch 'develop' into dependabot/pip/dot/oauthlib-3.2.2
JanPeterDatakind Mar 15, 2023
8630686
Merge pull request #78 from datakind/dependabot/pip/dot/oauthlib-3.2.2
JanPeterDatakind Mar 15, 2023
278033f
Making sure that the changes of commit 162f9ef, that were merged into…
JanPeterDatakind Mar 15, 2023
63f3239
Minor changes so that self tests don't fail and don't produce depreca…
JanPeterDatakind Mar 16, 2023
81bd45d
Changed test_results_summary.csv to match the actual output of a DOT …
JanPeterDatakind Mar 16, 2023
15f3996
Changed test_results_summary.csv and test_results.csv to match the ac…
JanPeterDatakind Mar 17, 2023
bd3739c
Adjusted Readme to reflect changes to entity_id in example queries an…
JanPeterDatakind Mar 21, 2023
7380bee
Reverted changes that had overwritten the previous linting of this file.
JanPeterDatakind Mar 22, 2023
0f96ebd
Reverted left over changes that had overwritten the previous linting …
JanPeterDatakind Mar 22, 2023
bf38b7e
Fixes to Appsmith UI (JSFunction - getEntitycolumns)
JanPeterDatakind Mar 27, 2023
67 changes: 67 additions & 0 deletions CONTRIBUTING.md
@@ -16,6 +16,73 @@
you can open a new issue using a relevant [issue form](https://github.com/dataki

As a general rule, we don’t assign issues to anyone. If you find an issue to work on, you are welcome to open a PR with a fix.

### More complex configuration options

All the configuration files must be located under the [config](dot/config) folder of the DOT.

### Main config file

The main config file must be called `dot_config.yml` and located in the top-level [config](dot/config) folder. Note that
this file is excluded from version control. You may use the [example dot_config yaml](dot/config/example/dot_config.yml)
as a template.

Besides the DOT DB connection described above, the following additional config options are available.

#### Connection parameters for each of the projects to run

For each of the projects you would like to run, add a key to the DOT config yaml with the following structure:
```
<project_name>_db:
type: connection type e.g. postgres
host: host
user: username
pass: password
  port: port number, e.g. 5432
dbname: database name
schema: schema name, e.g. public
  threads: number of threads for DBT, e.g. 4
```
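To sanity-check such an entry, a minimal sketch (illustrative Python, not part of the DOT codebase; the project name `myproject` and all connection values below are hypothetical) could validate that every required field is present:

```python
# Required fields per the <project_name>_db template above.
REQUIRED_FIELDS = {"type", "host", "user", "pass", "port", "dbname", "schema", "threads"}

def validate_project_db(config: dict, project_name: str) -> list:
    """Return the sorted list of missing fields for <project_name>_db (empty if complete)."""
    connection = config.get(f"{project_name}_db", {})
    return sorted(REQUIRED_FIELDS - set(connection))

# Hypothetical config, as it might look after loading dot_config.yml (e.g. with PyYAML)
config = {
    "myproject_db": {
        "type": "postgres", "host": "localhost", "user": "dot_user",
        "pass": "secret", "port": 5432, "dbname": "dot_db",
        "schema": "public", "threads": 4,
    }
}
print(validate_project_db(config, "myproject"))  # -> []
```

A check like this can catch a misspelled or missing key before DBT fails with a less obvious connection error.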

#### Output schema suffix

The DOT generates two kinds of database objects:
- Entities of the models that are being tested, e.g. assessments, follow ups, patients
- Results of the failing tests

By default, these objects would be created in the same schema as the project's original data
(thus polluting the DB). If the key `output_schema_suffix` is set, its value is appended as a suffix: the
output objects go to `<project_schema>_<schema_suffix>`
(e.g. to `public_tests` if the project schema is `public` and the suffix is set to `tests` in the lines above).

Note that this mechanism relies on a DBT feature, and that the same suffix also applies to the Great Expectations (GE) tests.
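The suffix rule can be summarized in a short sketch (illustrative Python; DOT itself applies this via DBT's schema configuration, so treat this as a paraphrase rather than the actual implementation):

```python
from typing import Optional

def output_schema(project_schema: str, suffix: Optional[str]) -> str:
    """Schema where DOT writes entities and test results, per output_schema_suffix."""
    return f"{project_schema}_{suffix}" if suffix else project_schema

print(output_schema("public", "tests"))  # -> public_tests
print(output_schema("public", None))     # -> public (objects land in the project schema)
```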

#### Save passed tests

The key `save_passed_tests` accepts boolean values. If set to true, the results of passing tests will also be stored
in the DOT DB. If not, only the results of failing tests will be stored.

### Other config file locations
Optional configuration for DBT and Great Expectations can be added, per project, in a structure as follows.

```bash
|____config
| |____<project_name>
| | |____dbt
| | | |____profiles.yml
| | | |____dbt_project.yml
| | |____ge
| | | |____great_expectations.yml
| | | |____config_variables.yml
| | | |____batch_config.json
```
In general these customizations are not needed; they apply only in scenarios with particular requirements and call
for a deeper knowledge of the DOT and of DBT and/or Great Expectations.
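As a quick reference, the layout above can be expressed as a list of paths (a sketch; the `config` root and the project name `my_project` are placeholders):

```python
from pathlib import Path

def expected_config_paths(config_root: str, project: str) -> list:
    """Optional per-project DBT and Great Expectations config files, per the tree above."""
    base = Path(config_root) / project
    return [
        base / "dbt" / "profiles.yml",
        base / "dbt" / "dbt_project.yml",
        base / "ge" / "great_expectations.yml",
        base / "ge" / "config_variables.yml",
        base / "ge" / "batch_config.json",
    ]

for path in expected_config_paths("config", "my_project"):
    print(path.as_posix())
```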

There are examples for all the files above under [this folder](dot/config/example/project_name). For each of the
files you want to customize, you may copy and adapt the examples provided following the directory structure above.

More details in the [config README](dot/config/README.md).

## Making Code changes

## Setting up a Development Environment
119 changes: 34 additions & 85 deletions README.md
@@ -275,17 +275,16 @@
also use the DOT user interface for tests, for more details please see section s
The DOT will run tests against user-defined views onto the underlying data. These views are called "entities" and defined in table `dot.configured_entities`:


| Column | Description |
| :----------- | :----------- |
| entity_id | UUID of the entity |
| entity_name | Name of the entity e.g. ancview_danger_sign |
| Column | Description |
| :----------- |:--------------------------------------------------------------------------|
| entity_id | Name of the entity e.g. ancview_danger_sign |
| entity_category | Category of the entity e.g. anc => needs to be in `dot.entity_categories` |
| entity_definition | String for the SQL query that defines the entity |
| entity_definition | String for the SQL query that defines the entity |

For example, this would be an insert command to create `ancview_danger_sign`:

```postgres-sql
INSERT INTO dot.configured_entities VALUES('b05f1f9c-2176-46b0-8e8f-d6690f696b9b',
INSERT INTO dot.configured_entities (project_id,entity_id,entity_category,entity_definition,date_added,date_modified,last_updated_by) VALUES('Project1',
'ancview_danger_sign', 'anc', '{{ config(materialized=''view'') }}
{% set schema = <schema> %}

@@ -294,8 +293,6 @@
from {{ schema }}.ancview_danger_sign');

```

Note: UUID in the above statement will be overwritten with an automatically generated value.

All entities use Jinja macro statements - the parts between `{ ... }` - which the DOT uses to create the entity
materialized views in the correct database location. Use the above format for any new entities you create.
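To make the substitution concrete, here is a toy rendering of those placeholders (illustrative Python only; in reality DOT delegates this to DBT's Jinja engine, so the string handling below is a deliberate simplification):

```python
import re

def render_entity(definition: str, schema: str) -> str:
    """Naively resolve the Jinja-style placeholders of an entity definition."""
    out = definition.replace("{% set schema = <schema> %}", "")
    out = re.sub(r"\{\{\s*config\([^)]*\)\s*\}\}", "", out)  # drop materialization hint
    out = out.replace("{{ schema }}", schema)
    return out.strip()

definition = (
    "{{ config(materialized='view') }}\n"
    "{% set schema = <schema> %}\n"
    "select * from {{ schema }}.ancview_danger_sign"
)
print(render_entity(definition, "public"))  # -> select * from public.ancview_danger_sign
```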

@@ -368,58 +365,58 @@
generated one.
<br><br>
```
'INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '0cdc9702-91e0-3499-b6f0-4dec12ad0f08', 'ASSESS-1', 3, '', '',
'', 'dot_model__ancview_pregnancy', 'relationships', 'uuid', '',
'', 'ancview_pregnancy', 'relationships', 'uuid', '',
$${"name": "danger_signs_with_no_pregnancy", "to": "ref('dot_model__ancview_danger_sign')", "field": "pregnancy_uuid"}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
2. `unique`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '52d7352e-56ee-3084-9c67-e5ab24afc3a3', 'DUPLICATE-1', 3, '',
'', '', '6ba8075f-6f35-4ff1-be3a-4c75d0884bf4', 'unique', 'uuid', 'alternative index?', '',
'', '', 'ancview_pregnancy', 'unique', 'uuid', 'alternative index?', '',
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
3. `not_negative_string_column`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '8aca2bee-9e95-3f8a-90e9-153714e05367', 'INCONSISTENT-1', 3,
'', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'not_negative_string_column', 'patient_age_in_years', '',
'', '', '', 'ancview_pregnancy', 'not_negative_string_column', 'patient_age_in_years', '',
$${"name": "patient_age_in_years"}$$, '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
4. `not_null`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '549c0575-e64c-3605-85a9-70356a23c4d2', 'MISSING-1', 3, '',
'', '', '638ed10b-3a2f-4f18-9ca1-ebf23563fdc0', 'not_null', 'patient_id', '', '', '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
'', '', 'ancview_pregnancy', 'not_null', 'patient_id', '', '', '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
5. `accepted_values`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '935e6b61-b664-3eab-9d67-97c2c9c2bec0', 'INCONSISTENT-1', 3,
'', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'accepted_values', 'fp_method_being_used', '',
'', '', '', 'ancview_pregnancy', 'accepted_values', 'fp_method_being_used', '',
$${"values": ['oral mini-pill (progestogen)', 'male condom', 'female sterilization', 'iud', 'oral combination pill', 'implants', 'injectible']}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
6. `possible_duplicate_forms`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '7f78de0e-8268-3da6-8845-9a445457cc9a', 'DUPLICATE-1', 3, '',
'', '', '66f5d13a-8f74-4f97-836b-334d97932781', 'possible_duplicate_forms', '', '',
'', '', 'ancview_pregnancy', 'possible_duplicate_forms', '', '',
$${"table_specific_reported_date": "delivery_date", "table_specific_patient_uuid": "patient_id", "table_specific_uuid": "uuid"}$$, '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
7. `associated_columns_not_null`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'd74fc600-31c3-307d-9501-5b7f6b09aff5', 'MISSING-1', 3, '',
'', '', 'dot_model__iccmview_assessment', 'associated_columns_not_null', 'diarrhea_dx', 'diarrhea diagnosis',
'', '', 'ancview_pregnancy', 'associated_columns_not_null', 'diarrhea_dx', 'diarrhea diagnosis',
$${"name": "diarrhea_dx_has_duration", "col_value": True, "associated_columns": ['max_symptom_duration']}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
8. `expect_similar_means_across_reporters`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '0cdc9702-91e0-3499-b6f0-4dec12ad0f08', 'BIAS-1', 3,
'Test for miscalibrated thermometer', '', '', 'baf349c9-c919-40ff-a611-61ddc59c2d52', 'expect_similar_means_across_reporters',
'Test for miscalibrated thermometer', '', '', 'ancview_pregnancy', 'expect_similar_means_across_reporters',
'child_temperature_pre_chw', '', '{"key": "reported_by","quantity": "child_temperature_pre_chw",
"form_name": "dot_model__iccmview_assessment","id_column": "reported_by"}', '2022-01-19 20:00:00.000 -0500',
'2022-01-19 20:00:00.000 -0500', 'your-name');
@@ -428,7 +425,7 @@
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '3081f033-e8f4-4f3b-aea8-36f8c5df05dc', 'INCONSISTENT-1', 3,
'Wrong treatment/dosage arising from wrong age of children (WT-1)', '', '', 'baf349c9-c919-40ff-a611-61ddc59c2d52',
'Wrong treatment/dosage arising from wrong age of children (WT-1)', '', '', 'ancview_pregnancy',
'expression_is_true', '', '',
$${"name": "t_under_24_months_wrong_dosage", "expression": "malaria_act_dosage is not null", "condition": "(patient_age_in_months<24) and (malaria_give_act is not null)"}$$,
'2022-02-14 19:00:00.000 -0500', '2022-02-14 19:00:00.000 -0500', 'your-name');
@@ -437,7 +434,7 @@
<br><br>
Custom SQL queries are a special case because they must have `primary_table` and `primary_table_id_field` specified within the SQL query, as shown below:
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1', 6, 'Test for new family planning method (NFP-1)', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'custom_sql', '', '',
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1', 6, 'Test for new family planning method (NFP-1)', '', '', 'ancview_pregnancy', 'custom_sql', '', '',
format('{%s: %s}',
to_json('query'::text),
to_json($query$
@@ -527,72 +524,7 @@
custom SQL query. Given this, there is a useful Postgres function which will ret
see 'Seeing the raw data for failed tests' above.


## More complex configuration options

All the configuration files must be located under the [config](dot/config) folder of the DOT.

### Main config file

The main config file must be called `dot_config.yml` and located in the top-level [config](dot/config) folder. Note that
this file is excluded from version control. You may use the [example dot_config yaml](dot/config/example/dot_config.yml)
as a template.

Besides the DOT DB connection described above, the following additional config options are available.

#### Connection parameters for each of the projects to run

For each of the projects you would like to run, add a key to the DOT config yaml with the following structure:
```
<project_name>_db:
type: connection type e.g. postgres
host: host
user: username
pass: password
  port: port number, e.g. 5432
dbname: database name
schema: schema name, e.g. public
  threads: number of threads for DBT, e.g. 4
```

#### Output schema suffix

The DOT generates two kinds of database objects:
- Entities of the models that are being tested, e.g. assessments, follow ups, patients
- Results of the failing tests

By default, these objects would be created in the same schema as the project's original data
(thus polluting the DB). If the key `output_schema_suffix` is set, its value is appended as a suffix: the
output objects go to `<project_schema>_<schema_suffix>`
(e.g. to `public_tests` if the project schema is `public` and the suffix is set to `tests` in the lines above).

Note that this mechanism relies on a DBT feature, and that the same suffix also applies to the Great Expectations (GE) tests.

#### Save passed tests

The key `save_passed_tests` accepts boolean values. If set to true, the results of passing tests will also be stored
in the DOT DB. If not, only the results of failing tests will be stored.

### Other config file locations
Optional configuration for DBT and Great Expectations can be added, per project, in a structure as follows.

```bash
|____config
| |____<project_name>
| | |____dbt
| | | |____profiles.yml
| | | |____dbt_project.yml
| | |____ge
| | | |____great_expectations.yml
| | | |____config_variables.yml
| | | |____batch_config.json
```
In general these customizations are not needed; they apply only in scenarios with particular requirements and call
for a deeper knowledge of the DOT and of DBT and/or Great Expectations.

There are examples for all the files above under [this folder](dot/config/example/project_name). For each of the
files you want to customize, you may copy and adapt the examples provided following the directory structure above.

More details in the [config README](dot/config/README.md).
### Please refer to [CONTRIBUTING.md](./CONTRIBUTING.md) for information on more complex configuration options.

## How to visualize the results using Superset

@@ -706,7 +638,7 @@
NOTE: You might need to use docker-compose on some hosts.

`docker compose -f docker-compose-with-airflow.yml down -v`

### Running the DOT in Airflow
### Running the DOT in Airflow (Demo)

A DAG has been included which copies data from the uploaded DB dump into the DOT DB 'data_ScanProject1' schema, and then runs
the toolkit against this data. To do this ...
@@ -733,6 +665,23 @@
Or to run just DOT stage ...

`airflow tasks test run_dot_project run_dot 2022-03-01`


### Running the DOT in Airflow (Connecting to external databases)

The following instructions illustrate how to use a local Airflow environment that connects to external databases for both the data and the DOT.

**NOTE:** These instructions are for illustrative purposes only. If using Airflow in production, it is important that it is set up correctly,
does not expose an HTTP connection to the internet, and has adequate network security (firewall, strong passwords, etc.)

1. Edit [`./dot/dot_config.yml`](./dot/dot_config.yml) and set the correct parameters for your external `dot_db`
2. Create a section for each of your data databases and set its connection parameters
3. If you already have a DAG definition file `dot_projects.json`, deploy it into `./airflow/dags`
4. Run steps 1-11 in [Configuring/Building Airflow Docker environment](#configuringbuilding-airflow-docker-environment)
5. Run steps 12 and 13, but use the values for the external databases you configured in `dot_config.yml`

You will need to configure DOT tests and the DAG json file appropriately for your installation.


#### Adding more projects

If configuring Airflow in production, you will need to adjust `./docker/dot/dot_config.yml` accordingly. You can also
7 changes: 6 additions & 1 deletion db/dot/4-upload_sample_dot_data.sql
@@ -65,7 +65,7 @@
$${"table_specific_reported_date": "departure_time", "table_specific_patient_uui
"uuid", "table_specific_period": "day"}$$, '2021-12-23 19:00:00.000 -0500', '2022-03-21 19:00:00.000 -0500', 'Matt');

INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1',
5, 'Number of stops has a reasonible value', '', '', 'all_flight_data', 'custom_sql', '', '',
5, 'Number of stops has a reasonable value', '', '', 'all_flight_data', 'custom_sql', '', '',
format('{%s: %s}',
to_json('query'::text),
to_json($query$
@@ -79,6 +79,11 @@
)::json,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'Lorenzo');

INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '3081f033-e8f4-4f3b-aea8-36f8c5df05dc', 'INCONSISTENT-1',
8, 'Price is a positive number for direct flights', '', '', 'all_flight_data', 'expression_is_true',
'', '', $${"name": "t_direct_flights_positive_price", "expression": "price is not null and price > 0",
"condition": "stops = 'non-stop'"}$$, '2022-12-10 19:00:00.000 -0500', '2022-12-10 19:00:00.000 -0500', 'Lorenzo');

COMMIT;


2 changes: 1 addition & 1 deletion docker/appsmith/DOT App V2.json

Large diffs are not rendered by default.

14 changes: 8 additions & 6 deletions docker/run_demo.py
@@ -10,7 +10,7 @@

url_demo_data = "https://drive.google.com/uc?id=157Iad8mHnwbZ_dAeLQy5XfLihhcpD6yc"
filename_demo_data = "dot_demo_data.tar.gz"
url_dot_ui = "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True"
url_dot_ui = "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True" # pylint: disable=line-too-long

# Check if db, appsmith and tar file are there and if so, delete them.
os.chdir("demo/")
@@ -30,12 +30,12 @@

# Open/Extract tarfile
with tarfile.open(filename_demo_data) as my_tar:
my_tar.extractall('')
my_tar.extractall("")
my_tar.close()

with open("./db/.env") as f:
demo_pwd=f.read().split("=")[1]
os.environ['POSTGRES_PASSWORD'] = demo_pwd
demo_pwd = f.read().split("=")[1]
os.environ["POSTGRES_PASSWORD"] = demo_pwd

# Composing and running container(s)
print("Starting DOT...\n")
@@ -49,8 +49,10 @@

webbrowser.open(url_dot_ui)

print("In case DOT was not opened in your browser, please go to this URL: "
"http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True\n")
print(
"In case DOT was not opened in your browser, please go to this URL: "
"http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True\n"
)
input("Press return to stop DOT container\n")
print("Container is being stopped - we hope you enjoyed this demo :)")
docker.compose.stop()
2 changes: 1 addition & 1 deletion dot/dbt/macros/test_expression_is_true.sql
@@ -1,4 +1,4 @@
-- wrapper around dbt_utils.expression_is_true including the name
{% test expression_is_true(model, expression, column_name=None, condition='1=1', name='do_set_name') %}
{{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, column_name, condition)) }}
{{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, '', condition)) }}
{% endtest %}