Cardea class, functional api, and compose #92

Open: wants to merge 16 commits into base: master
5 changes: 5 additions & 0 deletions .gitignore
@@ -68,6 +68,7 @@ docs/cardea.rst
docs/cardea.*.rst
docs/modules.rst
docs/api
+docs/api_reference/api

# PyBuilder
target/
@@ -113,3 +114,7 @@ ENV/

# IntelliJ Idea
.idea/
+
+# output
+data/
+*.csv
2 changes: 1 addition & 1 deletion Makefile
@@ -113,7 +113,7 @@ test: ## run tests quickly with the default Python

.PHONY: test-all
test-all: ## run tests on every Python version with tox
-	tox
+	tox -r

.PHONY: test-readme
test-readme: ## run the readme snippets
69 changes: 37 additions & 32 deletions README.md
@@ -64,30 +64,34 @@ This will pull and install the latest stable release from [PyPi](https://pypi.or

In this short tutorial we will guide you through a series of steps that will help you get Cardea started.

-First, load the core class to work with:
+First, we download the dataset we will be working with. In this example, we load a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments).
+
+We can use a helper function to download the data.

```python3
-from cardea import Cardea
+from cardea.data import download

-cardea = Cardea()
+data_path = download('kaggle')
```

-We then seamlessly plug in our data. Here in this example, we are loading a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments).
-To use this dataset download the data from here then unzip it in the root directory, or run the command:
+Alternatively, you can download the dataset directly from the S3 bucket.

```bash
curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip -d kaggle kaggle.zip
```
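For readers who would rather stay in Python, the `curl` and `unzip` steps above can be approximated with the standard library. This is a minimal sketch, not part of Cardea: the function names `extract_archive` and `fetch_and_extract` are hypothetical, and only the URL comes from the command above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the curl command above.
KAGGLE_ZIP_URL = "https://dai-cardea.s3.amazonaws.com/kaggle.zip"


def extract_archive(zip_path, target_dir):
    """Unzip ``zip_path`` into ``target_dir`` and return the archived names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


def fetch_and_extract(url=KAGGLE_ZIP_URL, zip_path="kaggle.zip", target_dir="kaggle"):
    """Download the archive (like ``curl -O``), then unzip it (like ``unzip -d``)."""
    urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, target_dir)
```

Calling `fetch_and_extract()` with its defaults should leave the data in a local `kaggle` folder, the same layout the shell command produces.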
-To load the data, supply the ``data`` to the loader using the following command:
+
+Then, we create a Cardea instance by passing the ``data_path`` to the initializer and indicating the format of the data.

```python3
-cardea.load_entityset(data='kaggle')
+from cardea import Cardea
+
+cardea = Cardea(data_path=data_path,
+                fhir=True)
```
-> :bulb: To load local data, pass the folder path to ``data``.

-To verify that the data has been loaded, you can find the loaded entityset by viewing ``cardea.es`` which should output the following:
+To verify that the data has been loaded, view the loaded entityset in ``cardea.entityset``, which should output the following:

-```bash
+```
Entityset: kaggle
Entities:
Address [Rows: 81, Columns: 2]
@@ -108,23 +112,25 @@ Entityset: kaggle
Patient.address -> Address.object_id
```

-The output shown represents the entityset data structure where ``cardea.es`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html).
+The output shown represents the entityset data structure where ``cardea.entityset`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html).

-From there, you can select the prediction problem you aim to solve by specifying the name of the class, which in return gives us the ``label_times`` of the problem.
+From there, you can select the prediction problem you aim to solve by specifying the name of the function, which in turn returns the ``label_times`` of the problem.

```python3
-label_times = cardea.select_problem('MissedAppointment')
+from cardea.data_labeling import appointment_no_show
+
+label_times = cardea.label(appointment_no_show, subset=1000) # labeling only a subset of the data
```

For each instance in the dataset, ``label_times`` records (1) its label and (2) the time index that bounds the timespan from which features for that instance may be calculated.

-```bash
-           cutoff_time  instance_id      label
-0  2015-11-10 07:13:56      5030230     noshow
-1  2015-12-03 08:17:28      5122866  fulfilled
-2  2015-12-07 10:40:59      5134197  fulfilled
-3  2015-12-07 10:42:42      5134220     noshow
-4  2015-12-07 10:43:01      5134223     noshow
+```
+   identifier                 time  label
+0     5030230  2015-11-10 07:13:56   True
+1     5122866  2015-12-03 08:17:28  False
+2     5134197  2015-12-07 10:40:59  False
+3     5134220  2015-12-07 10:42:42   True
+4     5134223  2015-12-07 10:43:01   True
```

You can read more about ``label_times`` [here](https://mlbazaar.github.io/Cardea/basic_concepts/machine_learning_tasks.html).
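
To make the role of the time index concrete, here is a small stand-alone sketch of how a cutoff time limits which observations may feed the features of an instance. It does not use Cardea: the event log is invented for illustration, the helper `events_before_cutoff` is hypothetical, a single shared log stands in for per-instance data, and treating the cutoff as inclusive is an assumption.

```python
from datetime import datetime

# Rows shaped like the label_times output above: (identifier, time, label).
label_times = [
    (5030230, datetime(2015, 11, 10, 7, 13, 56), True),
    (5122866, datetime(2015, 12, 3, 8, 17, 28), False),
]

# Hypothetical event log (timestamp, description); not part of the Kaggle data.
events = [
    (datetime(2015, 11, 1, 9, 0, 0), "previous visit"),
    (datetime(2015, 11, 20, 9, 0, 0), "lab result"),
    (datetime(2015, 12, 10, 9, 0, 0), "follow-up call"),
]


def events_before_cutoff(events, cutoff):
    """Keep only events observed at or before the cutoff time (assumed inclusive)."""
    return [description for timestamp, description in events if timestamp <= cutoff]


for identifier, cutoff, label in label_times:
    # Anything after the cutoff would leak future information into the features.
    print(identifier, label, events_before_cutoff(events, cutoff))
```

The first instance may only use the November 1 event, while the second may also use the November 20 one; the December 10 event is excluded for both.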
@@ -134,15 +140,14 @@ Then, you can perform the AutoML steps and take advantage of Cardea.
Cardea extracts features through automated feature engineering once you supply the ``label_times`` pertaining to the problem you aim to solve.

```python3
-feature_matrix = cardea.generate_features(label_times[:1000])
+feature_matrix = cardea.featurize(label_times)
```
-> :warning: Featurizing the data might take a while depending on the size of the data. For demonstration, we only featurize the first 1000 records.
+> :warning: Featurizing the data might take a while depending on the size of the data.

Once we have the features, we can now split the data into training and testing
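
For intuition about what such a split does, the following plain-Python sketch shuffles row indices and holds out a fraction of them for testing. It is illustrative only and is not Cardea's implementation; the function name `holdout_split` and its `seed` parameter are assumptions introduced here.

```python
import random


def holdout_split(X, y, test_size=0.2, shuffle=True, seed=0):
    """Split paired rows and labels into train and test portions."""
    indices = list(range(len(X)))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: ten rows whose label is derived from the row itself.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 == 0 for i in range(10)]
X_train, X_test, y_train, y_test = holdout_split(X, y, test_size=0.2)
print(len(X_train), len(X_test))  # 8 2
```

Because rows and labels are selected through the same shuffled indices, each label stays attached to its row. Cardea exposes its own ``train_test_split`` method for this step.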

```python3
-y = list(feature_matrix.pop('label'))
-
+y = feature_matrix.pop('label').values
X = feature_matrix.values

X_train, X_test, y_train, y_test = cardea.train_test_split(
@@ -152,29 +157,29 @@ X_train, X_test, y_train, y_test = cardea.train_test_split(
Now that our feature matrix is properly divided, we can use it to train our machine learning pipeline: modeling, optimizing hyperparameters, and finding the best model.

```python3
-cardea.select_pipeline('Random Forest')
+cardea.set_pipeline('Random Forest')
cardea.fit(X_train, y_train)
y_pred = cardea.predict(X_test)
```

Finally, you can evaluate the performance of the model
```python3
-cardea.evaluate(X, y, test_size=0.2, shuffle=True)
+cardea.evaluate(X_test, y_test, shuffle=True)
```
which returns the scoring metric depending on the type of problem
-```bash
-{'Accuracy': 0.75,
- 'F1 Macro': 0.5098039215686274,
- 'Precision': 0.5183001719479243,
- 'Recall': 0.5123528436411872}
+```
+Accuracy     0.75
+F1 Macro     0.5098
+Precision    0.5183
+Recall       0.5123
```
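As a rough guide to reading these numbers, the sketch below computes accuracy together with macro-averaged precision, recall, and F1 from scratch. It assumes the reported precision and recall are macro averages, which the output above does not state; the helper `macro_scores` is hypothetical and is not how Cardea computes its metrics.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "Accuracy": accuracy,
        "F1 Macro": sum(f1s) / n,      # unweighted mean over classes
        "Precision": sum(precisions) / n,
        "Recall": sum(recalls) / n,
    }


scores = macro_scores([True, True, False, False], [True, False, False, False])
print(scores["Accuracy"])  # 0.75
```

Macro averaging weights every class equally, so a model that mostly predicts the majority class can reach a high accuracy while its macro scores stay close to 0.5, much like the output shown above.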

# Citation
If you use Cardea for your research, please consider citing the following paper:

Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. [Cardea: An Open Automated Machine Learning Framework for Electronic Health Records](https://arxiv.org/abs/2010.00509). [IEEE DSAA 2020](https://ieeexplore.ieee.org/document/9260104).

-```bash
+```
@inproceedings{alnegheimish2020cardea,
title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records},
author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan},