MLBazaar · sarahmish · Dec 2, 2020 · Dec 7, 2020 · Mar 23, 2021 · Mar 26, 2021
diff --git a/.gitignore b/.gitignore
@@ -68,6 +68,7 @@ docs/cardea.rst
 docs/cardea.*.rst
 docs/modules.rst
 docs/api
+docs/api_reference/api
 
 # PyBuilder
 target/
@@ -113,3 +114,7 @@ ENV/
 
 # IntelliJ Idea
 .idea/
+
+# output
+data/
+*.csv
diff --git a/Makefile b/Makefile
@@ -113,7 +113,7 @@ test: ## run tests quickly with the default Python
 
 .PHONY: test-all
 test-all: ## run tests on every Python version with tox
-	tox
+	tox -r
 
 .PHONY: test-readme
 test-readme: ## run the readme snippets

diff --git a/README.md b/README.md
@@ -64,30 +64,34 @@ This will pull and install the latest stable release from [PyPi](https://pypi.or
 
 In this short tutorial we will guide you through a series of steps that will help you get Cardea started.
 
-First, load the core class to work with:
+First, we download the dataset we will be working with. In this example, we load a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments).
+
+We can use a helper function to download the data.
 
 ```python3
-from cardea import Cardea
+from cardea.data import download
 
-cardea = Cardea()
+data_path = download('kaggle')
 ```
 
-We then seamlessly plug in our data. Here in this example, we are loading a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments). 
-To use this dataset download the data from here then unzip it in the root directory, or run the command:
+Alternatively, you can download the dataset directly from the S3 bucket.
 
 ```bash
 curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip -d kaggle kaggle.zip
 ```
-To load the data, supply the ``data`` to the loader using the following command:
+
+Then, we create a Cardea instance by passing ``data_path`` to the initializer and choosing the format of the data.
 
 ```python3
-cardea.load_entityset(data='kaggle')
+from cardea import Cardea
+
+cardea = Cardea(data_path=data_path,
+                fhir=True)
 ```
-> :bulb: To load local data, pass the folder path to ``data``.
 
-To verify that the data has been loaded, you can find the loaded entityset by viewing ``cardea.es`` which should output the following:
+To verify that the data has been loaded, inspect ``cardea.entityset``, which should output the following:
 
-```bash
+```
 Entityset: kaggle
   Entities:
     Address [Rows: 81, Columns: 2]
@@ -108,23 +112,25 @@ Entityset: kaggle
     Patient.address -> Address.object_id
 ```
 
-The output shown represents the entityset data structure where ``cardea.es`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html).
+The output shown represents the entityset data structure where ``cardea.entityset`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html).
 
-From there, you can select the prediction problem you aim to solve by specifying the name of the class, which in return gives us the ``label_times`` of the problem.
+From there, you can select the prediction problem you aim to solve by specifying the name of the labeling function, which in turn returns the ``label_times`` of the problem.
 
 ```python3
-label_times = cardea.select_problem('MissedAppointment')
+from cardea.data_labeling import appointment_no_show
+
+label_times = cardea.label(appointment_no_show, subset=1000)  # labeling only a subset of the data
 ```
 
 ``label_times`` summarizes for each instance in the dataset (1) what is its corresponding label of the instance and (2) what is the time index that indicates the timespan allowed for calculating features that pertain to each instance in the dataset.
 
-```bash
-          cutoff_time     instance_id        label
-0 2015-11-10 07:13:56	      5030230       noshow
-1 2015-12-03 08:17:28	      5122866    fulfilled
-2 2015-12-07 10:40:59	      5134197    fulfilled
-3 2015-12-07 10:42:42	      5134220       noshow
-4 2015-12-07 10:43:01	      5134223       noshow
+```
+   identifier                 time  label
+0     5030230  2015-11-10 07:13:56   True
+1     5122866  2015-12-03 08:17:28  False
+2     5134197  2015-12-07 10:40:59  False
+3     5134220  2015-12-07 10:42:42   True
+4     5134223  2015-12-07 10:43:01   True
 ```
 
 You can read more about ``label_times`` [here](https://mlbazaar.github.io/Cardea/basic_concepts/machine_learning_tasks.html).
@@ -134,15 +140,14 @@ Then, you can perform the AutoML steps and take advantage of Cardea.
 Cardea extracts features through automated feature engineering by supplying the ``label_times`` pertaining to the problem you aim to solve
 
 ```python3
-feature_matrix = cardea.generate_features(label_times[:1000])
+feature_matrix = cardea.featurize(label_times)
 ```
-> :warning: Featurizing the data might take a while depending on the size of the data. For demonstration, we only featurize the first 1000 records.
+> :warning: Featurizing the data might take a while depending on the size of the data.
 
 Once we have the features, we can now split the data into training and testing
 
 ```python3
-y = list(feature_matrix.pop('label'))
-
+y = feature_matrix.pop('label').values
 X = feature_matrix.values
 
 X_train, X_test, y_train, y_test = cardea.train_test_split(
@@ -152,29 +157,29 @@ X_train, X_test, y_train, y_test = cardea.train_test_split(
 Now that we have our feature matrix properly divided, we can use to train our machine learning pipeline, Modeling, optimizing hyperparameters and finding the most optimal model
 
 ```python3
-cardea.select_pipeline('Random Forest')
+cardea.set_pipeline('Random Forest')
 cardea.fit(X_train, y_train)
 y_pred = cardea.predict(X_test)
 ```
 
 Finally, you can evaluate the performance of the model
 ```python3
-cardea.evaluate(X, y, test_size=0.2, shuffle=True)
+cardea.evaluate(X_test, y_test, shuffle=True)
 ```
 which returns the scoring metric depending on the type of problem
-```bash
-{'Accuracy': 0.75, 
- 'F1 Macro': 0.5098039215686274, 
- 'Precision': 0.5183001719479243, 
- 'Recall': 0.5123528436411872}
+```
+Accuracy     0.75
+F1 Macro     0.5098
+Precision    0.5183
+Recall       0.5123
 ```
 
 # Citation
 If you use Cardea for your research, please consider citing the following paper:
 
 Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. [Cardea: An Open Automated Machine Learning Framework for Electronic Health Records](https://arxiv.org/abs/2010.00509). [IEEE DSAA 2020](https://ieeexplore.ieee.org/document/9260104).
 
-```bash
+```
 @inproceedings{alnegheimish2020cardea,
   title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records},
   author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan},