[#_data_encryption]
= Data Encryption

An important security service provided by the fabrication process is data encryption. Data encryption is a method
of encoding and obscuring information so that it is unreadable by unauthorized entities.

== Encryption Method

=== AES encryption

AES encryption is a symmetric-key algorithm, which means the same key is used to encrypt and decrypt data.
To encrypt data using this algorithm, first generate a 128-bit (16-character) key and save it in the
`encrypt.properties` file, then follow the example below to encrypt data with the AES algorithm.

.encrypt.properties
[source,properties]
----
encrypt.aes.secret.key.spec=<SOME_AES_ENCRYPTED_VALUE_HERE>
----
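
If you need a quick way to produce a suitable key, the following sketch generates a random 16-character
(128-bit) value. The property name comes from the file above; the generation approach itself is just one
option and is not part of the generated code.

.Python
[source,python]
----
import secrets
import string

# Generate a random 16-character (128-bit) value for use as the AES key.
# Copy the printed line into encrypt.properties.
alphabet = string.ascii_letters + string.digits
key = ''.join(secrets.choice(alphabet) for _ in range(16))
print(f"encrypt.aes.secret.key.spec={key}")
----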

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt;
import com.boozallen.aissemble.data.encryption.SimpleAesEncrypt;

AiopsEncrypt simpleAiopsEncrypt = new SimpleAesEncrypt();

// Encrypt
String encryptedData = simpleAiopsEncrypt.encryptValue("Plain text string data");

// Decrypt
String decryptedData = simpleAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.aes_encrypt import AESEncrypt

aes_encrypt = AESEncrypt()

# Encrypt
encrypted_value = aes_encrypt.encrypt('Plain text string data')

# Decrypt
decrypted_value = aes_encrypt.decrypt(encrypted_value)
----

=== Vault encryption

Another encryption service is provided through HashiCorp Vault, a well-known provider of security services.
The fabrication process generates the client methods for calling Vault and encrypting data. To encrypt data
using Vault, the following properties need to be set:

.encrypt.properties
[source,properties]
----
secrets.host.url=[Some URL]
Example: http://vault:8217

secrets.unseal.keys=[Comma-delimited set of keys]
Example (no quotes or whitespace): key1,key2,key3,key4,key5

secrets.root.key=[The root key]
Example: s.EXAMPLE
----

The keys in the example property file above can be retrieved from the Vault console log; they are printed at
the top of the log when the server starts. For example:

.Vault console log
[source,text]
----
vault | ROOT KEY
vault | s.EXAMPLE
vault | UNSEAL KEYS
vault | ["key1", "key2", "key3", "key4", "key5"]
vault | TRANSIT TOKEN
vault | {"request_id": "29d26f42-7be2-9b06-c4ce-1ecc94114393", "lease_id": "", "renewable": false, "lease_duration": 0, "data": null, "wrap_info": null, "warnings": null, "auth": {"client_token": "s.TOKEN", "accessor": "zFcMdiOHhtXUyRTUigkePpzS", "policies": ["app-aiops", "default"], "token_policies": ["app-aiops", "default"], "metadata": null, "lease_duration": 2764800, "renewable": true, "entity_id": "", "token_type": "service", "orphan": false}}
----

To add Vault encryption to your code, follow the example below. Note that the Vault Docker container needs to
be running for this example to work.

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt;
import com.boozallen.aissemble.data.encryption.VaultEncrypt;

AiopsEncrypt vaultAiopsEncrypt = new VaultEncrypt();

// Encrypt
String encryptedData = vaultAiopsEncrypt.encryptValue("Plain text string data");

// Decrypt
String decryptedVaultData = vaultAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.vault_remote_encryption_strategy import VaultRemoteEncryptionStrategy
from aiops_encrypt.vault_local_encryption_strategy import VaultLocalEncryptionStrategy

vault_client = VaultRemoteEncryptionStrategy()

# Optionally, you can use local Vault encryption (vault_client = VaultLocalEncryptionStrategy()),
# which downloads the encryption key once and performs encryption locally without a round trip to
# the Vault server. This is useful for encrypting large data objects and for high-volume encryption
# tasks.
#
# NOTE: If you are encrypting your data through a user-defined function (UDF) in PySpark, you must
# use VaultLocalEncryptionStrategy. Currently, the remote version causes threading issues. This
# issue will likely be resolved in a future update to the HashiCorp Vault client.

# Encrypt
encrypted_value = vault_client.encrypt('Plain text string data')

# Decrypt
decrypted_value = vault_client.decrypt(encrypted_value)
----
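
As a concrete illustration of the UDF note above, the sketch below encrypts a DataFrame column with the local
strategy. The `ssn` column and sample row are hypothetical; only the `VaultLocalEncryptionStrategy` import and
its `encrypt` method come from the example above.

.Python
[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

from aiops_encrypt.vault_local_encryption_strategy import VaultLocalEncryptionStrategy

spark = SparkSession.builder.getOrCreate()

# Use the local strategy inside UDFs; the remote strategy currently has threading issues there.
vault_client = VaultLocalEncryptionStrategy()
encrypt_udf = udf(lambda value: vault_client.encrypt(value), StringType())

# Hypothetical DataFrame with a sensitive column to encrypt.
df = spark.createDataFrame([("123-45-6789",)], ["ssn"])
encrypted_df = df.withColumn("ssn", encrypt_udf("ssn"))
encrypted_df.show(truncate=False)
----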

== Encryption by policy

The fabrication process generates built-in encryption code that can be activated through a policy file.
When an encryption policy is configured, the pipeline applies encryption to the fields specified in the policy.
The following example illustrates how to encrypt a field named "ssn" during the ingest step.

.example-encrypt-policy.json (can be named anything as long as it's in the correct policy directory)
[source,json]
----
[
    {
        "identifier": "encryptSSN",
        "rules": [
            {
                "className": "EncryptRule",
                "configurations": {
                    "description": "Apply encryption policy"
                }
            }
        ],
        "encryptPhase": "ingest",
        "encryptFields": [
            "ssn"
        ],
        "encryptAlgorithm": "AES_ENCRYPT"
    }
]
----

This file should be placed in a directory which can be specified by the user in the `policy-configuration.properties`
file (see example below).

* `encryptPhase` - The step in the pipeline where encryption takes place. Typically, this will be the first step.
* `encryptFields` - An array of field names that will be encrypted.
* `encryptAlgorithm` - The algorithm that will be used to encrypt the data. Currently, the options are `AES_ENCRYPT`
and `VAULT_ENCRYPT`; more can be added through customization.

.policy-configuration.properties
[source,properties]
----
policies-location=policies
----

This configuration defines the folder in which the encryption policies reside. In the example above, the policies are
in the `policies` directory (relative to the working directory). An absolute path can also be used.
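
Assuming the relative path shown above, the resulting layout might look like the following (file locations are
illustrative):

[source,text]
----
working-directory/
├── policy-configuration.properties
└── policies/
    └── example-encrypt-policy.json
----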

= Data Profiling

WARNING: Data profiling is currently in an incubating state and is undergoing significant changes. Stay
tuned for information on our upcoming improved offerings!

// Machine learning pipelines should not have data profiling defined in the metamodel

= Data Validation

The Data Validation component ensures the quality, feasibility, and accuracy of data before it is used for machine
learning. Validation boosts confidence in machine learning by ensuring data is cleansed, standardized, and ready for
use. By leveraging data validation from the xref:semantic-data.adoc#_semantic_data[semantic data model], consistent
data validation rules are applied throughout the entire project. This page will walk you through how to leverage
data validation.

== What Gets Generated

For each record metamodel, a handful of methods are generated (outlined below) that can be leveraged in the
implementation logic of your pipeline steps to apply data validation. These methods have default logic implemented
in generated base classes but can be customized by overriding them in the corresponding implementation class.

For the following method documentation, assume the record name is `TaxPayer` and the dictionary type is `Ssn`.

=== Java Implementation
****
.TaxPayerSchema.java
[source,java]
----
/**
 * Spark-optimized application of the record validation. This should be the preferred method for validating your
 * dataset in a Spark pipeline. Override this method to customize how data should be validated for the dataset.
 */
public Dataset<Row> validateDataFrame(Dataset<Row> data)
----
_Parameters:_

* `data` – dataset to be filtered

_Return:_ `validData` – dataset with invalid records removed
****

****
.Ssn.java
[source,java]
----
/**
 * Applies the validation in the dictionary metamodel to the specific field.
 * Override this method to customize how the dictionary type should be validated.
 */
public void validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if the field fails to meet the validation rules
****

****
.TaxPayer.java
[source,java]
----
/**
 * Applies the validation described in the record metamodel and applies the dictionary validation to any relevant
 * fields. Override this method to customize how the record should be validated.
 */
public void validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if a field fails to meet the validation rules
****

=== Python Implementation
****
.tax_payer_schema.py
[source,python]
----
# Spark-optimized application of the record validation. This should be the preferred method for validating
# your dataset in a PySpark pipeline. Override this method to customize how data should be validated for
# the dataset.
def validate_dataset(ingest_dataset: DataFrame)
----
_Parameters:_

* `ingest_dataset` – dataset to be filtered

_Return:_ `valid_data` – dataset with invalid records removed
****
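
For instance, an ingest step might apply the generated validation as in the sketch below. The sample data and
the assumption that `validate_dataset` is called on a `TaxPayerSchema` instance imported from your generated
code are illustrative; only the method itself comes from the documentation above.

.Python
[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ingest data; assumes TaxPayerSchema has been imported from the generated code.
ingest_dataset = spark.createDataFrame([("123-45-6789",), ("not-an-ssn",)], ["ssn"])

schema = TaxPayerSchema()
valid_data = schema.validate_dataset(ingest_dataset)  # invalid records are filtered out
----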

****
.ssn.py
[source,python]
----
# Applies the validation in the dictionary metamodel to the specific field.
# Override this method to customize how the dictionary type should be validated.
def validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - if the value does not match any valid format
****

****
.tax_payer.py
[source,python]
----
# Applies the validation described in the record metamodel and applies the dictionary validation to any
# relevant fields. Override this method to customize how the record should be validated.
def validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - if the value does not match any valid format
****
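
As a usage sketch, record-level validation can be wrapped in error handling as shown below. The assumption that
`TaxPayer` accepts an `ssn` constructor argument is illustrative; only `validate()` and the `ValueError` it
raises come from the documentation above.

.Python
[source,python]
----
# Assumes TaxPayer has been imported from the generated code; constructor arguments are illustrative.
tax_payer = TaxPayer(ssn="not-an-ssn")

try:
    tax_payer.validate()
except ValueError as error:
    print(f"Validation failed: {error}")
----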

[#_drift_detection]
= Drift Detection

The conceptual idea of drift is that as deployed artificial intelligence (AI) systems adapt to evolving data streams,
their predictive power may degrade; their inferences may "drift" away from the intended targets. When using semantic
data models, it is possible to ensure that data are consistently monitored for drift, thus maintaining the
performance of AI systems.

Please contact the team for more information.