[#_data_encryption]
= Data Encryption

An important security service provided by the fabrication process is data encryption. Data encryption is a method
of encoding and obscuring information so that it is unreadable by unauthorized entities.

== Encryption Method

=== AES encryption

AES encryption is a symmetric-key algorithm, which means the same key is used to encrypt and decrypt data.
To encrypt data using this algorithm, first generate a 128-bit (16-character) key and save it in the
`encrypt.properties` file, then follow the example below to encrypt data with the AES algorithm.

.encrypt.properties
[source,properties]
----
encrypt.aes.secret.key.spec=<SOME_AES_ENCRYPTED_VALUE_HERE>
----
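
If you need a quick way to produce a suitable key, the following sketch generates a random 16-character
(128-bit) value. The property name comes from the file above; the generation approach itself is just one
option and is not part of the generated code.

.Python
[source,python]
----
import secrets
import string

# Generate a random 16-character (128-bit) value for use as the AES key.
# Copy the printed line into encrypt.properties.
alphabet = string.ascii_letters + string.digits
key = ''.join(secrets.choice(alphabet) for _ in range(16))
print(f"encrypt.aes.secret.key.spec={key}")
----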

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt;
import com.boozallen.aissemble.data.encryption.SimpleAesEncrypt;

AiopsEncrypt simpleAiopsEncrypt = new SimpleAesEncrypt();

// Encrypt
String encryptedData = simpleAiopsEncrypt.encryptValue("Plain text string data");

// Decrypt
String decryptedData = simpleAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.aes_encrypt import AESEncrypt

aes_encrypt = AESEncrypt()

# Encrypt
encrypted_value = aes_encrypt.encrypt('Plain text string data')

# Decrypt
decrypted_value = aes_encrypt.decrypt(encrypted_value)
----

=== Vault encryption

Another encryption service is provided through HashiCorp Vault, a well-known provider of security services.
The fabrication process generates the client methods for calling Vault and encrypting data. To encrypt data
using Vault, the following properties need to be set:

.encrypt.properties
[source,properties]
----
secrets.host.url=[Some URL]
Example: http://vault:8217

secrets.unseal.keys=[Comma-delimited set of keys]
Example (no quotes or whitespace): key1,key2,key3,key4,key5

secrets.root.key=[The root key]
Example: s.EXAMPLE
----

The keys in the example property file above can be retrieved from the Vault console log; they are printed at
the top of the log when the server starts. For example:

.Vault console log
[source,text]
----
vault | ROOT KEY
vault | s.EXAMPLE
vault | UNSEAL KEYS
vault | ["key1", "key2", "key3", "key4", "key5"]
vault | TRANSIT TOKEN
vault | {"request_id": "29d26f42-7be2-9b06-c4ce-1ecc94114393", "lease_id": "", "renewable": false, "lease_duration": 0, "data": null, "wrap_info": null, "warnings": null, "auth": {"client_token": "s.TOKEN", "accessor": "zFcMdiOHhtXUyRTUigkePpzS", "policies": ["app-aiops", "default"], "token_policies": ["app-aiops", "default"], "metadata": null, "lease_duration": 2764800, "renewable": true, "entity_id": "", "token_type": "service", "orphan": false}}
----

To add Vault encryption to your code, follow the example below. Note that the Vault Docker container needs to
be running for this example to work.

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt;
import com.boozallen.aissemble.data.encryption.VaultEncrypt;

AiopsEncrypt vaultAiopsEncrypt = new VaultEncrypt();

// Encrypt
String encryptedData = vaultAiopsEncrypt.encryptValue("Plain text string data");

// Decrypt
String decryptedVaultData = vaultAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.vault_remote_encryption_strategy import VaultRemoteEncryptionStrategy
from aiops_encrypt.vault_local_encryption_strategy import VaultLocalEncryptionStrategy

vault_client = VaultRemoteEncryptionStrategy()

# Optionally, you can use local Vault encryption (vault_client = VaultLocalEncryptionStrategy()),
# which downloads the encryption key once and performs encryption locally without a round trip to
# the Vault server. This is useful for encrypting large data objects and for high-volume encryption
# tasks.
#
# NOTE: If you are encrypting your data through a user-defined function (UDF) in PySpark, you must
# use VaultLocalEncryptionStrategy. Currently, the remote version causes threading issues. This
# issue will likely be resolved in a future update to the HashiCorp Vault client.

# Encrypt
encrypted_value = vault_client.encrypt('Plain text string data')

# Decrypt
decrypted_value = vault_client.decrypt(encrypted_value)
----
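
As a concrete illustration of the UDF note above, the sketch below encrypts a DataFrame column with the local
strategy. The `ssn` column and sample row are hypothetical; only the `VaultLocalEncryptionStrategy` import and
its `encrypt` method come from the example above.

.Python
[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

from aiops_encrypt.vault_local_encryption_strategy import VaultLocalEncryptionStrategy

spark = SparkSession.builder.getOrCreate()

# Use the local strategy inside UDFs; the remote strategy currently has threading issues there.
vault_client = VaultLocalEncryptionStrategy()
encrypt_udf = udf(lambda value: vault_client.encrypt(value), StringType())

# Hypothetical DataFrame with a sensitive column to encrypt.
df = spark.createDataFrame([("123-45-6789",)], ["ssn"])
encrypted_df = df.withColumn("ssn", encrypt_udf("ssn"))
encrypted_df.show(truncate=False)
----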

== Encryption by policy

The fabrication process generates built-in encryption code that can be activated through a policy file.
When an encryption policy is configured, the pipeline applies encryption to the fields specified in the policy.
The following example illustrates how to encrypt a field named "ssn" during the ingest step.

.example-encrypt-policy.json (can be named anything as long as it's in the correct policy directory)
[source,json]
----
[
    {
        "identifier": "encryptSSN",
        "rules": [
            {
                "className": "EncryptRule",
                "configurations": {
                    "description": "Apply encryption policy"
                }
            }
        ],
        "encryptPhase": "ingest",
        "encryptFields": [
            "ssn"
        ],
        "encryptAlgorithm": "AES_ENCRYPT"
    }
]
----

This file should be placed in a directory which can be specified by the user in the `policy-configuration.properties`
file (see example below).

* `encryptPhase` - The step in the pipeline where encryption takes place. Typically, this will be the first step.
* `encryptFields` - An array of field names that will be encrypted.
* `encryptAlgorithm` - The algorithm that will be used to encrypt the data. Currently, the options are `AES_ENCRYPT`
and `VAULT_ENCRYPT`; more can be added through customization.

.policy-configuration.properties
[source,properties]
----
policies-location=policies
----

This configuration defines the folder in which the encryption policies reside. In the example above, the policies are
in the `policies` directory (relative to the working directory). An absolute path can also be used.
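
Assuming the relative path shown above, the resulting layout might look like the following (file locations are
illustrative):

[source,text]
----
working-directory/
├── policy-configuration.properties
└── policies/
    └── example-encrypt-policy.json
----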

= Data Profiling

WARNING: Data profiling is currently in an incubating state and is undergoing significant changes. Stay
tuned for information on our upcoming improved offerings!

// Machine learning pipelines should not have data profiling defined in the metamodel

= Data Validation

The Data Validation component ensures the quality, feasibility, and accuracy of data before it is used for machine
learning. Validation boosts confidence in machine learning by ensuring data is cleansed, standardized, and ready for
use. By leveraging data validation from the xref:semantic-data.adoc#_semantic_data[semantic data model], consistent
data validation rules are applied throughout the entire project. This page will walk you through how to leverage
data validation.

== What Gets Generated

For each record metamodel, a handful of methods are generated (outlined below) that can be leveraged in the
implementation logic of your pipeline steps to apply data validation. These methods have default logic implemented
in generated base classes but can be customized by overriding them in the corresponding implementation class.

For the following method documentation, assume the record name is `TaxPayer` and the dictionary type is `Ssn`.

=== Java Implementation
****
.TaxPayerSchema.java
[source,java]
----
/**
 * Spark-optimized application of the record validation. This should be the preferred method for validating your
 * dataset in a Spark pipeline. Override this method to customize how data should be validated for the dataset.
 */
public Dataset<Row> validateDataFrame(Dataset<Row> data)
----
_Parameters:_

* `data` – dataset to be filtered

_Return:_ `validData` – dataset with invalid records removed
****

****
.Ssn.java
[source,java]
----
/**
 * Applies the validation in the dictionary metamodel to the specific field.
 * Override this method to customize how the dictionary type should be validated.
 */
public void validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if the field fails to meet the validation rules
****

****
.TaxPayer.java
[source,java]
----
/**
 * Applies the validation described in the record metamodel and applies the dictionary validation to any relevant
 * fields. Override this method to customize how the record should be validated.
 */
public void validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if a field fails to meet the validation rules
****

=== Python Implementation
****
.tax_payer_schema.py
[source,python]
----
# Spark-optimized application of the record validation. This should be the preferred method for validating
# your dataset in a PySpark pipeline. Override this method to customize how data should be validated for
# the dataset.
def validate_dataset(ingest_dataset: DataFrame)
----
_Parameters:_

* `ingest_dataset` – dataset to be filtered

_Return:_ `valid_data` – dataset with invalid records removed
****
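
For instance, an ingest step might apply the generated validation as in the sketch below. The sample data and
the assumption that `validate_dataset` is called on a `TaxPayerSchema` instance imported from your generated
code are illustrative; only the method itself comes from the documentation above.

.Python
[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ingest data; assumes TaxPayerSchema has been imported from the generated code.
ingest_dataset = spark.createDataFrame([("123-45-6789",), ("not-an-ssn",)], ["ssn"])

schema = TaxPayerSchema()
valid_data = schema.validate_dataset(ingest_dataset)  # invalid records are filtered out
----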

****
.ssn.py
[source,python]
----
# Applies the validation in the dictionary metamodel to the specific field.
# Override this method to customize how the dictionary type should be validated.
def validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - if the value does not match any valid format
****

****
.tax_payer.py
[source,python]
----
# Applies the validation described in the record metamodel and applies the dictionary validation to any
# relevant fields. Override this method to customize how the record should be validated.
def validate()
----
_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - if the value does not match any valid format
****
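
As a usage sketch, record-level validation can be wrapped in error handling as shown below. The assumption that
`TaxPayer` accepts an `ssn` constructor argument is illustrative; only `validate()` and the `ValueError` it
raises come from the documentation above.

.Python
[source,python]
----
# Assumes TaxPayer has been imported from the generated code; constructor arguments are illustrative.
tax_payer = TaxPayer(ssn="not-an-ssn")

try:
    tax_payer.validate()
except ValueError as error:
    print(f"Validation failed: {error}")
----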

[#_drift_detection]
= Drift Detection

The conceptual idea of drift is that as deployed artificial intelligence (AI) systems adapt to evolving data streams,
their predictive power may degrade; their inferences may "drift" away from the intended targets. When using semantic
data models, it is possible to ensure that data are consistently monitored for drift, thus maintaining the
performance of AI systems.

Please contact the team for more information.