diff --git a/lambda_layer/index.html b/lambda_layer/index.html index baa9a944a..77fca0834 100644 --- a/lambda_layer/index.html +++ b/lambda_layer/index.html @@ -1915,11 +1915,13 @@
us-east-1
us-west-2
Note: If you're using Lambdas in a different region or with a different Python version, just let us know and we'll publish some additional layers.
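Once you have the layer ARN for your region, attaching it to an existing Lambda is a single CLI call (the function name and layer ARN below are placeholders, not published values):
aws lambda update-function-configuration --function-name myLambdaFunction --layers <sageworks-layer-arn>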
diff --git a/search/search_index.json b/search/search_index.json index 3f22db811..07ca75a34 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to SageWorks","text":"The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap. It also dramatically improves both the usability and visibility across the entire spectrum of services: Glue Jobs, Athena, Feature Store, Models, and Endpoints. SageWorks makes it easy to build production ready, AWS powered, machine learning pipelines.
SageWorks Dashboard: AWS Pipelines in a Whole New Light!"},{"location":"#full-aws-overview","title":"Full AWS Overview","text":"Secure your Data, Empower your ML Pipelines
SageWorks is architected as a Private SaaS. This hybrid architecture is the ultimate solution for businesses that prioritize data control and security. SageWorks deploys as an AWS Stack within your own cloud environment, ensuring compliance with stringent corporate and regulatory standards. It offers the flexibility to tailor solutions to your specific business needs through our comprehensive plugin support, covering both individual components and full web interfaces. By using SageWorks, you maintain absolute control over your data while benefiting from the power, security, and scalability of AWS cloud services. SageWorks Private SaaS Architecture
"},{"location":"#dashboard-and-api","title":"Dashboard and API","text":"The SageWorks package has two main components, a Web Interface that provides visibility into AWS ML PIpelines and a Python API that makes creation and usage of the AWS ML Services easier than using/learning the services directly.
"},{"location":"#web-interfaces","title":"Web Interfaces","text":"The SageWorks Dashboard has a set of web interfaces that give visibility into the AWS Glue and SageMaker Services. There are currently 5 web interfaces available:
SageWorks API Documentation: SageWorks API Classes
The main functionality of the Python API is to encapsulate and manage a set of AWS services underneath a Python Object interface. The Python Classes are used to create and interact with Machine Learning Pipeline Artifacts.
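As a quick sketch of that object interface (the S3 path and target column below are placeholder assumptions), a small pipeline takes just a few lines:
from sageworks.api.data_source import DataSource\nfrom sageworks.api.model import ModelType\n\n# DataSource --> FeatureSet --> Model, all through the Python API\nds = DataSource(\"s3://my-bucket/my_data.csv\")\nfs = ds.to_features()\nmy_model = fs.to_model(model_type=ModelType.REGRESSOR, target_column=\"my_target\")\nprint(my_model.details())\n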
"},{"location":"#getting-started","title":"Getting Started","text":"SageWorks will need some initial setup when you first start using it. See our Getting Started guide on how to connect SageWorks to your AWS Account.
"},{"location":"#additional-resources","title":"Additional Resources","text":"Notes and information on how to do the Docker Builds and Push to AWS ECR.
"},{"location":"admin/base_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"vi Dockerfile\n\n# Install latest Sageworks\nRUN pip install --no-cache-dir 'sageworks[ml-tool,chem]'==0.7.0\n
"},{"location":"admin/base_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Dockers 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_base:v0_7_0_amd64 --platform linux/amd64 .\n
"},{"location":"admin/base_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"You have a docker_local_base
alias in your ~/.zshrc
:)
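If you don't have that alias handy, it's roughly equivalent to running the freshly built image interactively; a sketch (the image tag and shell are assumptions, check your ~/.zshrc for the real alias):
docker run -it --rm sageworks_base:v0_7_0_amd64 /bin/bash\n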
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/base_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
"},{"location":"admin/base_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
"},{"location":"admin/base_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:stable\n
"},{"location":"admin/base_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_base
alias in your ~/.zshrc
:)
Notes and information on how to do the Dashboard Docker Builds and Push to AWS ECR.
"},{"location":"admin/dashboard_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"cd applications/aws_dashboard\nvi Dockerfile\n\n# Install Sageworks (changes often)\nRUN pip install --no-cache-dir sageworks==0.4.13 <-- change this\n
"},{"location":"admin/dashboard_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Dockers 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_dashboard:v0_4_13_amd64 --platform linux/amd64 .\n
Docker with Custom Plugins: If you're using custom plugins, you may want to change the SAGEWORKS_PLUGINS directory to something like /app/sageworks_plugins
and then have the Dockerfile copy your plugins into that directory in the Docker image (a sketch follows below).
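A rough sketch of what those Dockerfile lines might look like (the local plugins path and the use of an ENV variable are assumptions here, adjust them to match your SageWorks config):
# Copy your custom plugins into the image\nCOPY sageworks_plugins /app/sageworks_plugins\n\n# Point SageWorks at the plugins directory\nENV SAGEWORKS_PLUGINS=/app/sageworks_plugins\n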
You have a docker_local_dashboard
alias in your ~/.zshrc
:)
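Again, if the alias isn't set up, it's roughly a docker run that exposes the Dashboard's web port; a sketch (the tag and port mapping are assumptions):
docker run -it --rm -p 8000:8000 sageworks_dashboard:v0_4_13_amd64\n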
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/dashboard_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
"},{"location":"admin/dashboard_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
"},{"location":"admin/dashboard_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_5_4_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
"},{"location":"admin/dashboard_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_dashboard
alias in your ~/.zshrc
:)
Notes and information on how to do the PyPI release for the SageWorks project. For full details on packaging you can reference this page: Packaging
The following instructions should work, but things change :)
"},{"location":"admin/pypi_release/#package-requirements","title":"Package Requirements","text":"The easiest thing to do is setup a \\~/.pypirc file with the following contents
[distutils]\nindex-servers =\n pypi\n testpypi\n\n[pypi]\nusername = __token__\npassword = pypi-AgEIcH...\n\n[testpypi]\nusername = __token__\npassword = pypi-AgENdG...\n
"},{"location":"admin/pypi_release/#tox-background","title":"Tox Background","text":"Tox will install the SageMaker Sandbox package into a blank virtualenv and then execute all the tests against the newly installed package. So if everything goes okay, you know the pypi package installed fine and the tests (which puls from the installed sageworks
package) also ran okay.
$ cd sageworks\n$ tox \n
If ALL the tests above pass...
"},{"location":"admin/pypi_release/#clean-any-previous-distribution-files","title":"Clean any previous distribution files","text":"make clean\n
"},{"location":"admin/pypi_release/#tag-the-new-version","title":"Tag the New Version","text":"git tag v0.1.8 (or whatever)\ngit push --tags\n
"},{"location":"admin/pypi_release/#create-the-test-pypi-release","title":"Create the TEST PyPI Release","text":"python -m build\ntwine upload dist/* -r testpypi\n
"},{"location":"admin/pypi_release/#install-the-test-pypi-release","title":"Install the TEST PyPI Release","text":"pip install --index-url https://test.pypi.org/simple sageworks\n
"},{"location":"admin/pypi_release/#create-the-real-pypi-release","title":"Create the REAL PyPI Release","text":"twine upload dist/* -r pypi\n
"},{"location":"admin/pypi_release/#push-any-possible-changes-to-github","title":"Push any possible changes to Github","text":"git push\n
"},{"location":"admin/sageworks_docker_for_lambdas/","title":"SageWorks Docker Image for Lambdas","text":"Using the SageWorks Docker Image for AWS Lambda Jobs allows your Lambda Jobs to use and create AWS ML Pipeline Artifacts with SageWorks.
AWS, for some reason, does not allow Public ECRs to be used for Lambda Docker images. So you'll have to copy the Docker image into your private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#creating-a-private-ecr","title":"Creating a Private ECR","text":"You only need to do this if you don't already have a private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console-to-create-private-ecr","title":"AWS Console to create Private ECR","text":"sageworks_base
.Create the ECR repository using the AWS CLI:
aws ecr create-repository --repository-name sageworks_base --region <region>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#pulling-docker-image-into-private-ecr","title":"Pulling Docker Image into Private ECR","text":"Note: You'll only need to do this when you want to update the SageWorks Docker image
Pull the SageWorks Public ECR Image
docker pull public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
Tag the image for your private ECR
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:latest \\\n<your-account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:latest\n
Push the image to your private ECR
aws ecr get-login-password --region <region> --profile <profile> | \\\ndocker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com\n\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#using-the-docker-image-for-your-lambdas","title":"Using the Docker Image for your Lambdas","text":"Okay, now that you have the SageWorks Docker image in your private ECR, here's how you use that image for your Lambda jobs.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console","title":"AWS Console","text":"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>
.Create the Lambda function using the AWS CLI:
aws lambda create-function \\\n --region <region> \\\n --function-name myLambdaFunction \\\n --package-type Image \\\n --code ImageUri=<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag> \\\n --role arn:aws:iam::<account-id>:role/<execution-role>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#python-cdk","title":"Python CDK","text":"Define the Lambda function in your CDK app:
from aws_cdk import (\n aws_lambda as _lambda,\n core\n)\n\nclass MyLambdaStack(core.Stack):\n def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:\n super().__init__(scope, id, **kwargs)\n\n _lambda.Function(self, \"MyLambdaFunction\",\n code=_lambda.Code.from_ecr_image(\"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\"),\n handler=_lambda.Handler.FROM_IMAGE,\n runtime=_lambda.Runtime.FROM_IMAGE,\n role=iam.Role.from_role_arn(self, \"LambdaRole\", \"arn:aws:iam::<account-id>:role/<execution-role>\"))\n\napp = core.App()\nMyLambdaStack(app, \"MyLambdaStack\")\napp.synth()\n
"},{"location":"admin/sageworks_docker_for_lambdas/#cloudformation","title":"Cloudformation","text":"Define the Lambda function in your CloudFormation template.
Resources:\n MyLambdaFunction:\n Type: AWS::Lambda::Function\n Properties:\n Code:\n ImageUri: <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n Role: arn:aws:iam::<account-id>:role/<execution-role>\n PackageType: Image\n
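Whichever deployment route you choose, the handler code running inside the container can use the SageWorks API directly; a minimal sketch, assuming a DataSource named 'abalone_data' already exists in your account:
from sageworks.api.data_source import DataSource\n\ndef handler(event, context):\n    # Query an existing SageWorks DataSource from inside the Lambda\n    ds = DataSource(\"abalone_data\")\n    df = ds.query(\"select * from abalone_data limit 10\")\n    return {\"rows\": len(df)}\n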
"},{"location":"api_classes/data_source/","title":"DataSource","text":"DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. DataSource can read data from many different sources: S3 data, local files, and Pandas DataFrames.
DataSource: Manages AWS Data Catalog creation and management. DataSources are set up so that they can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). DataSources can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource","title":"DataSource
","text":" Bases: AthenaSource
DataSource: SageWorks DataSource API Class
Common Usagemy_data = DataSource(name_of_source)\nmy_data.details()\nmy_features = my_data.to_features()\n
Source code in src/sageworks/api/data_source.py
class DataSource(AthenaSource):\n \"\"\"DataSource: SageWorks DataSource API Class\n\n Common Usage:\n ```\n my_data = DataSource(name_of_source)\n my_data.details()\n my_features = my_data.to_features()\n ```\n \"\"\"\n\n def __init__(self, source, name: str = None, tags: list = None):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (str): The source of the data. This can be an S3 bucket, file path,\n DataFrame object, or an existing DataSource object.\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Make sure we have a name for when we use a DataFrame source\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Ensure the ds_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the model_name wasn't given generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name)\n\n def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().get_table_name()\n query = f\"SELECT * FROM {table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_features(\n self,\n name: str = None,\n tags: list = None,\n target_column: str = None,\n id_column: str = None,\n event_time_column: str = None,\n auto_one_hot: bool = False,\n ) -> FeatureSet:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase). If not specified, a name will be generated\n tags (list): Set the tags for the feature set. If not specified tags will be generated.\n target_column (str): Set the target column for the feature set. (Optional)\n id_column (str): Set the id column for the feature set. If not specified will be generated.\n event_time_column (str): Set the event time for the feature set. 
If not specified will be generated.\n auto_one_hot (bool): Automatically one-hot encode categorical fields (default: False)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the feature_set_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_data\", \"\") + \"_features\"\n name = Artifact.generate_valid_name(name)\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n target_column=target_column,\n id_column=id_column,\n event_time_column=event_time_column,\n auto_one_hot=auto_one_hot,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n\n def _load_source(self, source: str, name: str, tags: list):\n \"\"\"Load the source of the data\"\"\"\n self.log.info(f\"Loading source: {source}...\")\n\n # Pandas DataFrame Source\n if isinstance(source, pd.DataFrame):\n my_loader = PandasToData(name)\n my_loader.set_input(source)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # S3 Source\n source = source if isinstance(source, str) else str(source)\n if source.startswith(\"s3://\"):\n my_loader = S3ToDataSourceLight(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # File Source\n elif os.path.isfile(source):\n my_loader = CSVToDataSource(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.__init__","title":"__init__(source, name=None, tags=None)
","text":"Initializes a new DataSource object.
Parameters:
Name Type Description Defaultsource
str
The source of the data. This can be an S3 bucket, file path, DataFrame object, or an existing DataSource object.
requiredname
str
The name of the data source (must be lowercase). If not specified, a name will be generated
None
tags
list[str]
A list of tags associated with the data source. If not specified tags will be generated.
None
Source code in src/sageworks/api/data_source.py
def __init__(self, source, name: str = None, tags: list = None):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (str): The source of the data. This can be an S3 bucket, file path,\n DataFrame object, or an existing DataSource object.\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Make sure we have a name for when we use a DataFrame source\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Ensure the ds_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the model_name wasn't given generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.details","title":"details(**kwargs)
","text":"DataSource Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the DataSource
Source code insrc/sageworks/api/data_source.py
def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this DataSource
Parameters:
Name Type Description Defaultinclude_aws_columns
bool
Include the AWS columns in the DataFrame (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of ALL the data from this DataSource
NoteObviously this is not recommended for large datasets :)
Source code insrc/sageworks/api/data_source.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().get_table_name()\n query = f\"SELECT * FROM {table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the DataSource
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/api/data_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.to_features","title":"to_features(name=None, tags=None, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False)
","text":"Convert the DataSource to a FeatureSet
Parameters:
Name Type Description Defaultname
str
Set the name for feature set (must be lowercase). If not specified, a name will be generated
None
tags
list
Set the tags for the feature set. If not specified tags will be generated.
None
target_column
str
Set the target column for the feature set. (Optional)
None
id_column
str
Set the id column for the feature set. If not specified will be generated.
None
event_time_column
str
Set the event time for the feature set. If not specified will be generated.
None
auto_one_hot
bool
Automatically one-hot encode categorical fields (default: False)
False
Returns:
Name Type DescriptionFeatureSet
FeatureSet
The FeatureSet created from the DataSource
Source code insrc/sageworks/api/data_source.py
def to_features(\n self,\n name: str = None,\n tags: list = None,\n target_column: str = None,\n id_column: str = None,\n event_time_column: str = None,\n auto_one_hot: bool = False,\n) -> FeatureSet:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase). If not specified, a name will be generated\n tags (list): Set the tags for the feature set. If not specified tags will be generated.\n target_column (str): Set the target column for the feature set. (Optional)\n id_column (str): Set the id column for the feature set. If not specified will be generated.\n event_time_column (str): Set the event time for the feature set. If not specified will be generated.\n auto_one_hot (bool): Automatically one-hot encode categorical fields (default: False)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the feature_set_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_data\", \"\") + \"_features\"\n name = Artifact.generate_valid_name(name)\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n target_column=target_column,\n id_column=id_column,\n event_time_column=event_time_column,\n auto_one_hot=auto_one_hot,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n
"},{"location":"api_classes/data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a DataSource from an S3 Path or File Path
datasource_from_s3.pyfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from an S3 Path (or a local file)\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\n# source_path = \"/full/path/to/local/file.csv\"\n\nmy_data = DataSource(source_path)\nprint(my_data.details())\n
Create a DataSource from a Pandas DataFrame
datasource_from_df.pyfrom sageworks.utils.test_data_generator import TestDataGenerator\nfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from a Pandas DataFrame\ngen_data = TestDataGenerator()\ndf = gen_data.person_data()\n\ntest_data = DataSource(df, name=\"test_data\")\nprint(test_data.details())\n
Query a DataSource
All SageWorks DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
datasource_query.pyfrom sageworks.api.data_source import DataSource\n\n# Grab a DataSource\nmy_data = DataSource(\"abalone_data\")\n\n# Make some queries using the Athena backend\ndf = my_data.query(\"select * from abalone_data where height > .3\")\nprint(df.head())\n\ndf = my_data.query(\"select * from abalone_data where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a FeatureSet from a DataSource
datasource_to_featureset.pyfrom sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features()\nprint(my_features.details())\n
"},{"location":"api_classes/data_source/#sageworks-ui","title":"SageWorks UI","text":"Whenever a DataSource is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: DataSourcesNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/endpoint/","title":"Endpoint","text":"Endpoint Examples
Examples of using the Endpoint class are listed at the bottom of this page Examples.
Endpoint: Manages AWS Endpoint creation and deployment. Endpoints are automatically set up and provisioned for deployment into AWS. Endpoints can be viewed in the AWS SageMaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint","title":"Endpoint
","text":" Bases: EndpointCore
Endpoint: SageWorks Endpoint API Class
Common Usagemy_endpoint = Endpoint(name)\nmy_endpoint.details()\nmy_endpoint.inference(eval_df)\n
Source code in src/sageworks/api/endpoint.py
class Endpoint(EndpointCore):\n \"\"\"Endpoint: SageWorks Endpoint API Class\n\n Common Usage:\n ```\n my_endpoint = Endpoint(name)\n my_endpoint.details()\n my_endpoint.inference(eval_df)\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the Endpoint using the FeatureSet evaluation data
Parameters:
Name Type Description Defaultcapture
bool
Capture the inference results
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: The DataFrame with predictions
Source code insrc/sageworks/api/endpoint.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.details","title":"details(**kwargs)
","text":"Endpoint Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Endpoint
Source code insrc/sageworks/api/endpoint.py
def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
Name Type Description Defaulteval_df
DataFrame
The DataFrame to run predictions on
requiredcapture_uuid
str
The UUID of the capture to use (default: None)
None
id_column
str
The name of the column to use as the ID (default: None)
None
Returns:
Type DescriptionDataFrame
pd.DataFrame: The DataFrame with predictions
Source code insrc/sageworks/api/endpoint.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n
"},{"location":"api_classes/endpoint/#examples","title":"Examples","text":"Run Inference on an Endpoint
endpoint_inference.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model\nfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks has full ML Pipeline provenance, so we can backtrack the inputs,\n# get a DataFrame of data (not used for training) and run inference\nmodel = Model(endpoint.get_input())\nfs = FeatureSet(model.get_input())\nathena_table = fs.get_training_view_table()\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = 0\")\n\n# Run inference/predictions on the Endpoint\nresults_df = endpoint.inference(df)\n\n# Run inference/predictions and capture the results\nresults_df = endpoint.inference(df, capture=True)\n\n# Run inference/predictions using the FeatureSet evaluation data\nresults_df = endpoint.auto_inference(capture=True)\n
Output
Processing...\n class_number_of_rings prediction\n0 13 11.477922\n1 12 12.316887\n2 8 7.612847\n3 8 9.663341\n4 9 9.075263\n.. ... ...\n839 8 8.069856\n840 15 14.915502\n841 11 10.977605\n842 10 10.173433\n843 7 7.297976\n
Endpoint Details The details() method
The detail()
method on the Endpoint class provides a lot of useful information. All of the SageWorks classes have a details()
method try it out!
from sageworks.api.endpoint import Endpoint\nfrom pprint import pprint\n\n# Get Endpoint and print out it's details\nendpoint = Endpoint(\"abalone-regression-end\")\npprint(endpoint.details())\n
Output
{\n 'input': 'abalone-regression',\n 'instance': 'Serverless (2GB/5)',\n 'model_metrics': metric_name value\n 0 RMSE 2.190\n 1 MAE 1.544\n 2 R2 0.504,\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'modified': datetime.datetime(2023, 12, 29, 17, 48, 35, 115000, tzinfo=datetime.timezone.utc),\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n 'sageworks_tags': ['abalone', 'regression'],\n 'status': 'InService',\n 'uuid': 'abalone-regression-end',\n 'variant': 'AllTraffic'}\n
Endpoint Metrics
endpoint_metrics.pyfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks tracks both Model performance and Endpoint Metrics\nmodel_metrics = endpoint.details()[\"model_metrics\"]\nendpoint_metrics = endpoint.endpoint_metrics()\nprint(model_metrics)\nprint(endpoint_metrics)\n
Output
metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n Invocations ModelLatency OverheadLatency ModelSetupTime Invocation5XXErrors\n29 0.0 0.00 0.00 0.00 0.0\n30 1.0 1.11 23.73 23.34 0.0\n31 0.0 0.00 0.00 0.00 0.0\n48 0.0 0.00 0.00 0.00 0.0\n49 5.0 0.45 9.64 23.57 0.0\n50 2.0 0.57 0.08 0.00 0.0\n51 0.0 0.00 0.00 0.00 0.0\n60 4.0 0.33 5.80 22.65 0.0\n61 1.0 1.11 23.35 23.10 0.0\n62 0.0 0.00 0.00 0.00 0.0\n...\n
"},{"location":"api_classes/endpoint/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint. The Endpoint artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI. SageWorks will monitor the endpoint, plot invocations, latencies, and tracks error metrics.
SageWorks Dashboard: EndpointsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/feature_set/","title":"FeatureSet","text":"FeatureSet Examples
Examples of using the FeatureSet Class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually but the SageWorks FeatureSet makes it a breeze!
FeatureSet: Manages AWS Feature Store/Group creation and management. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.) FeatureSets can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet","title":"FeatureSet
","text":" Bases: FeatureSetCore
FeatureSet: SageWorks FeatureSet API Class
Common Usagemy_features = FeatureSet(name)\nmy_features.details()\nmy_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n)\n
Source code in src/sageworks/api/feature_set.py
class FeatureSet(FeatureSetCore):\n \"\"\"FeatureSet: SageWorks FeatureSet API Class\n\n Common Usage:\n ```\n my_features = FeatureSet(name)\n my_features.details()\n my_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n )\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n ) -> Model:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.details","title":"details(**kwargs)
","text":"FeatureSet Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the FeatureSet
Source code insrc/sageworks/api/feature_set.py
def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this FeatureSet
Parameters:
Name Type Description Defaultinclude_aws_columns
bool
Include the AWS columns in the DataFrame (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of ALL the data from this FeatureSet
NoteObviously this is not recommended for large datasets :)
Source code insrc/sageworks/api/feature_set.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.query","title":"query(query, **kwargs)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the FeatureSet
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/api/feature_set.py
def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.to_model","title":"to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)
","text":"Create a Model from the FeatureSet
Args:
model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\nmodel_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\nname (str): Set the name for the model. If not specified, a name will be generated\ntags (list): Set the tags for the model. If not specified tags will be generated.\ndescription (str): Set the description for the model. If not specified a description is generated.\nfeature_list (list): Set the feature list for the model. If not specified a feature list is generated.\ntarget_column (str): The target column for the model (use None for unsupervised model)\n
Returns:
Name Type DescriptionModel
Model
The Model created from the FeatureSet
Source code insrc/sageworks/api/feature_set.py
def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n) -> Model:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a FeatureSet from a Datasource
datasource_to_featureset.pyfrom sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features()\nprint(my_features.details())\n
FeatureSet EDA Statistics
featureset_eda.py
from sageworks.api.feature_set import FeatureSet\nimport pandas as pd\n\n# Grab a FeatureSet and pull some of the EDA Stats\nmy_features = FeatureSet('test_features')\n\n# Grab some of the EDA Stats\ncorr_data = my_features.correlations()\ncorr_df = pd.DataFrame(corr_data)\nprint(corr_df)\n\n# Get some outliers\noutliers = my_features.outliers()\npprint(outliers.head())\n\n# Full set of EDA Stats\neda_stats = my_features.column_stats()\npprint(eda_stats)\n
Output age food_pizza food_steak food_sushi food_tacos height id iq_score\nage NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513\nfood_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378\nfood_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477\nfood_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033\nfood_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364\nheight 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210\nid -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020\niq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN\n\n name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group\n0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low\n1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low\n2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low\n3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high\n\n<lots of EDA data and statistics>\n
Query a FeatureSet
All SageWorks FeatureSet have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.
featureset_query.pyfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Make some queries using the Athena backend\ndf = my_features.query(\"select * from abalone_features where height > .3\")\nprint(df.head())\n\ndf = my_features.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a Model from a FeatureSet
featureset_to_model.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet('test_features')\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR, \n# UNSUPERVISED, or TRANSFORMER\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
"},{"location":"api_classes/feature_set/#sageworks-ui","title":"SageWorks UI","text":"Whenever a FeatureSet is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: FeatureSetsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/meta/","title":"Meta","text":"Meta Examples
Examples of using the Meta class are listed at the bottom of this page Examples.
Meta: A class that provides high-level information and summaries of SageWorks/AWS Artifacts. The Meta class provides 'meta' information: what account are we in, what is the current configuration, etc. It also provides metadata for AWS Artifacts, such as Data Sources, Feature Sets, Models, and Endpoints.
Refresh
Setting refresh
to True
will lead to substantial performance issues, so don't do it :).
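For a quick feel of what Meta returns, the summary methods hand back Pandas DataFrames; a short sketch using a few of the methods defined in the class below:
from sageworks.api.meta import Meta\n\n# Print high-level summaries of the SageWorks/AWS artifacts\nmeta = Meta()\nprint(meta.data_sources())\nprint(meta.feature_sets())\nprint(meta.models())\nprint(meta.endpoints())\n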
Meta
","text":"Meta: A class that provides Metadata for a broad set of AWS Artifacts
Common Usage:
meta = Meta()\nmeta.account()\nmeta.config()\nmeta.data_sources()\n
Source code in src/sageworks/api/meta.py
class Meta:\n \"\"\"Meta: A class that provides Metadata for a broad set of AWS Artifacts\n\n Common Usage:\n ```\n meta = Meta()\n meta.account()\n meta.config()\n meta.data_sources()\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Meta Initialization\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Account and Service Brokers\n self.aws_account_clamp = AWSAccountClamp()\n self.aws_broker = AWSServiceBroker()\n self.cm = ConfigManager()\n\n # Pipeline Manager\n self.pipeline_manager = PipelineManager()\n\n def account(self) -> dict:\n \"\"\"Print out the AWS Account Info\n\n Returns:\n dict: The AWS Account Info\n \"\"\"\n return self.aws_account_clamp.get_aws_account_info()\n\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return self.cm.get_all_config()\n\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming-data S3 Bucket\n\n Returns:\n pd.DataFrame: A summary of the data in the incoming-data S3 Bucket\n \"\"\"\n data = self.incoming_data_deep()\n data_summary = []\n for name, info in data.items():\n # Get the name and the size of the S3 Storage Object(s)\n name = \"/\".join(name.split(\"/\")[-2:]).replace(\"incoming-data/\", \"\")\n info[\"Name\"] = name\n size = info.get(\"ContentLength\") / 1_000_000\n summary = {\n \"Name\": name,\n \"Size(MB)\": f\"{size:.2f}\",\n \"Modified\": datetime_string(info.get(\"LastModified\", \"-\")),\n \"ContentType\": str(info.get(\"ContentType\", \"-\")),\n \"ServerSideEncryption\": info.get(\"ServerSideEncryption\", \"-\"),\n \"Tags\": str(info.get(\"tags\", \"-\")),\n \"_aws_url\": aws_url(info, \"S3\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def incoming_data_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Incoming Data in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Incoming Data in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.INCOMING_DATA_S3, force_refresh=refresh)\n\n def glue_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about AWS Glue Jobs\"\"\"\n glue_meta = self.glue_jobs_deep()\n glue_summary = []\n\n # Get the information about each Glue Job\n for name, info in glue_meta.items():\n summary = {\n \"Name\": info[\"Name\"],\n \"GlueVersion\": info[\"GlueVersion\"],\n \"Workers\": info.get(\"NumberOfWorkers\", \"-\"),\n \"WorkerType\": info.get(\"WorkerType\", \"-\"),\n \"Modified\": datetime_string(info.get(\"LastModifiedOn\")),\n \"LastRun\": datetime_string(info[\"sageworks_meta\"][\"last_run\"]),\n \"Status\": info[\"sageworks_meta\"][\"status\"],\n \"_aws_url\": aws_url(info, \"GlueJob\", self.aws_account_clamp), # Hidden Column\n }\n glue_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(glue_summary)\n\n def glue_jobs_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Glue Jobs in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n dict: A summary of the Glue Jobs in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.GLUE_JOBS, force_refresh=refresh)\n\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources in AWS\n\n Returns:\n pd.DataFrame: A summary of the Data Sources in AWS\n \"\"\"\n data = self.data_sources_deep()\n data_summary = []\n\n # Pull in various bits of metadata for each data source\n for name, info in data.items():\n summary = {\n \"Name\": name,\n \"Modified\": datetime_string(info.get(\"UpdateTime\")),\n \"Num Columns\": num_columns_ds(info),\n \"Tags\": info.get(\"Parameters\", {}).get(\"sageworks_tags\", \"-\"),\n \"Input\": str(\n info.get(\"Parameters\", {}).get(\"sageworks_input\", \"-\"),\n ),\n \"_aws_url\": aws_url(info, \"DataSource\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def data_source_details(\n self, data_source_name: str, database: str = \"sageworks\", refresh: bool = False\n ) -> Union[dict, None]:\n \"\"\"Get detailed information about a specific data source in AWS\n\n Args:\n data_source_name (str): The name of the data source\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: Detailed information about the data source (or None if not found)\n \"\"\"\n data = self.data_sources_deep(database=database, refresh=refresh)\n return data.get(data_source_name)\n\n def data_sources_deep(self, database: str = \"sageworks\", refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Data Sources in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Data Sources in AWS\n \"\"\"\n data = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=refresh)\n\n # Data Sources are in two databases, 'sageworks' and 'sagemaker_featurestore'\n data = data[database]\n\n # Return the data\n return data\n\n def feature_sets(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets in AWS\n \"\"\"\n data = self.feature_sets_deep(refresh)\n data_summary = []\n\n # Pull in various bits of metadata for each feature set\n for name, group_info in data.items():\n sageworks_meta = group_info.get(\"sageworks_meta\", {})\n summary = {\n \"Feature Group\": group_info[\"FeatureGroupName\"],\n \"Created\": datetime_string(group_info.get(\"CreationTime\")),\n \"Num Columns\": num_columns_fs(group_info),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Online\": str(group_info.get(\"OnlineStoreConfig\", {}).get(\"EnableOnlineStore\", \"False\")),\n \"_aws_url\": aws_url(group_info, \"FeatureSet\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def feature_set_details(self, feature_set_name: str) -> dict:\n \"\"\"Get detailed information about a specific feature set in AWS\n\n Args:\n feature_set_name (str): The name of the feature set\n\n Returns:\n dict: Detailed information about the feature set\n \"\"\"\n data = self.feature_sets_deep()\n return data.get(feature_set_name, {})\n\n def feature_sets_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Feature Sets in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=refresh)\n\n def models(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models in AWS\n \"\"\"\n model_data = self.models_deep(refresh)\n model_summary = []\n for model_group_name, model_list in model_data.items():\n\n # Get Summary information for the 'latest' model in the model_list\n latest_model = model_list[0]\n sageworks_meta = latest_model.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the model is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Model Group\": latest_model[\"ModelPackageGroupName\"],\n \"Health\": health_tags,\n \"Owner\": sageworks_meta.get(\"sageworks_owner\", \"-\"),\n \"Model Type\": sageworks_meta.get(\"sageworks_model_type\"),\n \"Created\": datetime_string(latest_model.get(\"CreationTime\")),\n \"Ver\": latest_model[\"ModelPackageVersion\"],\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": latest_model[\"ModelPackageStatus\"],\n \"Description\": latest_model.get(\"ModelPackageDescription\", \"-\"),\n }\n model_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(model_summary)\n\n def model_details(self, model_group_name: str) -> dict:\n \"\"\"Get detailed information about a specific model group in AWS\n\n Args:\n model_group_name (str): The name of the model group\n\n Returns:\n dict: Detailed information about the model group\n \"\"\"\n data = self.models_deep()\n return data.get(model_group_name, {})\n\n def models_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Models in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=refresh)\n\n def endpoints(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in AWS\n \"\"\"\n data = self.endpoints_deep(refresh)\n data_summary = []\n\n # Get Summary information for each endpoint\n for endpoint, endpoint_info in data.items():\n # Get the SageWorks metadata for this Endpoint\n sageworks_meta = endpoint_info.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the endpoint is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Name\": endpoint_info[\"EndpointName\"],\n \"Health\": health_tags,\n \"Instance\": endpoint_info.get(\"InstanceType\", \"-\"),\n \"Created\": datetime_string(endpoint_info.get(\"CreationTime\")),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": endpoint_info[\"EndpointStatus\"],\n \"Variant\": endpoint_info.get(\"ProductionVariants\", [{}])[0].get(\"VariantName\", \"-\"),\n \"Capture\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"EnableCapture\", \"False\")),\n \"Samp(%)\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"CurrentSamplingPercentage\", \"-\")),\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def endpoints_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Endpoints in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=refresh)\n\n def pipelines(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the SageWorks Pipelines\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the SageWorks Pipelines\n \"\"\"\n data = self.pipeline_manager.list_pipelines()\n\n # Return the pipelines summary as a DataFrame\n return pd.DataFrame(data)\n\n def _remove_sageworks_meta(self, data: dict) -> dict:\n \"\"\"Internal: Recursively remove any keys with 'sageworks_' in them\"\"\"\n\n # Recursively exclude any keys with 'sageworks_' in them\n summary_data = {}\n for key, value in data.items():\n if isinstance(value, dict):\n summary_data[key] = self._remove_sageworks_meta(value)\n elif not key.startswith(\"sageworks_\"):\n summary_data[key] = value\n return summary_data\n\n def refresh_all_aws_meta(self) -> None:\n \"\"\"Force a refresh of all the metadata\"\"\"\n self.aws_broker.get_all_metadata(force_refresh=True)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.__init__","title":"__init__()
","text":"Meta Initialization
Source code insrc/sageworks/api/meta.py
def __init__(self):\n \"\"\"Meta Initialization\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Account and Service Brokers\n self.aws_account_clamp = AWSAccountClamp()\n self.aws_broker = AWSServiceBroker()\n self.cm = ConfigManager()\n\n # Pipeline Manager\n self.pipeline_manager = PipelineManager()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.account","title":"account()
","text":"Print out the AWS Account Info
Returns:
Name Type Descriptiondict
dict
The AWS Account Info
Source code insrc/sageworks/api/meta.py
def account(self) -> dict:\n \"\"\"Print out the AWS Account Info\n\n Returns:\n dict: The AWS Account Info\n \"\"\"\n return self.aws_account_clamp.get_aws_account_info()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
Name Type Descriptiondict
dict
The current SageWorks Configuration
Source code insrc/sageworks/api/meta.py
def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return self.cm.get_all_config()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_source_details","title":"data_source_details(data_source_name, database='sageworks', refresh=False)
","text":"Get detailed information about a specific data source in AWS
Parameters:
Name Type Description Defaultdata_source_name
str
The name of the data source
requireddatabase
str
Glue database. Defaults to 'sageworks'.
'sageworks'
refresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
Union[dict, None]
Detailed information about the data source (or None if not found)
Source code insrc/sageworks/api/meta.py
def data_source_details(\n self, data_source_name: str, database: str = \"sageworks\", refresh: bool = False\n) -> Union[dict, None]:\n \"\"\"Get detailed information about a specific data source in AWS\n\n Args:\n data_source_name (str): The name of the data source\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: Detailed information about the data source (or None if not found)\n \"\"\"\n data = self.data_sources_deep(database=database, refresh=refresh)\n return data.get(data_source_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources in AWS
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Data Sources in AWS
Source code insrc/sageworks/api/meta.py
def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources in AWS\n\n Returns:\n pd.DataFrame: A summary of the Data Sources in AWS\n \"\"\"\n data = self.data_sources_deep()\n data_summary = []\n\n # Pull in various bits of metadata for each data source\n for name, info in data.items():\n summary = {\n \"Name\": name,\n \"Modified\": datetime_string(info.get(\"UpdateTime\")),\n \"Num Columns\": num_columns_ds(info),\n \"Tags\": info.get(\"Parameters\", {}).get(\"sageworks_tags\", \"-\"),\n \"Input\": str(\n info.get(\"Parameters\", {}).get(\"sageworks_input\", \"-\"),\n ),\n \"_aws_url\": aws_url(info, \"DataSource\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources_deep","title":"data_sources_deep(database='sageworks', refresh=False)
","text":"Get a deeper set of data for the Data Sources in AWS
Parameters:
Name Type Description Defaultdatabase
str
Glue database. Defaults to 'sageworks'.
'sageworks'
refresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Data Sources in AWS
Source code insrc/sageworks/api/meta.py
def data_sources_deep(self, database: str = \"sageworks\", refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Data Sources in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Data Sources in AWS\n \"\"\"\n data = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=refresh)\n\n # Data Sources are in two databases, 'sageworks' and 'sagemaker_featurestore'\n data = data[database]\n\n # Return the data\n return data\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints","title":"endpoints(refresh=False)
","text":"Get a summary of the Endpoints in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Endpoints in AWS
Source code insrc/sageworks/api/meta.py
def endpoints(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in AWS\n \"\"\"\n data = self.endpoints_deep(refresh)\n data_summary = []\n\n # Get Summary information for each endpoint\n for endpoint, endpoint_info in data.items():\n # Get the SageWorks metadata for this Endpoint\n sageworks_meta = endpoint_info.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the endpoint is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Name\": endpoint_info[\"EndpointName\"],\n \"Health\": health_tags,\n \"Instance\": endpoint_info.get(\"InstanceType\", \"-\"),\n \"Created\": datetime_string(endpoint_info.get(\"CreationTime\")),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": endpoint_info[\"EndpointStatus\"],\n \"Variant\": endpoint_info.get(\"ProductionVariants\", [{}])[0].get(\"VariantName\", \"-\"),\n \"Capture\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"EnableCapture\", \"False\")),\n \"Samp(%)\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"CurrentSamplingPercentage\", \"-\")),\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints_deep","title":"endpoints_deep(refresh=False)
","text":"Get a deeper set of data for Endpoints in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Endpoints in AWS
Source code insrc/sageworks/api/meta.py
def endpoints_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Endpoints in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_set_details","title":"feature_set_details(feature_set_name)
","text":"Get detailed information about a specific feature set in AWS
Parameters:
Name Type Description Defaultfeature_set_name
str
The name of the feature set
requiredReturns:
Name Type Descriptiondict
dict
Detailed information about the feature set
Source code insrc/sageworks/api/meta.py
def feature_set_details(self, feature_set_name: str) -> dict:\n \"\"\"Get detailed information about a specific feature set in AWS\n\n Args:\n feature_set_name (str): The name of the feature set\n\n Returns:\n dict: Detailed information about the feature set\n \"\"\"\n data = self.feature_sets_deep()\n return data.get(feature_set_name, {})\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets","title":"feature_sets(refresh=False)
","text":"Get a summary of the Feature Sets in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Feature Sets in AWS
Source code insrc/sageworks/api/meta.py
def feature_sets(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets in AWS\n \"\"\"\n data = self.feature_sets_deep(refresh)\n data_summary = []\n\n # Pull in various bits of metadata for each feature set\n for name, group_info in data.items():\n sageworks_meta = group_info.get(\"sageworks_meta\", {})\n summary = {\n \"Feature Group\": group_info[\"FeatureGroupName\"],\n \"Created\": datetime_string(group_info.get(\"CreationTime\")),\n \"Num Columns\": num_columns_fs(group_info),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Online\": str(group_info.get(\"OnlineStoreConfig\", {}).get(\"EnableOnlineStore\", \"False\")),\n \"_aws_url\": aws_url(group_info, \"FeatureSet\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets_deep","title":"feature_sets_deep(refresh=False)
","text":"Get a deeper set of data for the Feature Sets in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Feature Sets in AWS
Source code insrc/sageworks/api/meta.py
def feature_sets_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Feature Sets in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_jobs","title":"glue_jobs()
","text":"Get summary data about AWS Glue Jobs
Source code insrc/sageworks/api/meta.py
def glue_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about AWS Glue Jobs\"\"\"\n glue_meta = self.glue_jobs_deep()\n glue_summary = []\n\n # Get the information about each Glue Job\n for name, info in glue_meta.items():\n summary = {\n \"Name\": info[\"Name\"],\n \"GlueVersion\": info[\"GlueVersion\"],\n \"Workers\": info.get(\"NumberOfWorkers\", \"-\"),\n \"WorkerType\": info.get(\"WorkerType\", \"-\"),\n \"Modified\": datetime_string(info.get(\"LastModifiedOn\")),\n \"LastRun\": datetime_string(info[\"sageworks_meta\"][\"last_run\"]),\n \"Status\": info[\"sageworks_meta\"][\"status\"],\n \"_aws_url\": aws_url(info, \"GlueJob\", self.aws_account_clamp), # Hidden Column\n }\n glue_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(glue_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_jobs_deep","title":"glue_jobs_deep(refresh=False)
","text":"Get a deeper set of data for the Glue Jobs in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Glue Jobs in AWS
Source code insrc/sageworks/api/meta.py
def glue_jobs_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Glue Jobs in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Glue Jobs in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.GLUE_JOBS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming-data S3 Bucket
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the data in the incoming-data S3 Bucket
Source code insrc/sageworks/api/meta.py
def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming-data S3 Bucket\n\n Returns:\n pd.DataFrame: A summary of the data in the incoming-data S3 Bucket\n \"\"\"\n data = self.incoming_data_deep()\n data_summary = []\n for name, info in data.items():\n # Get the name and the size of the S3 Storage Object(s)\n name = \"/\".join(name.split(\"/\")[-2:]).replace(\"incoming-data/\", \"\")\n info[\"Name\"] = name\n size = info.get(\"ContentLength\") / 1_000_000\n summary = {\n \"Name\": name,\n \"Size(MB)\": f\"{size:.2f}\",\n \"Modified\": datetime_string(info.get(\"LastModified\", \"-\")),\n \"ContentType\": str(info.get(\"ContentType\", \"-\")),\n \"ServerSideEncryption\": info.get(\"ServerSideEncryption\", \"-\"),\n \"Tags\": str(info.get(\"tags\", \"-\")),\n \"_aws_url\": aws_url(info, \"S3\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data_deep","title":"incoming_data_deep(refresh=False)
","text":"Get a deeper set of data for the Incoming Data in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Incoming Data in AWS
Source code insrc/sageworks/api/meta.py
def incoming_data_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Incoming Data in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Incoming Data in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.INCOMING_DATA_S3, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.model_details","title":"model_details(model_group_name)
","text":"Get detailed information about a specific model group in AWS
Parameters:
Name Type Description Defaultmodel_group_name
str
The name of the model group
requiredReturns:
Name Type Descriptiondict
dict
Detailed information about the model group
Source code insrc/sageworks/api/meta.py
def model_details(self, model_group_name: str) -> dict:\n \"\"\"Get detailed information about a specific model group in AWS\n\n Args:\n model_group_name (str): The name of the model group\n\n Returns:\n dict: Detailed information about the model group\n \"\"\"\n data = self.models_deep()\n return data.get(model_group_name, {})\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models","title":"models(refresh=False)
","text":"Get a summary of the Models in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Models in AWS
Source code insrc/sageworks/api/meta.py
def models(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models in AWS\n \"\"\"\n model_data = self.models_deep(refresh)\n model_summary = []\n for model_group_name, model_list in model_data.items():\n\n # Get Summary information for the 'latest' model in the model_list\n latest_model = model_list[0]\n sageworks_meta = latest_model.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the model is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Model Group\": latest_model[\"ModelPackageGroupName\"],\n \"Health\": health_tags,\n \"Owner\": sageworks_meta.get(\"sageworks_owner\", \"-\"),\n \"Model Type\": sageworks_meta.get(\"sageworks_model_type\"),\n \"Created\": datetime_string(latest_model.get(\"CreationTime\")),\n \"Ver\": latest_model[\"ModelPackageVersion\"],\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": latest_model[\"ModelPackageStatus\"],\n \"Description\": latest_model.get(\"ModelPackageDescription\", \"-\"),\n }\n model_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(model_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models_deep","title":"models_deep(refresh=False)
","text":"Get a deeper set of data for Models in AWS
Args: refresh (bool, optional): Force a refresh of the metadata. Defaults to False.
Returns:
Name Type Descriptiondict
dict
A summary of the Models in AWS
Source code insrc/sageworks/api/meta.py
def models_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Models in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.pipelines","title":"pipelines(refresh=False)
","text":"Get a summary of the SageWorks Pipelines
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the SageWorks Pipelines
Source code insrc/sageworks/api/meta.py
def pipelines(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the SageWorks Pipelines\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the SageWorks Pipelines\n \"\"\"\n data = self.pipeline_manager.list_pipelines()\n\n # Return the pipelines summary as a DataFrame\n return pd.DataFrame(data)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.refresh_all_aws_meta","title":"refresh_all_aws_meta()
","text":"Force a refresh of all the metadata
Source code insrc/sageworks/api/meta.py
def refresh_all_aws_meta(self) -> None:\n \"\"\"Force a refresh of all the metadata\"\"\"\n self.aws_broker.get_all_metadata(force_refresh=True)\n
"},{"location":"api_classes/meta/#examples","title":"Examples","text":"These example show how to use the Meta()
class to pull lists of artifacts from AWS: DataSources, FeatureSets, Models, Endpoints, and more. If you're building a web interface plugin, the Meta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the Meta()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> meta = Meta()\n[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> model_info = meta.models()\n[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> model_info\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
meta_list_models.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodels = meta.models()\n\nprint(f\"Number of Models: {len(models)}\")\nprint(models)\n\n# Get more details data on the Endpoints\nmodels_groups = meta.models_deep()\nfor name, model_versions in models_groups.items():\n print(name)\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
Getting Model Performance Metrics
meta_models.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class to get metadata about our Models\nmeta = Meta()\nmodel_info = meta.models_deep()\n\n# Print out the summary of our Models\nfor name, info in model_info.items():\n print(f\"{name}\")\n latest = info[0] # We get a list of models, so we only want the latest\n print(f\"\\tARN: {latest['ModelPackageGroupArn']}\")\n print(f\"\\tDescription: {latest['ModelPackageDescription']}\")\n print(f\"\\tTags: {latest['sageworks_meta']['sageworks_tags']}\")\n performance_metrics = latest[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n print(f\"\\tPerformance Metrics:\")\n print(f\"\\t\\t{performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
List the Endpoints in AWS
meta_list_endpoints.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class and get a list of our Endpoints\nmeta = Meta()\nendpoints = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoints)}\")\nprint(endpoints)\n\n# Get more details data on the Endpoints\nendpoints_deep = meta.endpoints_deep()\nfor name, info in endpoints_deep.items():\n print(name)\n print(info.keys())\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\ndict_keys(['EndpointName', 'EndpointArn', 'EndpointConfigName', 'ProductionVariants', 'EndpointStatus', 'CreationTime', 'LastModifiedTime', 'ResponseMetadata', 'InstanceType', 'sageworks_meta'])\nabalone-regression-end\ndict_keys(['EndpointName', 'EndpointArn', 'EndpointConfigName', 'ProductionVariants', 'EndpointStatus', 'CreationTime', 'LastModifiedTime', 'ResponseMetadata', 'InstanceType', 'sageworks_meta'])\n
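List the DataSources and FeatureSets in AWS
A minimal sketch along the same lines as the examples above (not one of the repo example files): data_sources() and feature_sets() are documented earlier on this page and both return a pandas DataFrame, so the exact output depends on what's in your AWS account.
from sageworks.api.meta import Meta\n\n# Create our Meta Class and get a list of our DataSources and FeatureSets\nmeta = Meta()\n\ndata_sources = meta.data_sources()\nprint(f\"Number of DataSources: {len(data_sources)}\")\nprint(data_sources)\n\nfeature_sets = meta.feature_sets()\nprint(f\"Number of FeatureSets: {len(feature_sets)}\")\nprint(feature_sets)\n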
Not Finding some particular AWS Data?
The SageWorks Meta API Class also has _details()
methods, so make sure to check those out.
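For a quick sketch of those detail methods (artifact names here are borrowed from the examples on this page; substitute names that exist in your own account):
from sageworks.api.meta import Meta\n\n# Create our Meta Class and pull details for specific artifacts\nmeta = Meta()\n\n# Details come back as plain dictionaries (or None/{} if the artifact isn't found)\nds_details = meta.data_source_details(\"abalone_data\")\nfs_details = meta.feature_set_details(\"abalone_features\")\nmodel_details = meta.model_details(\"abalone-regression\")\n\nprint(model_details.keys())\n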
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually, but the SageWorks Model Class makes it a breeze!
Model: Manages AWS Model Package/Group creation and management.
Models are automatically set up and provisioned for deployment into AWS. Models can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/model/#sageworks.api.model.Model","title":"Model
","text":" Bases: ModelCore
Model: SageWorks Model API Class.
Common Usagemy_features = Model(name)\nmy_features.details()\nmy_features.to_endpoint()\n
Source code in src/sageworks/api/model.py
class Model(ModelCore):\n \"\"\"Model: SageWorks Model API Class.\n\n Common Usage:\n ```\n my_features = Model(name)\n my_features.details()\n my_features.to_endpoint()\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n\n def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.details","title":"details(**kwargs)
","text":"Retrieve the Model Details.
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Model
Source code insrc/sageworks/api/model.py
def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.to_endpoint","title":"to_endpoint(name=None, tags=None, serverless=True)
","text":"Create an Endpoint from the Model.
Parameters:
Name Type Description Defaultname
str
Set the name for the endpoint. If not specified, an automatic name will be generated
None
tags
list
Set the tags for the endpoint. If not specified automatic tags will be generated.
None
serverless
bool
Set the endpoint to be serverless (default: True)
True
Returns:
Name Type DescriptionEndpoint
Endpoint
The Endpoint created from the Model
Source code insrc/sageworks/api/model.py
def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a Model from a FeatureSet
featureset_to_model.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"test_features\")\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR (XGBoost is default)\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
Use a specific Scikit-Learn Model
featureset_to_knn.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Transform FeatureSet into KNN Regression Model\n# Note: model_class can be any scikit-learn model\n#       \"KNeighborsRegressor\", \"BayesianRidge\",\n#       \"GaussianNB\", \"AdaBoostClassifier\", etc\nmy_model = my_features.to_model(\n    model_class=\"KNeighborsRegressor\",\n    target_column=\"class_number_of_rings\",\n    name=\"abalone-knn-reg\",\n    description=\"Abalone KNN Regression\",\n    tags=[\"abalone\", \"knn\"],\n    train_all_data=True,\n)\npprint(my_model.details())\n
Another Scikit-Learn Example featureset_to_rfc.pyfrom sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"wine_features\")\n\n# Using a Scikit-Learn Model\n# Note: model_class can be any scikit-learn model (\"KNeighborsRegressor\", \"BayesianRidge\",\n#       \"GaussianNB\", \"AdaBoostClassifier\", \"Ridge\", \"Lasso\", \"SVC\", \"SVR\", etc...)\nmy_model = my_features.to_model(\n    model_class=\"RandomForestClassifier\",\n    target_column=\"wine_class\",\n    name=\"wine-rfc-class\",\n    description=\"Wine RandomForest Classification\",\n    tags=[\"wine\", \"rfc\"]\n)\npprint(my_model.details())\n
Create an Endpoint from a Model
Endpoint Costs
Serverless endpoints are a great option, they have no AWS charges when not running. A realtime endpoint has less latency (no cold start) but AWS charges an hourly fee which can add up quickly!
model_to_endpoint.pyfrom sageworks.api.model import Model\n\n# Grab the abalone regression Model\nmodel = Model(\"abalone-regression\")\n\n# By default, an Endpoint is serverless, you can\n# make a realtime endpoint with serverless=False\nmodel.to_endpoint(name=\"abalone-regression-end\",\n tags=[\"abalone\", \"regression\"],\n serverless=True)\n
Model Health Check and Metrics
model_metrics.pyfrom sageworks.api.model import Model\n\n# Grab the abalone-regression Model\nmodel = Model(\"abalone-regression\")\n\n# Perform a health check on the model\n# Note: The health_check() method returns 'issues' if there are any\n# problems, so if there are no issues, the model is healthy\nhealth_issues = model.health_check()\nif not health_issues:\n print(\"Model is Healthy\")\nelse:\n print(\"Model has issues\")\n print(health_issues)\n\n# Get the model metrics and regression predictions\nprint(model.model_metrics())\nprint(model.regression_predictions())\n
Output
Model is Healthy\n metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n
"},{"location":"api_classes/model/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates an AWS Model Package Group and an AWS Model Package. These model artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
SageWorks Dashboard: ModelsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available, please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/monitor/","title":"Monitor","text":"Monitor Examples
Examples of using the Monitor class are listed at the bottom of this page Examples.
Monitor: Manages AWS Endpoint Monitor creation and deployment. Endpoint Monitors are set up and provisioned for deployment into AWS. Monitors can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional monitor details and performance metrics.
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor","title":"Monitor
","text":" Bases: MonitorCore
Monitor: SageWorks Monitor API Class
Common Usagemon = Endpoint(name).get_monitor() # Pull from endpoint OR\nmon = Monitor(name) # Create using Endpoint Name\nmon.summary()\nmon.details()\n\n# One time setup methods\nmon.add_data_capture()\nmon.create_baseline()\nmon.create_monitoring_schedule()\n\n# Pull information from the monitor\nbaseline_df = mon.get_baseline()\nconstraints_df = mon.get_constraints()\nstats_df = mon.get_statistics()\ninput_df, output_df = mon.get_latest_data_capture()\n
Source code in src/sageworks/api/monitor.py
class Monitor(MonitorCore):\n \"\"\"Monitor: SageWorks Monitor API Class\n\n Common Usage:\n ```\n mon = Endpoint(name).get_monitor() # Pull from endpoint OR\n mon = Monitor(name) # Create using Endpoint Name\n mon.summary()\n mon.details()\n\n # One time setup methods\n mon.add_data_capture()\n mon.create_baseline()\n mon.create_monitoring_schedule()\n\n # Pull information from the monitor\n baseline_df = mon.get_baseline()\n constraints_df = mon.get_constraints()\n stats_df = mon.get_statistics()\n input_df, output_df = mon.get_latest_data_capture()\n ```\n \"\"\"\n\n def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n\n def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for this Monitor/endpoint.
Parameters:
Name Type Description Defaultcapture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/api/monitor.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring
Parameters:
Name Type Description Defaultrecreate
bool
If True, recreate the baseline even if it already exists
False
Notes This will create/write three files to the baseline_dir: - baseline.csv - constraints.json - statistics.json
Source code insrc/sageworks/api/monitor.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint.
Parameters:
Name Type Description Defaultschedule
str
The schedule for the monitoring job (hourly or daily, defaults to hourly).
'hourly'
recreate
bool
If True, recreate the monitoring schedule even if it already exists.
False
Source code in src/sageworks/api/monitor.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.details","title":"details()
","text":"Monitor Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Monitor
Source code insrc/sageworks/api/monitor.py
def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture input and output from S3.
Returns:
Name Type DescriptionDataFrame
input), DataFrame(output
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/api/monitor.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.summary","title":"summary()
","text":"Monitor Summary
Returns:
Name Type Descriptiondict
dict
A dictionary of summary information about the Monitor
Source code insrc/sageworks/api/monitor.py
def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n
"},{"location":"api_classes/monitor/#examples","title":"Examples","text":"Initial Setup of the Endpoint Monitor
monitor_setup.pyfrom sageworks.api.monitor import Monitor\n\n# Create an Endpoint Monitor Class and perform initial Setup\nendpoint_name = \"abalone-regression-end-rt\"\nmon = Monitor(endpoint_name)\n\n# Add data capture to the endpoint\nmon.add_data_capture(capture_percentage=100)\n\n# Create a baseline for monitoring\nmon.create_baseline()\n\n# Set up the monitoring schedule\nmon.create_monitoring_schedule(schedule=\"hourly\")\n
Pulling Information from an Existing Monitor
monitor_usage.pyfrom sageworks.api.monitor import Monitor\nfrom sageworks.api.endpoint import Endpoint\n\n# Construct a Monitor Class in one of Two Ways\nmon = Endpoint(\"abalone-regression-end-rt\").get_monitor()\nmon = Monitor(\"abalone-regression-end-rt\")\n\n# Check the summary and details of the monitoring class\nmon.summary()\nmon.details()\n\n# Check the baseline outputs (baseline, constraints, statistics)\nbase_df = mon.get_baseline()\nbase_df.head()\n\nconstraints_df = mon.get_constraints()\nconstraints_df.head()\n\nstatistics_df = mon.get_statistics()\nstatistics_df.head()\n\n# Get the latest data capture (inputs and outputs)\ninput_df, output_df = mon.get_latest_data_capture()\ninput_df.head()\noutput_df.head()\n
"},{"location":"api_classes/monitor/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint Monitor. The Monitor status and outputs can be viewed in the Sagemaker Console interfaces or in the SageWorks Dashboard UI. SageWorks will use the monitor to track various metrics including Data Quality, Model Bias, etc...
SageWorks Dashboard: EndpointsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available, please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/overview/","title":"Overview","text":"Just Getting Started?
You're in the right place; the SageWorks API Classes are the best way to get started with SageWorks!
"},{"location":"api_classes/overview/#welcome-to-the-sageworks-api-classes","title":"Welcome to the SageWorks API Classes","text":"These classes provide high-level APIs for the SageWorks package, they enable your team to build full AWS Machine Learning Pipelines. They handle all the details around updating and managing a complex set of AWS Services. Each class provides an essential component of the overall ML Pipline. Simply combine the classes to build production ready, AWS powered, machine learning pipelines.
from sageworks.api.data_source import DataSource\nfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model, ModelType\nfrom sageworks.api.endpoint import Endpoint\n\n# Create the abalone_data DataSource\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\")\n\n# Now create a FeatureSet\nds.to_features(\"abalone_features\")\n\n# Create the abalone_regression Model\nfs = FeatureSet(\"abalone_features\")\nfs.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n tags=[\"abalone\", \"regression\"],\n description=\"Abalone Regression Model\",\n)\n\n# Create the abalone_regression Endpoint\nmodel = Model(\"abalone-regression\")\nmodel.to_endpoint(name=\"abalone-regression-end\", tags=[\"abalone\", \"regression\"])\n\n# Now we'll run inference on the endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Get a DataFrame of data (not used to train) and run predictions\nathena_table = fs.get_training_view_table()\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = 0\")\nresults = endpoint.predict(df)\nprint(results[[\"class_number_of_rings\", \"prediction\"]])\n
Output
Processing...\n class_number_of_rings prediction\n0 12 10.477794\n1 11 11.11835\n2 14 13.605763\n3 12 11.744759\n4 17 15.55189\n.. ... ...\n826 7 7.981503\n827 11 11.246113\n828 9 9.592911\n829 6 6.129388\n830 8 7.628252\n
Full AWS ML Pipeline Achievement Unlocked!
Bing! You just built and deployed a full AWS Machine Learning Pipeline. You can now use the SageWorks Dashboard web interface to inspect your AWS artifacts. A comprehensive set of Exploratory Data Analysis techniques and Model Performance Metrics are available for your entire team to review, inspect and interact with.
Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples
Pipeline Examples
Examples of using the Pipeline classes are listed at the bottom of this page Examples.
Pipelines store sequences of SageWorks transforms. So if you have a nightly ML workflow you can capture that as a Pipeline. Here's an example pipeline:
nightly_sol_pipeline_v1.json{\n    \"data_source\": {\n        \"name\": \"nightly_data\",\n        \"tags\": [\"solubility\", \"foo\"],\n        \"s3_input\": \"s3://blah/blah.csv\"\n    },\n    \"feature_set\": {\n        \"name\": \"nightly_features\",\n        \"tags\": [\"blah\", \"blah\"],\n        \"input\": \"nightly_data\",\n        \"schema\": \"mol_descriptors_v1\"\n    },\n    \"model\": {\n        \"name\": \"nightly_model\",\n        \"tags\": [\"blah\", \"blah\"],\n        \"features\": [\"col1\", \"col2\"],\n        \"target\": \"sol\",\n        \"input\": \"nightly_features\"\n    },\n    \"endpoint\": {\n        ...\n    }\n}\n
PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Pipeline: Manages the details around a SageWorks Pipeline, including Execution.
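As a rough sketch of how a pipeline like the one above gets created and published (assuming the abalone-regression-end endpoint from the earlier examples, and an import path that matches the source file shown below):
from sageworks.api.pipeline_manager import PipelineManager\n\n# Create a PipelineManager and build a Pipeline from an existing Endpoint\nmy_manager = PipelineManager()\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Publish it to SageWorks (stored in S3 under the pipelines/ prefix)\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n\n# Or save it locally and publish from the JSON file\nmy_manager.save_pipeline_to_file(abalone_pipeline, \"abalone_pipeline_v1.json\")\nmy_manager.publish_pipeline_from_file(\"abalone_pipeline_v1.json\")\n\n# List all the published Pipelines\nprint(my_manager.list_pipelines())\n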
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager","title":"PipelineManager
","text":"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Common Usagemy_manager = PipelineManager()\nmy_manager.list_pipelines()\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\nmy_manager.save_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Source code in src/sageworks/api/pipeline_manager.py
class PipelineManager:\n \"\"\"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.\n\n Common Usage:\n ```\n my_manager = PipelineManager()\n my_manager.list_pipelines()\n abalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n my_manager.save_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto_session.client(\"s3\")\n\n def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.warning(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n\n # Create a new Pipeline from an Endpoint\n def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n\n # Publish a Pipeline to SageWorks\n def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n\n def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n 
name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n\n # Save a Pipeline to a local file\n def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n\n def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n\n def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.__init__","title":"__init__()
","text":"Pipeline Init Method
Source code insrc/sageworks/api/pipeline_manager.py
def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto_session.client(\"s3\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.create_from_endpoint","title":"create_from_endpoint(endpoint_name)
","text":"Create a Pipeline from an Endpoint
Parameters:
Name Type Description Defaultendpoint_name
str
The name of the Endpoint
requiredReturns:
Name Type Descriptiondict
dict
A dictionary of the Pipeline
Source code insrc/sageworks/api/pipeline_manager.py
def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.delete_pipeline","title":"delete_pipeline(name)
","text":"Delete a Pipeline from S3
Parameters:
Name Type Description Defaultname
str
The name of the Pipeline to delete
required Source code insrc/sageworks/api/pipeline_manager.py
def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.list_pipelines","title":"list_pipelines()
","text":"List all the Pipelines in the S3 Bucket
Returns:
Name Type Descriptionlist
list
A list of Pipeline names and details
Source code insrc/sageworks/api/pipeline_manager.py
def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.warning(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.load_pipeline_from_file","title":"load_pipeline_from_file(filepath)
","text":"Load a Pipeline from a local file
Parameters:
Name Type Description Defaultfilepath
str
The path of the Pipeline to load
requiredReturns:
Name Type Descriptiondict
dict
The Pipeline loaded from the file
Source code insrc/sageworks/api/pipeline_manager.py
def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline","title":"publish_pipeline(name, pipeline)
","text":"Save a Pipeline to S3
Parameters:
Name Type Description Defaultname
str
The name of the Pipeline
requiredpipeline
dict
The Pipeline to save
required Source code insrc/sageworks/api/pipeline_manager.py
def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline_from_file","title":"publish_pipeline_from_file(filepath)
","text":"Publish a Pipeline to SageWorks from a local file
Parameters:
Name Type Description Defaultfilepath
str
The path of the Pipeline to publish
required Source code insrc/sageworks/api/pipeline_manager.py
def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.save_pipeline_to_file","title":"save_pipeline_to_file(pipeline, filepath)
","text":"Save a Pipeline to a local file
Parameters:
Name Type Description Defaultpipeline
dict
The Pipeline to save
requiredfilepath
str
The path to save the Pipeline
required Source code insrc/sageworks/api/pipeline_manager.py
def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline","title":"Pipeline
","text":"Pipeline: SageWorks Pipeline API Class
Common Usagemy_pipeline = Pipeline(\"name\")\nmy_pipeline.details()\nmy_pipeline.execute() # Execute entire pipeline\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
Source code in src/sageworks/api/pipeline.py
class Pipeline:\n \"\"\"Pipeline: SageWorks Pipeline API Class\n\n Common Usage:\n ```\n my_pipeline = Pipeline(\"name\")\n my_pipeline.details()\n my_pipeline.execute() # Execute entire pipeline\n my_pipeline.execute_partial([\"data_source\", \"feature_set\"])\n my_pipeline.execute_partial([\"model\", \"endpoint\"])\n ```\n \"\"\"\n\n def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n self.s3_client = self.boto_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n # Data Storage Cache\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n\n def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n\n def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n\n def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n\n def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n\n def delete(self):\n 
\"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n self.data_storage.delete(f\"pipeline:{self.name}:details\")\n wr.s3.delete_objects(self.s3_path)\n\n def _get_pipeline(self) -> dict:\n \"\"\"Internal: Get the pipeline as a JSON object from the specified S3 bucket and key.\"\"\"\n response = self.s3_client.get_object(Bucket=self.bucket, Key=self.key)\n json_object = json.loads(response[\"Body\"].read())\n return json_object\n\n def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__init__","title":"__init__(name)
","text":"Pipeline Init Method
Source code insrc/sageworks/api/pipeline.py
def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n self.s3_client = self.boto_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n # Data Storage Cache\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__repr__","title":"__repr__()
","text":"String representation of this pipeline
Returns:
Name Type Descriptionstr
str
String representation of this pipeline
Source code insrc/sageworks/api/pipeline.py
def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.delete","title":"delete()
","text":"Pipeline Deletion
Source code insrc/sageworks/api/pipeline.py
def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n self.data_storage.delete(f\"pipeline:{self.name}:details\")\n wr.s3.delete_objects(self.s3_path)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute","title":"execute()
","text":"Execute the entire Pipeline
Raises:
Type DescriptionRunTimeException
If the pipeline execution fails in any way
Source code insrc/sageworks/api/pipeline.py
def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute_partial","title":"execute_partial(subset)
","text":"Execute a partial Pipeline
Parameters:
Name Type Description Defaultsubset
list
A subset of the pipeline to execute
requiredRaises:
Type DescriptionRunTimeException
If the pipeline execution fails in any way
Source code insrc/sageworks/api/pipeline.py
def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.report_settable_fields","title":"report_settable_fields(pipeline={}, path='')
","text":"Recursively finds and prints keys with settable fields in a JSON-like dictionary.
Args: pipeline (dict): pipeline (or sub pipeline) to process. path (str): Current path to the key, used for nested dictionaries.
Source code insrc/sageworks/api/pipeline.py
def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_holdout_ids","title":"set_holdout_ids(id_column, holdout_ids)
","text":"Set the input for the Pipeline
Parameters:
Name Type Description Defaultid_list
list
The list of hold out ids
required Source code insrc/sageworks/api/pipeline.py
def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_input","title":"set_input(input, artifact='data_source')
","text":"Set the input for the Pipeline
Parameters:
Name Type Description Defaultinput
Union[str, DataFrame]
The input for the Pipeline
requiredartifact
str
The artifact to set the input for (default: \"data_source\")
'data_source'
Source code in src/sageworks/api/pipeline.py
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n
"},{"location":"api_classes/pipelines/#examples","title":"Examples","text":"Make a Pipeline
Pipelines are just JSON files (see sageworks/examples/pipelines/
). You can copy one and make changes to fit your objects/use case, or if you have a set of SageWorks artifacts created you can 'backtrack' from the Endpoint and have it create the Pipeline for you.
from sageworks.api.pipeline_manager import PipelineManager\n\n # Create a PipelineManager\nmy_manager = PipelineManager()\n\n# List the Pipelines\npprint(my_manager.list_pipelines())\n\n# Create a Pipeline from an Endpoint\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Publish the Pipeline\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Output
Listing Pipelines...\n[{'last_modified': datetime.datetime(2024, 4, 16, 21, 10, 6, tzinfo=tzutc()),\n 'name': 'abalone_pipeline_v1',\n 'size': 445}]\n
Pipeline Details pipeline_details.pyfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\npprint(my_pipeline.details())\n
Output
{\n \"name\": \"abalone_pipeline_v1\",\n \"s3_path\": \"s3://sandbox/pipelines/abalone_pipeline_v1.json\",\n \"pipeline\": {\n \"data_source\": {\n \"name\": \"abalone_data\",\n \"tags\": [\n \"abalone_data\"\n ],\n \"input\": \"/Users/briford/work/sageworks/data/abalone.csv\"\n },\n \"feature_set\": {\n \"name\": \"abalone_features\",\n \"tags\": [\n \"abalone_features\"\n ],\n \"input\": \"abalone_data\"\n },\n \"model\": {\n \"name\": \"abalone-regression\",\n \"tags\": [\n \"abalone\",\n \"regression\"\n ],\n \"input\": \"abalone_features\"\n },\n ...\n }\n}\n
Pipeline Execution
Pipeline Execution
Executing the Pipeline is obviously the most important reason for creating one. It gives you a reproducible way to capture, inspect, and run the same ML pipeline on different data (e.g. nightly).
pipeline_execution.pyfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\n\n# Execute the Pipeline\nmy_pipeline.execute() # Full execution\n\n# Partial executions\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
"},{"location":"api_classes/pipelines/#pipelines-advanced","title":"Pipelines Advanced","text":"As part of the flexible architecture sometimes DataSources or FeatureSets can be created with a Pandas DataFrame. To support a DataFrame as input to a pipeline we can call the set_input()
method to the pipeline object. If you'd like to specify the set_hold_out_ids()
you can also provide a list of ids.
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_hold_out_ids(self, id_list: list):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"hold_out_ids\"] = id_list\n
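Here's that usage sketch (the pipeline name and the local CSV file are placeholders, not artifacts that ship with SageWorks):
from sageworks.api.pipeline import Pipeline\nimport pandas as pd\n\n# Hypothetical DataFrame that will be used as the pipeline input\ndf = pd.read_csv(\"abalone.csv\")\n\n# Retrieve an existing Pipeline and swap the DataFrame in as the data_source input\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\nmy_pipeline.set_input(df)\n\n# Run just the front half of the pipeline (DataSource and FeatureSet creation)\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\n
Since set_input() simply stores the value on the pipeline dictionary, the same published pipeline can be re-run with different in-memory data from run to run.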
Running a pipeline creates and deploys a set of SageWorks Artifacts: DataSource, FeatureSet, Model, and Endpoint. These artifacts can be viewed in the SageMaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"aws_setup/aws_access_management/","title":"AWS Acesss Management","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page gives an overview of how SageWorks sets up roles and policies in a granular way that provides 'least privilege' access and a unified framework for AWS access management.
"},{"location":"aws_setup/aws_access_management/#conceptual-slide-deck","title":"Conceptual Slide Deck","text":"SageWorks AWS Acesss Management
"},{"location":"aws_setup/aws_access_management/#aws-resources","title":"AWS Resources","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page tries to give helpful guidance when setting up AWS Accounts, Users, and Groups. In general, AWS can be a bit tricky to set up the first time. Feel free to use any material in this guide, but we're more than happy to help clients get their AWS setup ready to go for FREE. Below are some guides for setting up a new AWS account for SageWorks and for setting up SSO Users and Groups within AWS.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-with-aws-organizations-easy","title":"New AWS Account (with AWS Organizations: easy)","text":"Email Trick
AWS will often not allow the same email to be used for different accounts. If you need a 'new' email, just add a plus sign '+' and a tag before the '@' in your existing email (e.g. bob.smith+aws@gmail.com). Email sent to this address will still be delivered to bob.smith@gmail.com.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-without-aws-organizations-a-bit-harder","title":"New AWS Account (without AWS Organizations: a bit harder)","text":"AWS SSO (Single Sign-On) is a cloud-based service that allows users to manage access to multiple AWS accounts and business applications using a single set of credentials. It simplifies the authentication process for users and provides centralized management of permissions and access control across various AWS resources. With AWS SSO, users can log in once and access all the applications and accounts they need, streamlining the user experience and increasing productivity. AWS SSO also enables IT administrators to manage access more efficiently by providing a single point of control for managing user access, permissions, and policies, reducing the risk of unauthorized access or security breaches.
"},{"location":"aws_setup/aws_tips_and_tricks/#setting-up-sso-users","title":"Setting up SSO Users","text":"The 'Add User' setup is fairly straight forward but here are some screen shots:
On the first panel you can fill in the user's information.
"},{"location":"aws_setup/aws_tips_and_tricks/#groups","title":"Groups","text":"On the second panel we suggest that you have at LEAST two groups:
This allows you to put most of the users into the DataScientists group, which has AWS policies based on their job role. AWS uses 'permission sets' to which you assign AWS Policies. This approach makes it easy to give a group of users a set of relevant policies for their tasks.
Our standard setup is to have two permission sets with the following policies:
Add Policy: arn:aws:iam::aws:policy/job-function/DataScientist
IAM Identity Center --> Permission sets --> AdministratorAccess
See: Permission Sets for more details and instructions.
Another benefit of creating groups is that you can include that group in 'Trust Policy (assume_role)' for the SageWorks-ExecutionRole (this gets deployed as part of the SageWorks AWS Stack). This means that the management of what SageWorks can do/see/read/write is completely done through the SageWorks-ExecutionRole.
"},{"location":"aws_setup/aws_tips_and_tricks/#back-to-adding-user","title":"Back to Adding User","text":"Okay now that we have our groups set up we can go back to our original goal of adding a user. So here's the second panel with the groups and now we can hit 'Next'
On the third panel just review the details and hit the 'Add User' button at the bottom. The user will get an email giving them instructions on how to log on to their AWS account.
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-console","title":"AWS Console","text":"Now when the user logs onto the AWS Console they should see something like this:
"},{"location":"aws_setup/aws_tips_and_tricks/#sso-setup-for-command-linepython-usage","title":"SSO Setup for Command Line/Python Usage","text":"Please see our SSO Setup
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-resources","title":"AWS Resources","text":"Welcome to the SageWorks AWS Setup Guide. SageWorks is deployed as an AWS Stack following the well architected system practices of AWS.
AWS Setup can be a bit complex
Setting up SageWorks with AWS can be a bit complex, but this only needs to be done ONCE for your entire company. The install uses standard CDK --> AWS Stacks and SageWorks tries to make it straightforward. If you have any troubles at all feel free to contact us at sageworks@supercowpowers.com or on Discord and we're happy to help you with AWS for FREE.
"},{"location":"aws_setup/core_stack/#two-main-options-when-using-sageworks","title":"Two main options when using SageWorks","text":"Either of these options are fully supported, but we highly suggest a NEW account as it gives the following benefits:
If your AWS Account already has users and groups set up you can skip this but here's our recommendations on setting up SSO Users and Groups
"},{"location":"aws_setup/core_stack/#onboarding-sageworks-to-your-aws-account","title":"Onboarding SageWorks to your AWS Account","text":"Pulling down the SageWorks Repo
git clone https://github.com/SuperCowPowers/sageworks.git\n
"},{"location":"aws_setup/core_stack/#sageworks-uses-aws-python-cdk-for-deployments","title":"SageWorks uses AWS Python CDK for Deployments","text":"If you don't have AWS CDK already installed you can do these steps:
Mac
brew install node \nnpm install -g aws-cdk\n
Linux sudo apt install nodejs\nsudo npm install -g aws-cdk\n
For more information on Linux installs see Digital Ocean NodeJS"},{"location":"aws_setup/core_stack/#create-an-s3-bucket-for-sageworks","title":"Create an S3 Bucket for SageWorks","text":"SageWorks pushes and pulls data from AWS; it will use this S3 Bucket for storage and processing. You should create a NEW S3 Bucket; we suggest a name like <company_name>-sageworks
Do the initial setup/config here: Getting Started. After you've done that, come back to this section. For Stack Deployment, a few additional settings need to be added to your config file. The config file will be located in your home directory ~/.sageworks/sageworks_config.json
. Edit this file and add the additional settings for the deployment. Specifically, there are two additional fields to add (both optional):
\"SAGEWORKS_SSO_GROUP\": DataScientist (or whatever)\n\"SAGEWORKS_ADDITIONAL_BUCKETS\": \"bucket1, bucket2\n
These are optional but are set/used by most SageWorks users. AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_core\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/core_stack/#aws-account-setup-check","title":"AWS Account Setup Check","text":"After setting up SageWorks config/AWS Account you can run this test/checking script. If the results ends with INFO AWS Account Clamp: AOK!
you're in good shape. If not feel free to contact us on Discord and we'll get it straightened out for you :)
pip install sageworks (if not already installed)\ncd sageworks/aws_setup\npython aws_account_check.py\n<lot of print outs for various checks>\n2023-04-12 11:17:09 (aws_account_check.py:48) INFO AWS Account Clamp: AOK!\n
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/dashboard_stack/","title":"Deploy the SageWorks Dashboard Stack","text":"Deploying the Dashboard Stack is reasonably straight forward, it's the same approach as the Core Stack that you've already deployed.
Please review the Stack Details section to understand all the AWS components that are included and utilized in the SageWorks Dashboard Stack.
"},{"location":"aws_setup/dashboard_stack/#deploying-the-dashboard-stack","title":"Deploying the Dashboard Stack","text":"AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one time install you should use an Admin Account (or an account that had permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_dashboard_full\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/dashboard_stack/#stack-details","title":"Stack Details","text":"AWS Questions?
There's quite a bit to unpack when deploying an AWS powered Web Service. We're happy to help walk you through the details and options. Contact us anytime for a free consultation.
AWS Costs
Deploying the SageWorks Dashboard does incur some monthly AWS costs. If you're on a tight budget you can deploy the 'lite' version of the Dashboard Stack.
cd sageworks/aws_setup/sageworks_dashboard_lite\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/domain_cert_setup/","title":"AWS Domain and Certificate Instructions","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or on chat us up on Discord
This page tries to give helpful guidance when setting up a new domain and SSL Certificate in your AWS Account.
"},{"location":"aws_setup/domain_cert_setup/#new-domain","title":"New Domain","text":"You'll want the SageWorks Dashboard to have a domain for your companies internal use. Customers will typically use a domain like <company_name>-ml-dashboard.com
but you are free to choose any domain you'd like.
Domains are tied to AWS Accounts
When you create a new domain in AWS Route 53, that domain is tied to that AWS Account. You can do a cross account setup for domains but it's a bit more tricky. We recommend that each account where SageWorks gets deployed owns the domain for that Dashboard.
"},{"location":"aws_setup/domain_cert_setup/#multiple-aws-accounts","title":"Multiple AWS Accounts","text":"Many customers will have a dev/stage/prod set of AWS accounts, if that the case then the best practice is to make a domain specific to each account. So for instance:
<company_name>-ml-dashboard-dev.com
<company_name>-ml-dashboard-prod.com
.This means that when you go to that Dashboard it's super obvious which environment your on.
"},{"location":"aws_setup/domain_cert_setup/#register-the-domain","title":"Register the Domain","text":"Open Route 53 Console Route 53 Console
Register your New Domain
Open ACM Console: AWS Certificate Manager (ACM) Console
Request a Certificate:
Add Domain Names:
yourdomain.com
).www.yourdomain.com
).Validation Method:
Add Tags (Optional):
Review and Request:
To complete the domain validation process for your SSL/TLS certificate, you need to add the CNAME records provided by AWS Certificate Manager (ACM) to your Route 53 hosted zone. This step ensures that you own the domain and allows ACM to issue the certificate.
"},{"location":"aws_setup/domain_cert_setup/#finding-cname-record-names-and-values","title":"Finding CNAME Record Names and Values","text":"You can find the CNAME record names and values in the AWS Certificate Manager (ACM) console:
Open ACM Console: AWS Certificate Manager (ACM) Console
Select Your Certificate:
View Domains Section:
Open Route 53 Console: Route 53 Console
Select Your Hosted Zone:
yourdomain.com
).Add the First CNAME Record:
_3e8623442477e9eeec.your-domain.com
).CNAME
._0908c89646d92.sdgjtdhdhz.acm-validations.aws.
) (include the trailing dot).Add the Second CNAME Record:
_75cd9364c643caa.www.your-domain.com
).CNAME
._f72f8cff4fb20f4.sdgjhdhz.acm-validations.aws.
) (include the trailing dot).DNS Propagation and Cert Validation
After adding the CNAME records, these DNS records will propagate through the DNS system and ACM will automatically detect the validation records and validate the domain. This process can take a few minutes or up to an hour.
"},{"location":"aws_setup/domain_cert_setup/#certificate-states","title":"Certificate States","text":"After requesting a certificate, it will go through the following states:
Pending Validation: The initial state after you request a certificate and before you complete the validation process. ACM is waiting for you to prove domain ownership by adding the CNAME records.
Issued: This state indicates that the certificate has been successfully validated and issued. You can now use this certificate with your AWS resources.
Validation Timed Out: If you do not complete the validation process within a specified period (usually 72 hours), the certificate request times out and enters this state.
Revoked: This state indicates that the certificate has been revoked and is no longer valid.
Failed: If the validation process fails for any reason, the certificate enters this state.
Inactive: This state indicates that the certificate is not currently in use.
The certificate status should obviously be in the Issued state, if not please contact SageWorks Support Team.
"},{"location":"aws_setup/domain_cert_setup/#retrieving-the-certificate-arn","title":"Retrieving the Certificate ARN","text":"Open ACM Console:
Check the Status:
Copy the Certificate ARN:
You now have the ARN for your certificate, which you can use in your AWS resources such as API Gateway, CloudFront, etc.
"},{"location":"aws_setup/domain_cert_setup/#aws-resources","title":"AWS Resources","text":"Now that the core Sageworks AWS Stack has been deployed. Let's test out SageWorks by building a full entire AWS ML Pipeline from start to finish. The script build_ml_pipeline.py
uses the SageWorks API to quickly and easily build an AWS Modeling Pipeline.
Taste the Awesome
The SageWorks \"hello world\" builds a full AWS ML Pipeline. From S3 to deployed model and endpoint. If you have any troubles at all feel free to contact us at sageworks email or on Discord and we're happy to help you for FREE.
This script will take a LONG TiME to run, most of the time is waiting on AWS to finalize FeatureGroups, train Models or deploy Endpoints.
\u276f python build_ml_pipeline.py\n<lot of building ML pipeline outputs>\n
After the script completes you will see that it's built out an AWS ML Pipeline and testing artifacts."},{"location":"aws_setup/full_pipeline/#run-the-sageworks-dashboard-local","title":"Run the SageWorks Dashboard (Local)","text":"Dashboard AWS Stack
Deploying the Dashboard Stack is straight-forward and provides a robust AWS Web Server with Load Balancer, Elastic Container Service, VPC Networks, etc. (see AWS Dashboard Stack)
For testing it's nice to run the Dashboard locally, but for longterm use the SageWorks Dashboard should be deployed as an AWS Stack. The deployed Stack allows everyone in the company to use, view, and interact with the AWS Machine Learning Artifacts created with SageWorks.
cd sageworks/application/aws_dashboard\n./dashboard\n
This will open a browser to http://localhost:8000 SageWorks Dashboard: AWS Pipelines in a Whole New Light!
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/sso_setup/","title":"AWS SSO Setup","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"aws_setup/sso_setup/#get-some-information","title":"Get some information","text":"If you're connecting to the SCP AWS Account you can use these values
Start URL: https://supercowpowers.awsapps.com/start \nRegion: us-west-2\n
"},{"location":"aws_setup/sso_setup/#install-aws-cli","title":"Install AWS CLI","text":"Mac brew install awscli
Linux pip install awscli
Windows
Download the MSI installer (top right corner on this page) https://aws.amazon.com/cli/ and follow the installation instructions.
"},{"location":"aws_setup/sso_setup/#running-the-sso-configuration","title":"Running the SSO Configuration","text":"Note: You only need to do this once!
aws configure sso --profile <whatever you want> (e.g. aws_sso)\nSSO session name (Recommended): sso-session\nSSO start URL []: <the Start URL from info above>\nSSO region []: <the Region from info above>\nSSO registration scopes [sso:account:access]: <just hit return>\n
You will get a browser open/redirect at this point and get a list of available accounts.. something like below, just pick the correct account
There are 2 AWS accounts available to you.\n> SCP_Sandbox, briford+sandbox@supercowpowers.com (XXXX40646YYY)\n SCP_Main, briford@supercowpowers.com (XXX576391YYY)\n
Now pick the role that you're going to use
There are 2 roles available to you.\n> DataScientist\n AdministratorAccess\n\nCLI default client Region [None]: <same region as above>\nCLI default output format [None]: json\n
"},{"location":"aws_setup/sso_setup/#setting-up-some-aliases-for-bashzsh","title":"Setting up some aliases for bash/zsh","text":"Edit your favorite ~/.bashrc ~/.zshrc and add these nice aliases/helper
# AWS Aliases\nalias bob_sso='export AWS_PROFILE=bob_sso'\n\n# Default AWS Profile\nexport AWS_PROFILE=bob_sso\n
"},{"location":"aws_setup/sso_setup/#testing-your-new-aws-profile","title":"Testing your new AWS Profile","text":"Make sure your profile is active/set
env | grep AWS\nAWS_PROFILE=<bob_sso or whatever>\n
Now you can list the S3 buckets in the AWS Account aws s3 ls\n
If you get some message like this... The SSO session associated with this profile has\nexpired or is otherwise invalid. To refresh this SSO\nsession run aws sso login with the corresponding\nprofile.\n
This is fine/good, a browser will open up and you can refresh your SSO Token.
After that you should get a listing of the S3 buckets without needed to refresh your token.
aws s3 ls\n\u276f aws s3 ls\n2023-03-20 20:06:53 aws-athena-query-results-XXXYYY-us-west-2\n2023-03-30 13:22:28 sagemaker-studio-XXXYYY-dbgyvq8ruka\n2023-03-24 22:05:55 sagemaker-us-west-2-XXXYYY\n2023-04-30 13:43:29 scp-sageworks-artifacts\n
"},{"location":"aws_setup/sso_setup/#back-to-initial-setup","title":"Back to Initial Setup","text":"If you're doing the initial setup of SageWorks you should now go back and finish that process: Getting Started
"},{"location":"aws_setup/sso_setup/#aws-resources","title":"AWS Resources","text":"Just Getting Started?
The SageWorks Blogs are a great way to see what's possible with SageWorks. Also, if you're ready to jump in, the API Classes will give you details on the SageWorks ML Pipeline Classes.
"},{"location":"blogs_research/#blogs","title":"Blogs","text":"Examples
All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
SageWorks EDS
The SageWorks toolkit a set of plots that show EDA results, it also has a flexible plugin architecture to expand, enhance, or even replace the current set of web components Dashboard.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created that data is run through a full set of EDA techniques:
SageWorks EDS
The SageWorks toolkit a set of plots that show EDA results, it also has a flexible plugin architecture to expand, enhance, or even replace the current set of web components Dashboard.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created that data is run through a full set of EDA techniques:
One of the latest EDA techniques we've added is the addition of a concept called High Target Gradients
[G_{ij} = \\frac{|y_i - y_j|}{d(x_i, x_j)}]
where (d(x_i, x_j)) is the distance between (x_i) and (x_j) in the feature space. This equation gives you the rate of change of the target value with respect to the change in features, similar to a slope in a two-dimensional space.
[G_{i}^{max} = \\max_{j \\neq i} G_{ij}]
This gives you a scalar value for each point in your training data that represents the maximum rate of change of the target value in its local neighborhood.
Usage: You can use (G_{i}^{max}) to identify and filter areas in the feature space that have high target gradients, which may indicate potential issues with data quality or feature representation.
Visualization: Plotting the distribution of (G_{i}^{max}) values or visualizing them in the context of the feature space can help you identify regions or specific points that warrant further investigation.
Overview and Definition Residual analysis involves examining the differences between observed and predicted values, known as residuals, to assess the performance of a predictive model. It is a critical step in model evaluation as it helps identify patterns of errors, diagnose potential problems, and improve model performance. By understanding where and why a model's predictions deviate from actual values, we can make informed adjustments to the model or the data to enhance accuracy and robustness.
Sparse Data Regions The observation is in a part of feature space with little or no nearby training observations, leading to poor generalization in these regions and resulting in high prediction errors.
Noisy/Inconsistent Data and Preprocessing Issues The observation is in a part of feature space where the training data is noisy, incorrect, or has high variance in the target variable. Additionally, missing values or incorrect data transformations can introduce errors, leading to unreliable predictions and high residuals.
Feature Resolution The current feature set may not fully resolve the compounds, leading to \u2018collisions\u2019 where different compounds are assigned identical features. Such unresolved features can result in different compounds exhibiting the same features, causing high residuals due to unaccounted structural or chemical nuances.
Activity Cliffs Structurally similar compounds exhibit significantly different activities, making accurate prediction challenging due to steep changes in activity with minor structural modifications.
Feature Engineering Issues Irrelevant or redundant features and poor feature scaling can negatively impact the model's performance and accuracy, resulting in higher residuals.
Model Overfitting or Underfitting Overfitting occurs when the model is too complex and captures noise, while underfitting happens when the model is too simple and misses underlying patterns, both leading to inaccurate predictions.
"},{"location":"concepts/model_monitoring/","title":"Model Monitoring","text":"Amazon SageMaker Model Monitor currently provides the following types of monitoring:
SageWorks Core Classes
These classes interact with many of the AWS service details and are therefore more complex. They provide additional control and refinement over the AWS ML Pipline. For most use cases the API Classes should be used
Welcome to the SageWorks Core Classes
The Core Classes provide low-level APIs for the SageWorks package, these classes directly interface with the AWS Sagemaker Pipeline interfaces and have a large number of methods with reasonable complexity.
The API Classes have method pass-through so just call the method on the API Class and voil\u00e0 it works the same.
"},{"location":"core_classes/overview/#artifacts","title":"Artifacts","text":"Transforms are a set of classes that transform one type of Artifact
to another type. For instance DataToFeatureSet
takes a DataSource
artifact and creates a FeatureSet
artifact.
API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the any class that inherits from the Artifact Class and voil\u00e0 it works the same.
The SageWorks Artifact class is a base/abstract class that defines API implemented by all the child classes (DataSource, FeatureSet, Model, Endpoint).
Artifact: Abstract Base Class for all Artifact classes in SageWorks. Artifacts simply reflect and aggregate one or more AWS Services
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact","title":"Artifact
","text":" Bases: ABC
Artifact: Abstract Base Class for all Artifact classes in SageWorks
Source code insrc/sageworks/core/artifacts/artifact.py
class Artifact(ABC):\n \"\"\"Artifact: Abstract Base Class for all Artifact classes in SageWorks\"\"\"\n\n log = logging.getLogger(\"sageworks\")\n\n def __init__(self, uuid: str):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n \"\"\"\n self.uuid = uuid\n\n # Set up our Boto3 and SageMaker Session and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n self.aws_region = self.aws_account_clamp.region\n\n # The Meta() class pulls and collects metadata from a bunch of AWS Services\n self.aws_broker = AWSServiceBroker()\n from sageworks.api.meta import Meta\n\n self.meta_broker = Meta()\n\n # Config Manager Checks\n self.cm = ConfigManager()\n if not self.cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # Grab our SageWorks Bucket from Config\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Setup Bucket Paths\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Data Cache for Artifacts\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n self.temp_storage = SageWorksCache(prefix=\"temp_storage\", expire=300) # 5 minutes\n self.ephemeral_storage = SageWorksCache(prefix=\"ephemeral_storage\", expire=1) # 1 second\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? 
(very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n\n @classmethod\n def ensure_valid_name(cls, name: str, delimiter: str = \"_\"):\n \"\"\"Check if the ID adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Raises:\n ValueError: If the name/id is not valid.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter)\n if name != valid_name:\n error_msg = f\"{name} doesn't conform and should be converted to: {valid_name}\"\n cls.log.error(error_msg)\n raise ValueError(error_msg)\n\n @staticmethod\n def generate_valid_name(name: str, delimiter: str = \"_\") -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string\n\n Args:\n name (str): The name/id string to check\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Returns:\n str: A generated valid name/id\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"]).lower()\n valid_name = valid_name.replace(\"_\", delimiter)\n valid_name = valid_name.replace(\"-\", delimiter)\n return valid_name\n\n @abstractmethod\n def exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources/Graphs. DataSource/Graph classes need to override this method.\n \"\"\"\n # First, check our cache\n meta_data_key = f\"{self.uuid}_sageworks_meta\"\n meta_data = self.ephemeral_storage.get(meta_data_key)\n if meta_data is not None:\n return meta_data\n\n # Otherwise, fetch the metadata from AWS, store it in the cache, and return it\n meta_data = list_tags_with_throttle(self.arn(), self.sm_session)\n self.ephemeral_storage.set(meta_data_key, meta_data)\n return meta_data\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n\n @abstractmethod\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n\n def ready(self) -> bool:\n \"\"\"Is the Artifact ready? 
Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n\n @abstractmethod\n def onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n\n @abstractmethod\n def details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n\n @abstractmethod\n def size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n\n @abstractmethod\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n\n @abstractmethod\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n\n @abstractmethod\n def arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n\n @abstractmethod\n def delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Upserting SageWorks Metadata for Artifact: {aws_arn}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n\n def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. 
The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n\n def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n\n def set_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_tags\": self.tag_delimiter.join(tags)})\n\n def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n # Syntactic sugar for health tags\n def get_health_tags(self):\n return self.get_tags(tag_type=\"health\")\n\n def set_health_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_health_tags\": self.tag_delimiter.join(tags)})\n\n def add_health_tag(self, tag):\n self.add_tag(tag, tag_type=\"health\")\n\n def remove_health_tag(self, tag):\n self.remove_sageworks_tag(tag, tag_type=\"health\")\n\n # Owner of this artifact\n def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n\n def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n\n def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n\n def set_input(self, input_data: str):\n \"\"\"Set the input data for this 
artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n\n def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n\n def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n if \"unknown\" in self.aws_url():\n health_issues.append(\"aws_url_unknown\")\n return health_issues\n\n def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n\n def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__init__","title":"__init__(uuid)
","text":"Initialize the Artifact Base Class
Parameters:
uuid (str): The UUID of this artifact (required)
Source code in src/sageworks/core/artifacts/artifact.py
def __init__(self, uuid: str):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n \"\"\"\n self.uuid = uuid\n\n # Set up our Boto3 and SageMaker Session and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n self.aws_region = self.aws_account_clamp.region\n\n # The Meta() class pulls and collects metadata from a bunch of AWS Services\n self.aws_broker = AWSServiceBroker()\n from sageworks.api.meta import Meta\n\n self.meta_broker = Meta()\n\n # Config Manager Checks\n self.cm = ConfigManager()\n if not self.cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # Grab our SageWorks Bucket from Config\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Setup Bucket Paths\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Data Cache for Artifacts\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n self.temp_storage = SageWorksCache(prefix=\"temp_storage\", expire=300) # 5 minutes\n self.ephemeral_storage = SageWorksCache(prefix=\"ephemeral_storage\", expire=1) # 1 second\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__post_init__","title":"__post_init__()
","text":"Artifact Post Initialization
Source code insrc/sageworks/core/artifacts/artifact.py
def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? (very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__repr__","title":"__repr__()
","text":"String representation of this artifact
Returns:
str: String representation of this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.add_tag","title":"add_tag(tag, tag_type='user')
","text":"Add a tag for this artifact, ensuring no duplicates and maintaining order. Args: tag (str): Tag to add for this artifact tag_type (str): Type of tag to add (user or health)
Source code insrc/sageworks/core/artifacts/artifact.py
def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.arn","title":"arn()
abstractmethod
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_meta","title":"aws_meta()
abstractmethod
","text":"Get the full AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_url","title":"aws_url()
abstractmethod
","text":"AWS console/web interface for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.created","title":"created()
abstractmethod
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete","title":"delete()
abstractmethod
","text":"Delete this artifact including all related AWS objects
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete_metadata","title":"delete_metadata(key_to_delete)
","text":"Delete specific metadata from this artifact Args: key_to_delete (str): Metadata key to delete
Source code insrc/sageworks/core/artifacts/artifact.py
def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
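One detail worth calling out in delete_metadata() above: large metadata values appear to be split across several AWS tags with a `_chunk_N` suffix, so the delete has to match both the base key and any chunk keys. A minimal, standalone sketch of just that matching step (the tag keys below are made up for illustration):
```
# Sketch of the chunk-key matching used in delete_metadata() above.
# The tag keys here are illustrative, not real AWS tags.
existing_tags_dict = {
    "sageworks_status": "ready",
    "sageworks_column_stats": "{...}",
    "sageworks_column_stats_chunk_1": "{...}",
    "sageworks_column_stats_chunk_2": "{...}",
}

key_to_delete = "sageworks_column_stats"
tag_list_to_delete = [
    key
    for key in existing_tags_dict
    if key == key_to_delete or key.startswith(f"{key_to_delete}_chunk_")
]
print(tag_list_to_delete)
# ['sageworks_column_stats', 'sageworks_column_stats_chunk_1', 'sageworks_column_stats_chunk_2']
```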
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.details","title":"details()
abstractmethod
","text":"Additional Details about this Artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ensure_valid_name","title":"ensure_valid_name(name, delimiter='_')
classmethod
","text":"Check if the ID adheres to the naming conventions for this Artifact.
Parameters:
name (str): The name/id to check (required)
delimiter (str): The delimiter to use in the name/id string (default: \"_\")
Raises:
ValueError: If the name/id is not valid.
Source code insrc/sageworks/core/artifacts/artifact.py
@classmethod\ndef ensure_valid_name(cls, name: str, delimiter: str = \"_\"):\n \"\"\"Check if the ID adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Raises:\n ValueError: If the name/id is not valid.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter)\n if name != valid_name:\n error_msg = f\"{name} doesn't conform and should be converted to: {valid_name}\"\n cls.log.error(error_msg)\n raise ValueError(error_msg)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.exists","title":"exists()
abstractmethod
","text":"Does the Artifact exist? Can we connect to it?
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Artifact when it's ready Returns: list[str]: List of expected metadata keys
Source code insrc/sageworks/core/artifacts/artifact.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.generate_valid_name","title":"generate_valid_name(name, delimiter='_')
staticmethod
","text":"Only allow letters and the specified delimiter, also lowercase the string
Parameters:
name (str): The name/id string to check (required)
delimiter (str): The delimiter to use in the name/id string (default: \"_\")
Returns:
str: A generated valid name/id
Source code insrc/sageworks/core/artifacts/artifact.py
@staticmethod\ndef generate_valid_name(name: str, delimiter: str = \"_\") -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string\n\n Args:\n name (str): The name/id string to check\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Returns:\n str: A generated valid name/id\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"]).lower()\n valid_name = valid_name.replace(\"_\", delimiter)\n valid_name = valid_name.replace(\"-\", delimiter)\n return valid_name\n
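To make the normalization concrete, here is the same logic as a standalone function (so it can be run without any AWS setup); the input strings are just examples:
```
def generate_valid_name(name: str, delimiter: str = "_") -> str:
    # Same normalization as the static method above: keep alphanumerics plus "_" and "-",
    # lowercase everything, then map "_" and "-" to the chosen delimiter
    valid_name = "".join(c for c in name if c.isalnum() or c in ["_", "-"]).lower()
    valid_name = valid_name.replace("_", delimiter)
    valid_name = valid_name.replace("-", delimiter)
    return valid_name

print(generate_valid_name("My Data-Set"))        # mydata_set (the space is simply dropped)
print(generate_valid_name("abalone_data", "-"))  # abalone-data
```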
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_input","title":"get_input()
","text":"Get the input data for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_owner","title":"get_owner()
","text":"Get the owner of this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_status","title":"get_status()
","text":"Get the status for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_tags","title":"get_tags(tag_type='user')
","text":"Get the tags for this artifact Args: tag_type (str): Type of tags to return (user or health) Returns: list[str]: List of tags for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.health_check","title":"health_check()
","text":"Perform a health check on this artifact Returns: list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/artifact.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n if \"unknown\" in self.aws_url():\n health_issues.append(\"aws_url_unknown\")\n return health_issues\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.modified","title":"modified()
abstractmethod
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.onboard","title":"onboard()
abstractmethod
","text":"Onboard this Artifact into SageWorks Returns: bool: True if the Artifact was successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ready","title":"ready()
","text":"Is the Artifact ready? Is initial setup complete and expected metadata populated?
Source code insrc/sageworks/core/artifacts/artifact.py
def ready(self) -> bool:\n \"\"\"Is the Artifact ready? Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n
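The readiness test above reduces to a set-superset check between the expected metadata keys and whatever metadata is actually present; a minimal sketch with plain dictionaries (the values are illustrative):
```
expected_meta = ["sageworks_status"]  # the default from expected_meta()

existing_meta = {"sageworks_status": "ready", "sageworks_tags": "abalone"}
print(set(existing_meta.keys()).issuperset(expected_meta))  # True -> ready

existing_meta = {"sageworks_tags": "abalone"}
print(set(existing_meta.keys()).issuperset(expected_meta))  # False -> not ready (needs onboard)
```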
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.refresh_meta","title":"refresh_meta()
abstractmethod
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_meta","title":"remove_sageworks_meta(key_to_remove)
","text":"Remove SageWorks specific metadata from this Artifact Args: key_to_remove (str): The metadata key to remove Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_tag","title":"remove_sageworks_tag(tag, tag_type='user')
","text":"Remove a tag from this artifact if it exists. Args: tag (str): Tag to remove from this artifact tag_type (str): Type of tag to remove (user or health)
Source code insrc/sageworks/core/artifacts/artifact.py
def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources/Graphs. DataSource/Graph classes need to override this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources/Graphs. DataSource/Graph classes need to override this method.\n \"\"\"\n # First, check our cache\n meta_data_key = f\"{self.uuid}_sageworks_meta\"\n meta_data = self.ephemeral_storage.get(meta_data_key)\n if meta_data is not None:\n return meta_data\n\n # Otherwise, fetch the metadata from AWS, store it in the cache, and return it\n meta_data = list_tags_with_throttle(self.arn(), self.sm_session)\n self.ephemeral_storage.set(meta_data_key, meta_data)\n return meta_data\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_input","title":"set_input(input_data)
","text":"Set the input data for this artifact
Parameters:
input_data (str): Name of input data for this artifact (required)
Note: This breaks the official provenance of the artifact, so use with caution.
Source code insrc/sageworks/core/artifacts/artifact.py
def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_owner","title":"set_owner(owner)
","text":"Set the owner of this artifact
Parameters:
owner (str): Owner to set for this artifact (required)
Source code in src/sageworks/core/artifacts/artifact.py
def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_status","title":"set_status(status)
","text":"Set the status for this artifact Args: status (str): Status to set for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.size","title":"size()
abstractmethod
","text":"Return the size of this artifact in MegaBytes
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.summary","title":"summary()
","text":"This is generic summary information for all Artifacts. If you want to get more detailed information, call the details() method which is implemented by the specific Artifact class
Source code insrc/sageworks/core/artifacts/artifact.py
def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact Args: new_meta (dict): Dictionary of NEW metadata to add Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Upserting SageWorks Metadata for Artifact: {aws_arn}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n
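The dict_to_aws_tags() helper isn't shown on this page, so as a rough illustration only: the add_tags() call on the SageMaker client expects a list of Key/Value pairs, and a simplified stand-in (ignoring details like chunking of oversized values) might look like this:
```
# Hypothetical stand-in for dict_to_aws_tags(); the real helper is not shown here
# and presumably handles more (e.g. splitting oversized values), so treat this as a sketch.
def dict_to_aws_tags_sketch(meta: dict) -> list:
    return [{"Key": key, "Value": str(value)} for key, value in meta.items()]

print(dict_to_aws_tags_sketch({"sageworks_owner": "data-team", "sageworks_status": "ready"}))
# [{'Key': 'sageworks_owner', 'Value': 'data-team'}, {'Key': 'sageworks_status', 'Value': 'ready'}]
```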
"},{"location":"core_classes/artifacts/athena_source/","title":"AthenaSource","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the DataSource API Class and voil\u00e0, it works the same.
AthenaSource: SageWorks Data Source accessible through Athena
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource","title":"AthenaSource
","text":" Bases: DataSourceAbstract
AthenaSource: SageWorks Data Source accessible through Athena
Common Usagemy_data = AthenaSource(data_uuid, database=\"sageworks\")\nmy_data.summary()\nmy_data.details()\ndf = my_data.query(f\"select * from {data_uuid} limit 5\")\n
Source code in src/sageworks/core/artifacts/athena_source.py
class AthenaSource(DataSourceAbstract):\n \"\"\"AthenaSource: SageWorks Data Source accessible through Athena\n\n Common Usage:\n ```\n my_data = AthenaSource(data_uuid, database=\"sageworks\")\n my_data.summary()\n my_data.details()\n df = my_data.query(f\"select * from {data_uuid} limit 5\")\n ```\n \"\"\"\n\n def __init__(self, data_uuid, database=\"sageworks\", force_refresh: bool = False):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n force_refresh (bool): Force refresh of AWS Metadata (default: False)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.ensure_valid_name(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database)\n\n # Flag for metadata cache refresh logic\n self.metadata_refresh_needed = False\n\n # Setup our AWS Metadata Broker\n self.catalog_table_meta = self.meta_broker.data_source_details(\n data_uuid, self.get_database(), refresh=force_refresh\n )\n if self.catalog_table_meta is None:\n self.log.important(f\"Unable to find {self.get_database()}:{self.get_table_name()} in Glue Catalogs...\")\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {self.get_database()}.{self.get_table_name()}\")\n\n def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=True)\n self.catalog_table_meta = _catalog_meta[self.get_database()].get(self.get_table_name())\n self.metadata_refresh_needed = False\n\n def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # We're we able to pull AWS Metadata for this table_name?\"\"\"\n if self.catalog_table_meta is None:\n self.log.debug(f\"AthenaSource {self.get_table_name()} not found in SageWorks Metadata...\")\n return False\n return True\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.get_database()}/{self.get_table_name()}\"\n return arn\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n self.log.info(f\"Retrieving SageWorks Metadata for Artifact: {self.uuid}...\")\n if self.catalog_table_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.get_table_name()}\")\n self.log.critical(\"Malformed Artifact! 
Delete this Artifact and recreate it!\")\n return {}\n\n # Check if we need to refresh our metadata\n if self.metadata_refresh_needed:\n self.refresh_meta()\n\n # Get the SageWorks Metadata from the Catalog Table Metadata\n return sageworks_meta_from_catalog_table_meta(self.catalog_table_meta)\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n self.metadata_refresh_needed = True\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.get_table_name()}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n else:\n raise e\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n\n def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.catalog_table_meta\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.catalog_table_meta[\"CreateTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.catalog_table_meta[\"UpdateTime\"]\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(\n f'select count(*) AS sageworks_count from \"{self.get_database()}\".\"{self.get_table_name()}\"'\n )\n return count_df[\"sageworks_count\"][0]\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.column_names())\n\n def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the 
AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n self.log.info(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n\n def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement in Athena.\"\"\"\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.get_database(),\n boto3_session=self.boto_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n except Exception as e:\n self.log.error(f\"Failed to execute statement: {e}\")\n raise\n\n def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.catalog_table_meta[\"StorageDescriptor\"][\"Location\"]\n\n def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f\"select count(*) as sageworks_count from {self.get_table_name()}\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n\n def sample_impl(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sample_rows.sample_rows(self)\n\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict_json = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict_json and not recompute:\n return stat_dict_json\n\n # Call the SQL function to compute descriptive stats\n stat_dict = descriptive_stats.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n\n def outliers_impl(self, scale: float = 1.5, use_stddev=False, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n recompute (bool): Recompute the outliers (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n 
\"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Note:\n smart = sample data + outliers for the DataSource\"\"\"\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n return all_rows\n\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = correlations.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = column_stats.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n 
value_count_dict = value_counts.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n\n # Check if we have cached version of the DataSource Details\n storage_key = f\"data_source:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f\"select * from {self.get_database()}.{self.get_table_name()} limit 10\"\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.get_database(), boto3_session=self.boto_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the Feature Group exists\n if not self.exists():\n self.log.warning(f\"Trying to delete a AthenaSource that doesn't exist: {self.get_table_name()}\")\n\n # Delete Data Catalog Table\n self.log.info(f\"Deleting DataCatalog Table: {self.get_database()}.{self.get_table_name()}...\")\n wr.catalog.delete_table_if_exists(self.get_database(), self.get_table_name(), boto3_session=self.boto_session)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make sure we add the trailing slash\n s3_path = self.s3_storage_location()\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n\n self.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=self.boto_session)\n except TypeError:\n self.log.warning(\"Malformed Artifact... good thing it's being deleted...\")\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"data_source:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.__init__","title":"__init__(data_uuid, database='sageworks', force_refresh=False)
","text":"AthenaSource Initialization
Parameters:
data_uuid (str): Name of Athena Table (required)
database (str): Athena Database Name (default: sageworks)
force_refresh (bool): Force refresh of AWS Metadata (default: False)
Source code in src/sageworks/core/artifacts/athena_source.py
def __init__(self, data_uuid, database=\"sageworks\", force_refresh: bool = False):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n force_refresh (bool): Force refresh of AWS Metadata (default: False)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.ensure_valid_name(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database)\n\n # Flag for metadata cache refresh logic\n self.metadata_refresh_needed = False\n\n # Setup our AWS Metadata Broker\n self.catalog_table_meta = self.meta_broker.data_source_details(\n data_uuid, self.get_database(), refresh=force_refresh\n )\n if self.catalog_table_meta is None:\n self.log.important(f\"Unable to find {self.get_database()}:{self.get_table_name()} in Glue Catalogs...\")\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {self.get_database()}.{self.get_table_name()}\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.get_database()}/{self.get_table_name()}\"\n return arn\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.athena_test_query","title":"athena_test_query()
","text":"Validate that Athena Queries are working
Source code insrc/sageworks/core/artifacts/athena_source.py
def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f\"select count(*) as sageworks_count from {self.get_table_name()}\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_meta","title":"aws_meta()
","text":"Get the FULL AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.catalog_table_meta\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code insrc/sageworks/core/artifacts/athena_source.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_names","title":"column_names()
","text":"Return the column names for this Athena Table
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in a DataSource
Parameters:
recompute (bool): Recompute the column stats (default: False)
Returns:
dict[dict]: A dictionary of stats for each column in this format
NB: String columns will NOT have num_zeros, descriptive_stats or correlation data {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}, 'correlations': {...}}, ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = column_stats.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_types","title":"column_types()
","text":"Return the column types of the internal AthenaSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.correlations","title":"correlations(recompute=False)
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
recompute (bool): Recompute the correlations (default: False)
Returns:
dict[dict]: A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code insrc/sageworks/core/artifacts/athena_source.py
def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = correlations.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/athena_source.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.catalog_table_meta[\"CreateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete","title":"delete()
","text":"Delete the AWS Data Catalog Table and S3 Storage Objects
Source code insrc/sageworks/core/artifacts/athena_source.py
def delete(self):\n \"\"\"Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an AthenaSource that doesn't exist: {self.get_table_name()}\")\n\n # Delete Data Catalog Table\n self.log.info(f\"Deleting DataCatalog Table: {self.get_database()}.{self.get_table_name()}...\")\n wr.catalog.delete_table_if_exists(self.get_database(), self.get_table_name(), boto3_session=self.boto_session)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make sure we add the trailing slash\n s3_path = self.s3_storage_location()\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n\n self.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=self.boto_session)\n except TypeError:\n self.log.warning(\"Malformed Artifact... good thing it's being deleted...\")\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"data_source:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource
Parameters:
recompute (bool): Recompute the descriptive stats (default: False)
Returns:
dict[dict]: A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict_json = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict_json and not recompute:\n return stat_dict_json\n\n # Call the SQL function to compute descriptive stats\n stat_dict = descriptive_stats.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.details","title":"details(recompute=False)
","text":"Additional Details about this AthenaSource Artifact
Parameters:
recompute (bool): Recompute the details (default: False)
Returns:
dict[dict]: A dictionary of details about this AthenaSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n\n # Check if we have cached version of the DataSource Details\n storage_key = f\"data_source:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f\"select * from {self.get_database()}.{self.get_table_name()} limit 10\"\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.get_database(), boto3_session=self.boto_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.execute_statement","title":"execute_statement(query)
","text":"Execute a non-returning SQL statement in Athena.
Source code insrc/sageworks/core/artifacts/athena_source.py
def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement in Athena.\"\"\"\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.get_database(),\n boto3_session=self.boto_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n except Exception as e:\n self.log.error(f\"Failed to execute statement: {e}\")\n raise\n
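A hedged sketch of a non-returning statement (the DROP TABLE target is purely hypothetical; any Athena DDL/DML that produces no result set fits here):
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\n# Drop a hypothetical scratch table; execute_statement() waits for the query to finish\nds.execute_statement(\"DROP TABLE IF EXISTS sageworks.my_scratch_table\")\n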
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.exists","title":"exists()
","text":"Validation Checks for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def exists(self) -> bool:\n    \"\"\"Validation Checks for this Data Source\"\"\"\n\n    # Were we able to pull AWS Metadata for this table_name?\n    if self.catalog_table_meta is None:\n        self.log.debug(f\"AthenaSource {self.get_table_name()} not found in SageWorks Metadata...\")\n        return False\n    return True\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/athena_source.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.catalog_table_meta[\"UpdateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_columns","title":"num_columns()
","text":"Return the number of columns for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.column_names())\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_rows","title":"num_rows()
","text":"Return the number of rows for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(\n f'select count(*) AS sageworks_count from \"{self.get_database()}\".\"{self.get_table_name()}\"'\n )\n return count_df[\"sageworks_count\"][0]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.outliers_impl","title":"outliers_impl(scale=1.5, use_stddev=False, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource
Parameters:
Name Type Description Defaultscale
float
The scale to use for the IQR (default: 1.5)
1.5
use_stddev
bool
Use Standard Deviation instead of IQR (default: False)
False
recompute
bool
Recompute the outliers (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of outliers from this DataSource
Notes: Uses the IQR * 1.5 (~= 2.5 Sigma); use 1.7 for ~= 3 Sigma. The scale parameter can be adjusted to change the IQR multiplier.
Source code insrc/sageworks/core/artifacts/athena_source.py
def outliers_impl(self, scale: float = 1.5, use_stddev=False, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n recompute (bool): Recompute the outliers (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n
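In practice this is reached through the public outliers() method (defined on DataSourceAbstract below); a small sketch using the same hypothetical DataSource:
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")                      # hypothetical DataSource uuid\noutlier_df = ds.outliers()                             # IQR * 1.5 (~= 2.5 Sigma), cached after the first call\nwide_net_df = ds.outliers(scale=1.7, recompute=True)   # ~= 3 Sigma, forces recomputation\nprint(len(outlier_df), len(wide_net_df))\n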
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the AthenaSource
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/core/artifacts/athena_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n self.log.info(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n
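Example query, mirroring the pattern used by num_rows() above (hypothetical DataSource uuid):
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\nrow_count = ds.query(f'select count(*) AS n from \"{ds.get_database()}\".\"{ds.get_table_name()}\"')\nprint(row_count[\"n\"][0])\n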
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.refresh_meta","title":"refresh_meta()
","text":"Refresh our internal AWS Broker catalog metadata
Source code insrc/sageworks/core/artifacts/athena_source.py
def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=True)\n self.catalog_table_meta = _catalog_meta[self.get_database()].get(self.get_table_name())\n self.metadata_refresh_needed = False\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.s3_storage_location","title":"s3_storage_location()
","text":"Get the S3 Storage Location for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.catalog_table_meta[\"StorageDescriptor\"][\"Location\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n self.log.info(f\"Retrieving SageWorks Metadata for Artifact: {self.uuid}...\")\n if self.catalog_table_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.get_table_name()}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Check if we need to refresh our metadata\n if self.metadata_refresh_needed:\n self.refresh_meta()\n\n # Get the SageWorks Metadata from the Catalog Table Metadata\n return sageworks_meta_from_catalog_table_meta(self.catalog_table_meta)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sample_impl","title":"sample_impl()
","text":"Pull a sample of rows from the DataSource
Returns:
Type DescriptionDataFrame
pd.DataFrame: A sample DataFrame for an Athena DataSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def sample_impl(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sample_rows.sample_rows(self)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/athena_source.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.smart_sample","title":"smart_sample()
","text":"Get a smart sample dataframe for this DataSource
Note: smart = sample data + outliers for the DataSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Note:\n smart = sample data + outliers for the DataSource\"\"\"\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n return all_rows\n
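Usage sketch (hypothetical DataSource uuid); the returned DataFrame carries an outlier_group column, with the sample rows tagged as sample:
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\nsmart_df = ds.smart_sample()       # sample rows + outlier rows, duplicates dropped\nprint(smart_df[\"outlier_group\"].value_counts())\n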
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact
Parameters:
Name Type Description Defaultnew_meta
dict
Dictionary of new metadata to add
required Source code insrc/sageworks/core/artifacts/athena_source.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n self.metadata_refresh_needed = True\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.get_table_name()}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n else:\n raise e\n
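Sketch of adding custom metadata (key names are hypothetical; prefix keys with sageworks_ to avoid the warning and keep AWS metadata intact, and note that non-string values are JSON-encoded before storage):
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\nds.upsert_sageworks_meta({\"sageworks_owner\": \"research-team\",             # hypothetical key/value\n                          \"sageworks_notes\": {\"source\": \"2024 refresh\"}})  # dict value gets JSON-encoded\n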
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.value_counts","title":"value_counts(recompute=False)
","text":"Compute 'value_counts' for all the string columns in a DataSource
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the value counts (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of value counts for each column in the form {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = value_counts.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n
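Usage sketch (hypothetical DataSource uuid and column name):
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\ncounts = ds.value_counts()         # cached in sageworks_meta after the first computation\nprint(counts[\"sex\"])               # hypothetical string column -> {value: count, ...}\n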
"},{"location":"core_classes/artifacts/data_source_abstract/","title":"DataSource Abstract","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the DataSource API Class and voil\u00e0 it works the same.
The DataSource Abstract class is a base/abstract class that defines the API implemented by all the child classes (currently just AthenaSource, with RDSSource and others to follow).
DataSourceAbstract: Abstract Base Class for all data sources (S3: CSV, JSONL, Parquet, RDS, etc)
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract","title":"DataSourceAbstract
","text":" Bases: Artifact
src/sageworks/core/artifacts/data_source_abstract.py
class DataSourceAbstract(Artifact):\n def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n self._display_columns = None\n\n def __post_init__(self):\n # Call superclass post_init\n super().__post_init__()\n\n def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n def get_table_name(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n\n @abstractmethod\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n\n def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details for this Data Source\n Args:\n view (str): The view to get column details for (default: \"all\")\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n names = self.column_names()\n types = self.column_types()\n if view == \"display\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_display_columns()}\n elif view == \"computation\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_computation_columns()}\n elif view == \"all\":\n return {name: type_ for name, type_ in zip(names, types)} # Return the full column details\n else:\n raise ValueError(f\"Unknown column details view: {view}\")\n\n def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this Data Source\n Returns:\n list[str]: The display columns for this Data Source\n \"\"\"\n # Check if we have the display columns in our metadata\n if self._display_columns is None:\n self._display_columns = self.sageworks_meta().get(\"sageworks_display_columns\")\n\n # If we still don't have display columns, try to set them\n if self._display_columns is None:\n # Exclude these automatically generated columns\n exclude_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"id\"]\n\n # We're going to remove any excluded columns from the display columns and limit to 30 total columns\n self._display_columns = [col for col in self.column_names() if col not in exclude_columns][:30]\n\n # Add the outlier_group column if it exists and isn't already in the display columns\n if \"outlier_group\" in self.column_names():\n self._display_columns = list(set(self._display_columns) + set([\"outlier_group\"]))\n\n # Set the display columns in the metadata\n self.set_display_columns(self._display_columns, onboard=False)\n\n # Return the display columns\n return self._display_columns\n\n def set_display_columns(self, display_columns: list[str], onboard: bool = True):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n display_columns (list[str]): The display columns for this Data Source\n onboard (bool): Onboard the Data Source after setting the 
display columns (default: True)\n \"\"\"\n self.log.important(f\"Setting Display Columns...{display_columns}\")\n self._display_columns = display_columns\n self.upsert_sageworks_meta({\"sageworks_display_columns\": self._display_columns})\n if onboard:\n self.onboard()\n\n def num_display_columns(self) -> int:\n \"\"\"Return the number of display columns for this Data Source\"\"\"\n return len(self._display_columns) if self._display_columns else 0\n\n def get_computation_columns(self) -> list[str]:\n return self.get_display_columns()\n\n def set_computation_columns(self, computation_columns: list[str]):\n self.set_display_columns(computation_columns)\n\n def num_computation_columns(self) -> int:\n return self.num_display_columns()\n\n @abstractmethod\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n @abstractmethod\n def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSource\n Args:\n recompute (bool): Recompute the sample (default: False)\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n\n # Check if we have a cached sample of rows\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute a sample of data\n self.log.info(f\"Sampling {self.uuid}...\")\n df = self.sample_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n\n @abstractmethod\n def sample_impl(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n\n @abstractmethod\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Check if we have cached outliers\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute the outliers\n self.log.info(f\"Computing Outliers {self.uuid}...\")\n df = self.outliers_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n\n @abstractmethod\n def outliers_impl(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute 
(bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n\n @abstractmethod\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n\n @abstractmethod\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n\n def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"num_display_columns\"] = self.num_display_columns()\n details[\"column_details\"] = self.column_details()\n return details\n\n def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n\n def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # Check if the samples and outliers have been computed\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have sample() calling it...\")\n self.sample()\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have outliers() calling it...\")\n try:\n self.outliers()\n except KeyError:\n self.log.error(\"DataSource outliers() failed...recomputing columns stats and trying again...\")\n self.column_stats(recompute=True)\n self.refresh_meta()\n self.outliers()\n\n # Okay so we have the samples and outliers, so we are ready\n return True\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n self.sample(recompute=True)\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and 
value_counts\n self.outliers(recompute=True)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.__init__","title":"__init__(data_uuid, database='sageworks')
","text":"DataSourceAbstract: Abstract Base Class for all data sources Args: data_uuid(str): The UUID for this Data Source database(str): The database to use for this Data Source (default: sageworks)
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n self._display_columns = None\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_details","title":"column_details(view='all')
","text":"Return the column details for this Data Source Args: view (str): The view to get column details for (default: \"all\") Returns: dict: The column details for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details for this Data Source\n Args:\n view (str): The view to get column details for (default: \"all\")\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n names = self.column_names()\n types = self.column_types()\n if view == \"display\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_display_columns()}\n elif view == \"computation\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_computation_columns()}\n elif view == \"all\":\n return {name: type_ for name, type_ in zip(names, types)} # Return the full column details\n else:\n raise ValueError(f\"Unknown column details view: {view}\")\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_names","title":"column_names()
abstractmethod
","text":"Return the column names for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_names(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_stats","title":"column_stats(recompute=False)
abstractmethod
","text":"Compute Column Stats for all the columns in a DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_types","title":"column_types()
abstractmethod
","text":"Return the column types for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.descriptive_stats","title":"descriptive_stats(recompute=False)
abstractmethod
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: recompute (bool): Recompute the descriptive stats (default: False) Returns: dict(dict): A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.details","title":"details()
","text":"Additional Details about this DataSourceAbstract Artifact
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"num_display_columns\"] = self.num_display_columns()\n details[\"column_details\"] = self.column_details()\n return details\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.execute_statement","title":"execute_statement(query)
abstractmethod
","text":"Execute a non-returning SQL statement Args: query(str): The SQL query to execute
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.expected_meta","title":"expected_meta()
","text":"DataSources have quite a bit of expected Metadata for EDA displays
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_database","title":"get_database()
","text":"Get the database for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_display_columns","title":"get_display_columns()
","text":"Get the display columns for this Data Source Returns: list[str]: The display columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this Data Source\n Returns:\n list[str]: The display columns for this Data Source\n \"\"\"\n # Check if we have the display columns in our metadata\n if self._display_columns is None:\n self._display_columns = self.sageworks_meta().get(\"sageworks_display_columns\")\n\n # If we still don't have display columns, try to set them\n if self._display_columns is None:\n # Exclude these automatically generated columns\n exclude_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"id\"]\n\n # We're going to remove any excluded columns from the display columns and limit to 30 total columns\n self._display_columns = [col for col in self.column_names() if col not in exclude_columns][:30]\n\n # Add the outlier_group column if it exists and isn't already in the display columns\n if \"outlier_group\" in self.column_names():\n self._display_columns = list(set(self._display_columns) + set([\"outlier_group\"]))\n\n # Set the display columns in the metadata\n self.set_display_columns(self._display_columns, onboard=False)\n\n # Return the display columns\n return self._display_columns\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_table_name","title":"get_table_name()
","text":"Get the base table name for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_table_name(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_columns","title":"num_columns()
abstractmethod
","text":"Return the number of columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_display_columns","title":"num_display_columns()
","text":"Return the number of display columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def num_display_columns(self) -> int:\n \"\"\"Return the number of display columns for this Data Source\"\"\"\n return len(self._display_columns) if self._display_columns else 0\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_rows","title":"num_rows()
abstractmethod
","text":"Return the number of rows for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the data source (make it ready)
Returns:
Name Type Descriptionbool
bool
True if the DataSource was onboarded successfully
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n self.sample(recompute=True)\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.outliers(recompute=True)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
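Typical pattern, assuming the same hypothetical DataSource; onboard() blocks while samples, column stats, and outliers are computed:
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\nif not ds.ready():\n    ds.onboard()                   # BLOCKING: sample, column_stats, outliers, health_check, details\nprint(ds.details()[\"num_rows\"])\n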
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Return a DataFrame of outliers from this DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Check if we have cached outliers\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute the outliers\n self.log.info(f\"Computing Outliers {self.uuid}...\")\n df = self.outliers_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers_impl","title":"outliers_impl(scale=1.5, recompute=False)
abstractmethod
","text":"Return a DataFrame of outliers from this DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef outliers_impl(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.query","title":"query(query)
abstractmethod
","text":"Query the DataSourceAbstract Args: query(str): The SQL query to execute
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.ready","title":"ready()
","text":"Is the DataSource ready?
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # Check if the samples and outliers have been computed\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have sample() calling it...\")\n self.sample()\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have outliers() calling it...\")\n try:\n self.outliers()\n except KeyError:\n self.log.error(\"DataSource outliers() failed...recomputing columns stats and trying again...\")\n self.column_stats(recompute=True)\n self.refresh_meta()\n self.outliers()\n\n # Okay so we have the samples and outliers, so we are ready\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample","title":"sample(recompute=False)
","text":"Return a sample DataFrame from this DataSource Args: recompute (bool): Recompute the sample (default: False) Returns: pd.DataFrame: A sample DataFrame from this DataSource
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSource\n Args:\n recompute (bool): Recompute the sample (default: False)\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n\n # Check if we have a cached sample of rows\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute a sample of data\n self.log.info(f\"Sampling {self.uuid}...\")\n df = self.sample_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample_impl","title":"sample_impl()
abstractmethod
","text":"Return a sample DataFrame from this DataSourceAbstract Returns: pd.DataFrame: A sample DataFrame from this DataSource
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef sample_impl(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_display_columns","title":"set_display_columns(display_columns, onboard=True)
","text":"Set the display columns for this Data Source
Parameters:
Name Type Description Defaultdisplay_columns
list[str]
The display columns for this Data Source
requiredonboard
bool
Onboard the Data Source after setting the display columns (default: True)
True
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_display_columns(self, display_columns: list[str], onboard: bool = True):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n display_columns (list[str]): The display columns for this Data Source\n onboard (bool): Onboard the Data Source after setting the display columns (default: True)\n \"\"\"\n self.log.important(f\"Setting Display Columns...{display_columns}\")\n self._display_columns = display_columns\n self.upsert_sageworks_meta({\"sageworks_display_columns\": self._display_columns})\n if onboard:\n self.onboard()\n
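Sketch with hypothetical column names; onboard=False just stores the choice, while the default onboard=True re-onboards the DataSource (a blocking call):
from sageworks.core.artifacts.athena_source import AthenaSource\n\nds = AthenaSource(\"abalone_data\")  # hypothetical DataSource uuid\nds.set_display_columns([\"length\", \"diameter\", \"weight\"], onboard=False)  # hypothetical columns\nprint(ds.num_display_columns())\n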
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.smart_sample","title":"smart_sample()
abstractmethod
","text":"Get a SMART sample dataframe from this DataSource Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.value_counts","title":"value_counts(recompute=False)
abstractmethod
","text":"Compute 'value_counts' for all the string columns in a DataSource Args: recompute (bool): Recompute the value counts (default: False) Returns: dict(dict): A dictionary of value counts for each column in the form {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/endpoint_core/","title":"EndpointCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Endpoint API Class and voil\u00e0 it works the same.
EndpointCore: SageWorks EndpointCore Class
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore","title":"EndpointCore
","text":" Bases: Artifact
EndpointCore: SageWorks EndpointCore Class
Common Usage: my_endpoint = EndpointCore(endpoint_uuid)\nprediction_df = my_endpoint.predict(test_df)\nmetrics = my_endpoint.regression_metrics(target_column, prediction_df)\nfor metric, value in metrics.items():\n    print(f\"{metric}: {value:0.3f}\")\n
Source code in src/sageworks/core/artifacts/endpoint_core.py
class EndpointCore(Artifact):\n \"\"\"EndpointCore: SageWorks EndpointCore Class\n\n Common Usage:\n ```\n my_endpoint = EndpointCore(endpoint_uuid)\n prediction_df = my_endpoint.predict(test_df)\n metrics = my_endpoint.regression_metrics(target_column, prediction_df)\n for metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid, force_refresh: bool = False, legacy: bool = False):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n if not legacy:\n self.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=force_refresh).get(\n self.endpoint_name\n )\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n self.endpoint_retry = 0\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=True).get(\n self.endpoint_name\n )\n\n def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n 
self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n\n def is_serverless(self):\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n\n def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n\n def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n\n def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n\n def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n # Check if we have cached version of the FeatureSet Details\n details_key = f\"endpoint:{self.uuid}:details\"\n\n cached_details = self.data_storage.get(details_key)\n if cached_details and not recompute:\n # Update the endpoint metrics before returning cached details\n endpoint_metrics = self.endpoint_metrics()\n cached_details[\"endpoint_metrics\"] = endpoint_metrics\n return cached_details\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if 
\"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add the underlying model details\n details[\"model_name\"] = self.model_name\n model_details = self.model_details()\n details[\"model_type\"] = model_details.get(\"model_type\", \"unknown\")\n details[\"model_metrics\"] = model_details.get(\"model_metrics\")\n details[\"confusion_matrix\"] = model_details.get(\"confusion_matrix\")\n details[\"predictions\"] = model_details.get(\"predictions\")\n details[\"inference_meta\"] = model_details.get(\"inference_meta\")\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Cache the details\n self.data_storage.set(details_key, details)\n\n # Return the details\n return details\n\n def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.error(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def model_details(self) -> dict:\n \"\"\"Return the details about the model used in this Endpoint\"\"\"\n if self.model_name == \"unknown\":\n return {}\n else:\n model = ModelCore(self.model_name)\n if model.exists():\n return model.details()\n else:\n return {}\n\n def model_type(self) -> str:\n \"\"\"Return the type of model used in this Endpoint\"\"\"\n return self.details().get(\"model_type\", \"unknown\")\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # This import needs to happen here (instead of top of file) to avoid circular imports\n from sageworks.utils.endpoint_utils import fs_evaluation_data\n\n eval_data_df = fs_evaluation_data(self)\n capture_uuid = \"training_holdout\" if capture else None\n return self.inference(eval_data_df, capture_uuid)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = 
None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n\n # Get the target column\n target_column = ModelCore(self.model_name).target()\n\n # Sanity Check that the target column is present\n if target_column not in prediction_df.columns:\n self.log.warning(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.warning(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = self.model_type()\n if model_type in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER.value:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # Unknown Model Type: Give log message and set metrics to empty dataframe\n self.log.warning(f\"Unknown Model Type: {model_type}\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(capture_uuid, prediction_df, target_column, metrics, description, id_column)\n\n # Return the prediction DataFrame\n return prediction_df\n\n def _predict(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Internal: Run prediction on the given observations in the given DataFrame\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n Returns:\n pd.DataFrame: Return the DataFrame with additional columns, prediction and any _proba columns\n \"\"\"\n\n # Make sure the eval_df has the features used to train the model\n features = ModelCore(self.model_name).features()\n if features and not set(features).issubset(eval_df.columns):\n raise ValueError(f\"DataFrame does not contain required features: {features}\")\n\n # Create our Endpoint Predictor Class\n predictor = Predictor(\n self.endpoint_name,\n sagemaker_session=self.sm_session,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n )\n\n # Now split up the dataframe into 500 row chunks, send those chunks to our\n # endpoint (with error handling) and stitch all the chunks back together\n df_list = []\n for index in range(0, len(eval_df), 500):\n print(\"Processing...\")\n\n # Compute partial DataFrames, add them to a list, and concatenate at the end\n partial_df = self._endpoint_error_handling(predictor, eval_df[index : index + 500])\n df_list.append(partial_df)\n\n # Concatenate the dataframes\n combined_df = pd.concat(df_list, ignore_index=True)\n\n # Convert data to numeric\n # Note: Since we're using CSV serializers numeric columns often get changed to generic 'object' types\n\n # Hard Conversion\n # Note: We 
explicitly catch exceptions for columns that cannot be converted to numeric\n converted_df = combined_df.copy()\n for column in combined_df.columns:\n try:\n converted_df[column] = pd.to_numeric(combined_df[column])\n except ValueError:\n # If a ValueError is raised, the column cannot be converted to numeric, so we keep it as is\n pass\n\n # Soft Conversion\n # Convert columns to the best possible dtype that supports the pd.NA missing value.\n converted_df = converted_df.convert_dtypes()\n\n # Return the Dataframe\n return converted_df\n\n def _endpoint_error_handling(self, predictor, feature_df):\n \"\"\"Internal: Method that handles Errors, Retries, and Binary Search for Error Row(s)\"\"\"\n\n # Convert the DataFrame into a CSV buffer\n csv_buffer = StringIO()\n feature_df.to_csv(csv_buffer, index=False)\n\n # Error Handling if the Endpoint gives back an error\n try:\n # Send the CSV Buffer to the predictor\n results = predictor.predict(csv_buffer.getvalue())\n\n # Construct a DataFrame from the results\n results_df = pd.DataFrame.from_records(results[1:], columns=results[0])\n\n # Capture the return columns\n self.endpoint_return_columns = results_df.columns.tolist()\n\n # Return the results dataframe\n return results_df\n\n except botocore.exceptions.ClientError as err:\n if err.response[\"Error\"][\"Code\"] == \"ModelError\": # Model Error\n # Report the error and raise an exception\n self.log.critical(f\"Endpoint prediction error: {err.response.get('Message')}\")\n raise err\n\n # Base case: DataFrame with 1 Row\n if len(feature_df) == 1:\n # If we don't have ANY known good results we're kinda screwed\n if not self.endpoint_return_columns:\n raise err\n\n # Construct an Error DataFrame (one row of NaNs in the return columns)\n results_df = self._error_df(feature_df, self.endpoint_return_columns)\n return results_df\n\n # Recurse on binary splits of the dataframe\n num_rows = len(feature_df)\n split = int(num_rows / 2)\n first_half = self._endpoint_error_handling(predictor, feature_df[0:split])\n second_half = self._endpoint_error_handling(predictor, feature_df[split:num_rows])\n return pd.concat([first_half, second_half], ignore_index=True)\n\n # Catch the botocore.errorfactory.ModelNotReadyException\n # Note: This is a SageMaker specific error that sometimes occurs\n # when the endpoint hasn't been used in a long time.\n except botocore.errorfactory.ModelNotReadyException as err:\n if self.endpoint_retry >= 3:\n raise err\n self.endpoint_retry += 1\n self.log.critical(f\"Endpoint model not ready: {err}\")\n self.log.critical(\"Waiting and Retrying...\")\n time.sleep(30)\n return self._endpoint_error_handling(predictor, feature_df)\n\n def _error_df(self, df, all_columns):\n \"\"\"Internal: Method to construct an Error DataFrame (a Pandas DataFrame with one row of NaNs)\"\"\"\n # Create a new dataframe with all NaNs\n error_df = pd.DataFrame(dict(zip(all_columns, [[np.NaN]] * len(self.endpoint_return_columns))))\n # Now set the original values for the incoming dataframe\n for column in df.columns:\n error_df[column] = df[column].values\n return error_df\n\n def _capture_inference_results(\n self,\n capture_uuid: str,\n pred_results_df: pd.DataFrame,\n target_column: str,\n metrics: pd.DataFrame,\n description: str,\n id_column: str = None,\n ):\n \"\"\"Internal: Capture the inference results and metrics to S3\n\n Args:\n capture_uuid (str): UUID of the inference capture\n pred_results_df (pd.DataFrame): DataFrame with the prediction results\n target_column (str): Name of the target 
column\n metrics (pd.DataFrame): DataFrame with the performance metrics\n description (str): Description of the inference results\n id_column (str, optional): Name of the ID column (default=None)\n \"\"\"\n\n # Compute a dataframe hash (just use the last 8)\n data_hash = joblib.hash(pred_results_df)[:8]\n\n # Metadata for the model inference\n inference_meta = {\n \"name\": capture_uuid,\n \"data_hash\": data_hash,\n \"num_rows\": len(pred_results_df),\n \"description\": description,\n }\n\n # Create the S3 Path for the Inference Capture\n inference_capture_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Write the metadata dictionary, and metrics to our S3 Model Inference Folder\n wr.s3.to_json(\n pd.DataFrame([inference_meta]),\n f\"{inference_capture_path}/inference_meta.json\",\n index=False,\n )\n self.log.info(f\"Writing metrics to {inference_capture_path}/inference_metrics.csv\")\n wr.s3.to_csv(metrics, f\"{inference_capture_path}/inference_metrics.csv\", index=False)\n\n # Grab the target column, prediction column, any _proba columns, and the ID column (if present)\n prediction_col = \"prediction\" if \"prediction\" in pred_results_df.columns else \"predictions\"\n output_columns = [target_column, prediction_col]\n\n # Add any _proba columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.endswith(\"_proba\")]\n\n # Add any quantile columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.startswith(\"q_\") or col.startswith(\"qr_\")]\n\n # Add the ID column\n if id_column and id_column in pred_results_df.columns:\n output_columns.append(id_column)\n\n # Write the predictions to our S3 Model Inference Folder\n self.log.info(f\"Writing predictions to {inference_capture_path}/inference_predictions.csv\")\n subset_df = pred_results_df[output_columns]\n wr.s3.to_csv(subset_df, f\"{inference_capture_path}/inference_predictions.csv\", index=False)\n\n # CLASSIFIER: Write the confusion matrix to our S3 Model Inference Folder\n model_type = self.model_type()\n if model_type == ModelType.CLASSIFIER.value:\n conf_mtx = self.confusion_matrix(target_column, pred_results_df)\n self.log.info(f\"Writing confusion matrix to {inference_capture_path}/inference_cm.csv\")\n # Note: Unlike other dataframes here, we want to write the index (labels) to the CSV\n wr.s3.to_csv(conf_mtx, f\"{inference_capture_path}/inference_cm.csv\", index=True)\n\n # Generate SHAP values for our Prediction Dataframe\n generate_shap_values(self.endpoint_name, model_type, pred_results_df, inference_capture_path)\n\n # Now recompute the details for our Model\n self.log.important(f\"Recomputing Details for {self.model_name} to show latest Inference Results...\")\n model = ModelCore(self.model_name)\n model._load_inference_metrics(capture_uuid)\n model.details(recompute=True)\n\n # Recompute the details so that inference model metrics are updated\n self.log.important(f\"Recomputing Details for {self.uuid} to show latest Inference Results...\")\n self.details(recompute=True)\n\n @staticmethod\n def regression_metrics(target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = 
\"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = root_mean_squared_error(y_true, y_pred)\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n\n def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n # Sanity Check that this is a regression model\n if self.model_type() not in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n self.log.warning(\"Residuals are only computed for regression models\")\n return prediction_df\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Add the residuals and the absolute values to the DataFrame\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n return prediction_df\n\n def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Get a list of unique labels\n labels = prediction_df[target_column].unique()\n\n # Calculate scores\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column], prediction_df[prediction_col], average=None, labels=labels\n )\n\n # Calculate ROC AUC\n # ROC-AUC score measures the model's ability to distinguish between classes;\n # - A value of 0.5 indicates no discrimination (equivalent to random guessing)\n # - A score close to 1 indicates high discriminative power\n\n # Sanity check for older versions that have a single column for probability\n if \"pred_proba\" in prediction_df.columns:\n self.log.error(\"Older version of prediction output detected, rerun inference...\")\n roc_auc = [0.0] * len(labels)\n\n # Convert probability columns to a 2D NumPy array\n else:\n proba_columns = [col for col in prediction_df.columns if col.endswith(\"_proba\")]\n y_score = prediction_df[proba_columns].to_numpy()\n\n # One-hot encode the true labels\n lb = LabelBinarizer()\n lb.fit(prediction_df[target_column])\n y_true = lb.transform(prediction_df[target_column])\n\n # Compute ROC AUC\n roc_auc = roc_auc_score(y_true, y_score, multi_class=\"ovr\", average=None)\n\n # Put the scores into a dataframe\n score_df = pd.DataFrame(\n {\n target_column: labels,\n \"precision\": scores[0],\n 
\"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n\n def confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Special case for low, medium, high classes\n if (set(y_true) | set(y_pred)) == {\"low\", \"medium\", \"high\"}:\n labels = [\"low\", \"medium\", \"high\"]\n else:\n labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix\n conf_mtx = confusion_matrix(y_true, y_pred, labels=labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=labels, columns=labels)\n conf_mtx_df.index.name = \"labels\"\n return conf_mtx_df\n\n def endpoint_config_name(self) -> str:\n # Grab the Endpoint Config Name from the AWS\n details = self.sm_client.describe_endpoint(EndpointName=self.endpoint_name)\n return details[\"EndpointConfigName\"]\n\n def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. Defaults to False.\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! 
It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n self.delete_endpoint_models()\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n try:\n self.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n\n # Check for any monitoring schedules\n response = self.sm_client.list_monitoring_schedules(EndpointName=self.uuid)\n monitoring_schedules = response[\"MonitoringScheduleSummaries\"]\n for schedule in monitoring_schedules:\n self.log.info(f\"Deleting Endpoint Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n self.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete any inference, data_capture or monitoring artifacts\n for s3_path in [self.endpoint_inference_path, self.endpoint_data_capture_path, self.endpoint_monitoring_path]:\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=self.boto_session)\n for obj in objects:\n self.log.info(f\"Deleting S3 Object {obj}...\")\n wr.s3.delete_objects(objects, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"endpoint:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Okay now delete the Endpoint\n try:\n time.sleep(2) # Let AWS catch up with any deletions performed above\n self.log.info(f\"Deleting Endpoint {self.uuid}...\")\n self.sm_client.delete_endpoint(EndpointName=self.uuid)\n except botocore.exceptions.ClientError as e:\n self.log.info(\"Endpoint ClientError...\")\n raise e\n\n # One more sleep to let AWS fully register the endpoint deletion\n time.sleep(5)\n\n def delete_endpoint_models(self):\n \"\"\"Delete the underlying Model for an Endpoint\"\"\"\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = self.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {self.uuid} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n self.log.info(f\"Deleting Model {model_name}...\")\n try:\n self.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n self.log.warning(f\"Model {model_name} is still in use...\")\n else:\n self.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.__init__","title":"__init__(endpoint_uuid, force_refresh=False, legacy=False)
","text":"EndpointCore Initialization
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `endpoint_uuid` | `str` | Name of Endpoint in SageWorks | required |
| `force_refresh` | `bool` | Force a refresh of the AWS Broker. Defaults to False. | `False` |
| `legacy` | `bool` | Force load of legacy models. Defaults to False. | `False` |
Source code in src/sageworks/core/artifacts/endpoint_core.py
def __init__(self, endpoint_uuid, force_refresh: bool = False, legacy: bool = False):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n if not legacy:\n self.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=force_refresh).get(\n self.endpoint_name\n )\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n self.endpoint_retry = 0\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n
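A minimal usage sketch for the constructor (the endpoint name below is a placeholder; substitute an endpoint that already exists in your AWS account):

```python
from sageworks.core.artifacts.endpoint_core import EndpointCore

# "abalone-regression-end" is a placeholder endpoint name
end = EndpointCore("abalone-regression-end", force_refresh=True)

# Sanity check before using the object
if end.exists():
    print(end.details())
```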
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.add_data_capture","title":"add_data_capture()
","text":"Add data capture to the endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the endpoint using FeatureSet data
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `capture` | `bool` | Capture the inference results and metrics (default=False) | `False` |
Source code in src/sageworks/core/artifacts/endpoint_core.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # This import needs to happen here (instead of top of file) to avoid circular imports\n from sageworks.utils.endpoint_utils import fs_evaluation_data\n\n eval_data_df = fs_evaluation_data(self)\n capture_uuid = \"training_holdout\" if capture else None\n return self.inference(eval_data_df, capture_uuid)\n
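A short usage sketch, assuming the endpoint and its backing FeatureSet already exist; with `capture=True` the results are stored under the `training_holdout` capture UUID:

```python
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")  # placeholder endpoint name

# Pull evaluation rows from the FeatureSet, run predictions, and capture results/metrics to S3
pred_df = end.auto_inference(capture=True)
print(pred_df.head())
```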
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.classification_metrics","title":"classification_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Get a list of unique labels\n labels = prediction_df[target_column].unique()\n\n # Calculate scores\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column], prediction_df[prediction_col], average=None, labels=labels\n )\n\n # Calculate ROC AUC\n # ROC-AUC score measures the model's ability to distinguish between classes;\n # - A value of 0.5 indicates no discrimination (equivalent to random guessing)\n # - A score close to 1 indicates high discriminative power\n\n # Sanity check for older versions that have a single column for probability\n if \"pred_proba\" in prediction_df.columns:\n self.log.error(\"Older version of prediction output detected, rerun inference...\")\n roc_auc = [0.0] * len(labels)\n\n # Convert probability columns to a 2D NumPy array\n else:\n proba_columns = [col for col in prediction_df.columns if col.endswith(\"_proba\")]\n y_score = prediction_df[proba_columns].to_numpy()\n\n # One-hot encode the true labels\n lb = LabelBinarizer()\n lb.fit(prediction_df[target_column])\n y_true = lb.transform(prediction_df[target_column])\n\n # Compute ROC AUC\n roc_auc = roc_auc_score(y_true, y_score, multi_class=\"ovr\", average=None)\n\n # Put the scores into a dataframe\n score_df = pd.DataFrame(\n {\n target_column: labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n
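A usage sketch for a classifier endpoint; the prediction DataFrame needs the target column, a `prediction`/`predictions` column, and the per-class `*_proba` columns, all of which `inference()` produces (the endpoint name, evaluation DataFrame, and target column below are placeholders):

```python
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("wine-classification-end")      # placeholder endpoint name
pred_df = end.inference(my_eval_df)                # my_eval_df: your evaluation DataFrame

# Per-class precision, recall, fscore, roc_auc, and support
score_df = end.classification_metrics("wine_class", pred_df)   # placeholder target column
print(score_df)
```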
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.confusion_matrix","title":"confusion_matrix(target_column, prediction_df)
","text":"Compute the confusion matrix for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the confusion matrix
Source code in src/sageworks/core/artifacts/endpoint_core.py
def confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Special case for low, medium, high classes\n if (set(y_true) | set(y_pred)) == {\"low\", \"medium\", \"high\"}:\n labels = [\"low\", \"medium\", \"high\"]\n else:\n labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix\n conf_mtx = confusion_matrix(y_true, y_pred, labels=labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=labels, columns=labels)\n conf_mtx_df.index.name = \"labels\"\n return conf_mtx_df\n
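A usage sketch continuing the classifier example above; the returned DataFrame has one row and column per label, with the index (true labels) named "labels":

```python
conf_mtx_df = end.confusion_matrix("wine_class", pred_df)   # placeholder target column
print(conf_mtx_df)
```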
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/endpoint_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete","title":"delete()
","text":"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n self.delete_endpoint_models()\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n try:\n self.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n\n # Check for any monitoring schedules\n response = self.sm_client.list_monitoring_schedules(EndpointName=self.uuid)\n monitoring_schedules = response[\"MonitoringScheduleSummaries\"]\n for schedule in monitoring_schedules:\n self.log.info(f\"Deleting Endpoint Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n self.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete any inference, data_capture or monitoring artifacts\n for s3_path in [self.endpoint_inference_path, self.endpoint_data_capture_path, self.endpoint_monitoring_path]:\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=self.boto_session)\n for obj in objects:\n self.log.info(f\"Deleting S3 Object {obj}...\")\n wr.s3.delete_objects(objects, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"endpoint:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Okay now delete the Endpoint\n try:\n time.sleep(2) # Let AWS catch up with any deletions performed above\n self.log.info(f\"Deleting Endpoint {self.uuid}...\")\n self.sm_client.delete_endpoint(EndpointName=self.uuid)\n except botocore.exceptions.ClientError as e:\n self.log.info(\"Endpoint ClientError...\")\n raise e\n\n # One more sleep to let AWS fully register the endpoint deletion\n time.sleep(5)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete_endpoint_models","title":"delete_endpoint_models()
","text":"Delete the underlying Model for an Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def delete_endpoint_models(self):\n \"\"\"Delete the underlying Model for an Endpoint\"\"\"\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = self.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {self.uuid} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n self.log.info(f\"Deleting Model {model_name}...\")\n try:\n self.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n self.log.warning(f\"Model {model_name} is still in use...\")\n else:\n self.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.details","title":"details(recompute=False)
","text":"Additional Details about this Endpoint Args: recompute (bool): Recompute the details (default: False) Returns: dict(dict): A dictionary of details about this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n # Check if we have cached version of the FeatureSet Details\n details_key = f\"endpoint:{self.uuid}:details\"\n\n cached_details = self.data_storage.get(details_key)\n if cached_details and not recompute:\n # Update the endpoint metrics before returning cached details\n endpoint_metrics = self.endpoint_metrics()\n cached_details[\"endpoint_metrics\"] = endpoint_metrics\n return cached_details\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add the underlying model details\n details[\"model_name\"] = self.model_name\n model_details = self.model_details()\n details[\"model_type\"] = model_details.get(\"model_type\", \"unknown\")\n details[\"model_metrics\"] = model_details.get(\"model_metrics\")\n details[\"confusion_matrix\"] = model_details.get(\"confusion_matrix\")\n details[\"predictions\"] = model_details.get(\"predictions\")\n details[\"inference_meta\"] = model_details.get(\"inference_meta\")\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Cache the details\n self.data_storage.set(details_key, details)\n\n # Return the details\n return details\n
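A usage sketch; details are cached, so pass `recompute=True` when you want them rebuilt from AWS (the endpoint metrics are refreshed either way):

```python
end = EndpointCore("abalone-regression-end")   # placeholder endpoint name

info = end.details()                  # fast: served from the SageWorks cache when available
info = end.details(recompute=True)    # slower: rebuilds status, variant, model details, and metrics
print(info["status"], info["instance"], info["model_name"])
```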
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.endpoint_metrics","title":"endpoint_metrics()
","text":"Return the metrics for this endpoint
Returns:

| Type | Description |
|------|-------------|
| `Union[DataFrame, None]` | DataFrame with the metrics for this endpoint (or None if no metrics) |
Source code in src/sageworks/core/artifacts/endpoint_core.py
def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n
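A usage sketch; the return value can be None when the endpoint metadata has no ProductionVariants yet:

```python
metrics_df = end.endpoint_metrics()
if metrics_df is not None:
    # Summarize the CloudWatch invocation and error counts
    print(metrics_df[["Invocations", "Invocation5XXErrors"]].sum())
```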
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/endpoint_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.get_monitor","title":"get_monitor()
","text":"Get the MonitorCore class for this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.health_check","title":"health_check()
","text":"Perform a health check on this model
Returns:

| Type | Description |
|------|-------------|
| `list[str]` | List of health issues |
Source code in src/sageworks/core/artifacts/endpoint_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n
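A usage sketch; an empty list means no health issues were found:

```python
issues = end.health_check()
if issues:
    print(f"Health issues for {end.uuid}: {issues}")   # e.g. ["5xx_errors", "no_activity"]
else:
    print("Endpoint looks healthy")
```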
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference and compute performance metrics with optional capture
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `eval_df` | `DataFrame` | DataFrame to run predictions on (must have superset of features) | required |
| `capture_uuid` | `str` | UUID of the inference capture (default=None) | `None` |
| `id_column` | `str` | Name of the ID column (default=None) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | DataFrame with the inference results |

Note: If capture=True inference/performance metrics are written to the S3 Endpoint Inference Folder
Source code in src/sageworks/core/artifacts/endpoint_core.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n\n # Get the target column\n target_column = ModelCore(self.model_name).target()\n\n # Sanity Check that the target column is present\n if target_column not in prediction_df.columns:\n self.log.warning(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.warning(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = self.model_type()\n if model_type in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER.value:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # Unknown Model Type: Give log message and set metrics to empty dataframe\n self.log.warning(f\"Unknown Model Type: {model_type}\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(capture_uuid, prediction_df, target_column, metrics, description, id_column)\n\n # Return the prediction DataFrame\n return prediction_df\n
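A usage sketch, assuming you already have an evaluation DataFrame containing (a superset of) the model's feature columns; `capture_uuid` and `id_column` are optional and the names below are placeholders:

```python
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")   # placeholder endpoint name

# eval_df: your own DataFrame with the model's feature columns
pred_df = end.inference(eval_df, capture_uuid="my_eval_run", id_column="id")
print(pred_df.head())
```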
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.is_serverless","title":"is_serverless()
","text":"Check if the current endpoint is serverless.
Returns:

| Type | Description |
|------|-------------|
| `bool` | True if the endpoint is serverless, False otherwise. |
Source code in src/sageworks/core/artifacts/endpoint_core.py
def is_serverless(self):\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.model_details","title":"model_details()
","text":"Return the details about the model used in this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def model_details(self) -> dict:\n \"\"\"Return the details about the model used in this Endpoint\"\"\"\n if self.model_name == \"unknown\":\n return {}\n else:\n model = ModelCore(self.model_name)\n if model.exists():\n return model.details()\n else:\n return {}\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.model_type","title":"model_type()
","text":"Return the type of model used in this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def model_type(self) -> str:\n \"\"\"Return the type of model used in this Endpoint\"\"\"\n return self.details().get(\"model_type\", \"unknown\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/endpoint_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard","title":"onboard(interactive=False)
","text":"This is a BLOCKING method that will onboard the Endpoint (make it ready) Args: interactive (bool, optional): If True, will prompt the user for information. (default: False) Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.error(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
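A usage sketch; in an interactive session `onboard()` can prompt for the input model, otherwise supply it directly via `onboard_with_args()` (names below are placeholders):

```python
end = EndpointCore("abalone-regression-end")   # placeholder endpoint name

# Interactive: prompts for the input model if it isn't already set
end.onboard(interactive=True)

# Non-interactive alternative
end.onboard_with_args(input_model="abalone-regression")   # placeholder model name
```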
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard_with_args","title":"onboard_with_args(input_model)
","text":"Onboard the Endpoint with the given arguments
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input_model` | `str` | The input model for this endpoint | required |

Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=True).get(\n self.endpoint_name\n )\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.regression_metrics","title":"regression_metrics(target_column, prediction_df)
staticmethod
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
@staticmethod\ndef regression_metrics(target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = root_mean_squared_error(y_true, y_pred)\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n
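Since this is a staticmethod it can be exercised without an AWS endpoint; a tiny self-contained sketch with made-up numbers:

```python
import pandas as pd
from sageworks.core.artifacts.endpoint_core import EndpointCore

# Toy prediction results: a target column plus a "prediction" column
toy_df = pd.DataFrame({"target": [1.0, 2.0, 3.0, 4.0], "prediction": [1.1, 1.9, 3.2, 3.8]})

metrics = EndpointCore.regression_metrics("target", toy_df)
print(metrics)   # one row with MAE, RMSE, R2, MAPE, MedAE, NumRows
```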
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.residuals","title":"residuals(target_column, prediction_df)
","text":"Add the residuals to the prediction DataFrame Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'
Source code in src/sageworks/core/artifacts/endpoint_core.py
def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n # Sanity Check that this is a regression model\n if self.model_type() not in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n self.log.warning(\"Residuals are only computed for regression models\")\n return prediction_df\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Add the residuals and the absolute values to the DataFrame\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n return prediction_df\n
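A usage sketch for a regression endpoint (note that `inference()` already calls this for regressors); the target column name below is a placeholder:

```python
pred_df = end.inference(eval_df)              # eval_df: your evaluation DataFrame
pred_df = end.residuals("target", pred_df)    # adds 'residuals' and 'residuals_abs' columns
print(pred_df[["residuals", "residuals_abs"]].describe())
```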
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input` | `str` | Name of input for this artifact | required |
| `force` | `bool` | Force the input to be set. Defaults to False. | `False` |

Note: Manual override of the input is normally not allowed for Endpoints; passing `force=True` overrides this but breaks automatic provenance of the artifact.
Source code in src/sageworks/core/artifacts/endpoint_core.py
def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. Defaults to False.\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
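A usage sketch; without `force=True` the call is a no-op that only logs a warning, and forcing it breaks automatic provenance (names below are placeholders):

```python
end = EndpointCore("abalone-regression-end")       # placeholder endpoint name

end.set_input("abalone-regression")                # ignored: logs a warning and returns
end.set_input("abalone-regression", force=True)    # actually updates sageworks_input (use with care)
```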
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/endpoint_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/feature_set_core/","title":"FeatureSetCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the FeatureSet API Class and voil\u00e0 it works the same.
FeatureSet: SageWorks Feature Set accessible through Athena
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore","title":"FeatureSetCore
","text":" Bases: Artifact
FeatureSetCore: SageWorks FeatureSetCore Class
Common Usage:
my_features = FeatureSetCore(feature_uuid)\nmy_features.summary()\nmy_features.details()\n
Source code in src/sageworks/core/artifacts/feature_set_core.py
class FeatureSetCore(Artifact):\n \"\"\"FeatureSetCore: SageWorks FeatureSetCore Class\n\n Common Usage:\n ```\n my_features = FeatureSetCore(feature_uuid)\n my_features.summary()\n my_features.details()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, force_refresh: bool = False):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n force_refresh (bool): Force a refresh of the Feature Set metadata (default: False)\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.ensure_valid_name(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid)\n\n # Setup our AWS Broker catalog metadata\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=force_refresh)\n self.feature_meta = _catalog_meta.get(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.important(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.record_id = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_database = self.feature_meta[\"sageworks_meta\"].get(\"athena_database\")\n self.athena_table = self.feature_meta[\"sageworks_meta\"].get(\"athena_table\")\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}\")\n\n def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n\n def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n\n def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n\n def column_names(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of 
the Feature Set\"\"\"\n return list(self.column_details().values())\n\n def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Args:\n view (str): The view to get column details for (default: \"all\")\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't call just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details(view)\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n # Not going to use these for now\n \"\"\"\n internal = {\n \"write_time\": \"Timestamp\",\n \"api_invocation_time\": \"Timestamp\",\n \"is_deleted\": \"Boolean\",\n }\n details.update(internal)\n return details\n \"\"\"\n\n def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this FeatureSet\n\n Returns:\n list[str]: The display columns for this FeatureSet\n\n Notes:\n This just pulls the display columns from the underlying DataSource\n \"\"\"\n return self.data_source.get_display_columns()\n\n def set_display_columns(self, display_columns: list[str]):\n \"\"\"Set the display columns for this FeatureSet\n\n Args:\n display_columns (list[str]): The display columns for this FeatureSet\n\n Notes:\n This just sets the display columns for the underlying DataSource\n \"\"\"\n self.data_source.set_display_columns(display_columns)\n self.onboard()\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.column_names())\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n\n def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n return self.data_source.details().get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n\n def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n\n def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. 
This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n\n def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Get the training data query\n query = self.get_training_data_query()\n\n # Make the query\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n\n def get_training_data_query(self) -> str:\n \"\"\"Get the training data query for this FeatureSet\n\n Returns:\n str: The training data query for this FeatureSet\n \"\"\"\n\n # Do we have a training view?\n training_view = self.get_training_view_table()\n if training_view:\n self.log.important(f\"Pulling Data from Training View {training_view}...\")\n table_name = training_view\n else:\n self.log.warning(f\"No Training View found for {self.uuid}, using FeatureSet directly...\")\n table_name = self.athena_table\n\n # Make a query that gets all the data from the FeatureSet\n return f\"SELECT * FROM {table_name}\"\n\n def get_training_data(self, limit=50000) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Args:\n limit (int): The number of rows to limit the query to (default: 1000)\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n\n # Get the training data query (put a limit on it for now)\n query = self.get_training_data_query() + f\" LIMIT {limit}\"\n\n # Make the query\n return self.query(query)\n\n def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.record_id} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of 
details about this FeatureSet\n \"\"\"\n\n # Check if we have cached version of the FeatureSet Details\n storage_key = f\"feature_set:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n\n # Delete the Feature Group and ensure that it gets deleted\n self.log.important(f\"Deleting FeatureSet {self.uuid}...\")\n remove_fg = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session)\n remove_fg.delete()\n self.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n self.data_source.delete()\n\n # Delete the training view\n self.delete_training_view()\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = self.feature_sets_s3_path + f\"/{self.uuid}/\"\n self.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"feature_set:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Force a refresh of the AWS Metadata (to make sure references to deleted artifacts are gone)\n self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=True)\n\n def ensure_feature_group_deleted(self, feature_group):\n status = \"Deleting\"\n while status == \"Deleting\":\n self.log.debug(\"FeatureSet being Deleted...\")\n try:\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n break\n else:\n raise error\n time.sleep(1)\n self.log.info(f\"FeatureSet {feature_group.name} successfully deleted\")\n\n def create_default_training_view(self):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating default Training View {view_name}...\")\n\n # Do we already have a training 
column?\n if \"training\" in self.column_names():\n create_view_query = f\"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {self.athena_table}\"\n else:\n # No training column, so create one:\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n # using self.record_id as the stable identifier for row numbering\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {self.record_id}), 10) < 8 THEN 1 -- Assign 80% to training\n ELSE 0 -- Assign roughly 20% to validation\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n\n def create_training_view(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Create a view in Athena that marks hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating Training View {view_name}...\")\n\n # Format the list of hold out ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN 0\n ELSE 1\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n\n def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n self.create_training_view(id_column, holdout_ids)\n\n def get_holdout_ids(self, id_column: str) -> list[str]:\n \"\"\"Get the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n\n Returns:\n list[str]: The list of hold out ids.\n \"\"\"\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n query = f\"SELECT {id_column} FROM {training_view_table} WHERE training = 0\"\n holdout_ids = self.query(query)[id_column].tolist()\n return holdout_ids\n else:\n return []\n\n def get_training_view_table(self, create: bool = True) -> Union[str, None]:\n \"\"\"Get the name of the training view for this FeatureSet\n Args:\n create (bool): Create the training view if it doesn't exist (default=True)\n Returns:\n str: The name of the training view for this FeatureSet\n \"\"\"\n training_view_name = f\"{self.athena_table}_training\"\n glue_client = self.boto_session.client(\"glue\")\n try:\n glue_client.get_table(DatabaseName=self.athena_database, Name=training_view_name)\n return training_view_name\n except glue_client.exceptions.EntityNotFoundException:\n if not create:\n return None\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, creating one...\")\n self.create_default_training_view()\n time.sleep(1) # Give AWS a second to catch up\n return training_view_name\n\n def delete_training_view(self):\n \"\"\"Delete the training view for this FeatureSet\"\"\"\n try:\n training_view_table = 
self.get_training_view_table(create=False)\n if training_view_table is not None:\n self.log.info(f\"Deleting Training View {training_view_table} for {self.uuid}\")\n glue_client = self.boto_session.client(\"glue\")\n glue_client.delete_table(DatabaseName=self.athena_database, Name=training_view_table)\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, nothing to delete...\")\n pass\n else:\n raise error\n\n def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample()\n\n def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n\n def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n\n def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): 
Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n\n def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.__init__","title":"__init__(feature_set_uuid, force_refresh=False)
","text":"FeatureSetCore Initialization
Parameters:
Name Type Description Defaultfeature_set_uuid
str
Name of Feature Set
requiredforce_refresh
bool
Force a refresh of the Feature Set metadata (default: False)
False
Source code in src/sageworks/core/artifacts/feature_set_core.py
def __init__(self, feature_set_uuid: str, force_refresh: bool = False):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n force_refresh (bool): Force a refresh of the Feature Set metadata (default: False)\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.ensure_valid_name(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid)\n\n # Setup our AWS Broker catalog metadata\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=force_refresh)\n self.feature_meta = _catalog_meta.get(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.important(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.record_id = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_database = self.feature_meta[\"sageworks_meta\"].get(\"athena_database\")\n self.athena_table = self.feature_meta[\"sageworks_meta\"].get(\"athena_table\")\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}\")\n
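A minimal construction sketch (the FeatureSet name \"abalone_features\" is a hypothetical placeholder, not something this page creates): from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\n# Load an existing FeatureSet by name; force_refresh pulls fresh AWS Broker metadata\nfs = FeatureSetCore(\"abalone_features\", force_refresh=True)\nprint(fs.exists())  # True if the Feature Group was found in the AWS metadata\n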
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.anomalies","title":"anomalies()
","text":"Get a set of anomalous data from the underlying DataSource Returns: pd.DataFrame: A dataframe of anomalies from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying the underlying data source
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n return self.data_source.details().get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_details","title":"column_details(view='all')
","text":"Return the column details of the Feature Set
Parameters:
Name Type Description Defaultview
str
The view to get column details for (default: \"all\")
'all'
Returns:
Name Type Descriptiondict
dict
The column details of the Feature Set
NotesWe can't just call self.data_source.column_details() because FeatureSets have different types, so we need to overlay that type information on top of the DataSource type information
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Args:\n view (str): The view to get column details for (default: \"all\")\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't call just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details(view)\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n # Not going to use these for now\n \"\"\"\n internal = {\n \"write_time\": \"Timestamp\",\n \"api_invocation_time\": \"Timestamp\",\n \"is_deleted\": \"Boolean\",\n }\n details.update(internal)\n return details\n \"\"\"\n
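A small sketch of the type overlay (the FeatureSet name is hypothetical): the FeatureGroup types (e.g. Integral, Fractional, String) replace the raw DataSource types wherever a column appears in both. from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\nfor col, dtype in fs.column_details().items():\n    print(col, dtype)  # dtype comes from the FeatureGroup definition when available\n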
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_names","title":"column_names()
","text":"Return the column names of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_names(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in the FeatureSets underlying DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive_stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n
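A sketch of reading the per-column stats (FeatureSet and column names are hypothetical): from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\nstats = fs.column_stats()\n# Each entry carries the DataSource dtype plus the overlaid FeatureSet dtype\nprint(stats[\"length\"][\"dtype\"], stats[\"length\"][\"fs_dtype\"])  # hypothetical column\n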
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_types","title":"column_types()
","text":"Return the column types of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_types(self) -> list[str]:\n \"\"\"Return the column types of the Feature Set\"\"\"\n return list(self.column_details().values())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.correlations","title":"correlations(recompute=False)
","text":"Get the correlations for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the value counts (default=False) Returns: dict: A dictionary of correlations for the numeric columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_default_training_view","title":"create_default_training_view()
","text":"Create a default view in Athena that assigns roughly 80% of the data to training
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_default_training_view(self):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating default Training View {view_name}...\")\n\n # Do we already have a training column?\n if \"training\" in self.column_names():\n create_view_query = f\"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {self.athena_table}\"\n else:\n # No training column, so create one:\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n # using self.record_id as the stable identifier for row numbering\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {self.record_id}), 10) < 8 THEN 1 -- Assign 80% to training\n ELSE 0 -- Assign roughly 20% to validation\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_s3_training_data","title":"create_s3_training_data()
","text":"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want additional options/features use the get_feature_store() method and see AWS docs for all the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html Returns: str: The full path/file for the CSV file created by Feature Store create_dataset()
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Get the training data query\n query = self.get_training_data_query()\n\n # Make the query\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n
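A usage sketch (the FeatureSet name is hypothetical); the call runs the training-data query through the FeatureGroup's Athena interface and returns the S3 path of the resulting CSV: from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\ncsv_s3_path = fs.create_s3_training_data()\nprint(f\"Training CSV written to {csv_s3_path}\")\n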
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_training_view","title":"create_training_view(id_column, holdout_ids)
","text":"Create a view in Athena that marks hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredholdout_ids
list[str]
The list of hold out ids.
required Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_training_view(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Create a view in Athena that marks hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating Training View {view_name}...\")\n\n # Format the list of hold out ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN 0\n ELSE 1\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n
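A sketch of marking hold out ids (the id column and id values are hypothetical): rows whose id is in the list get training = 0, everything else gets training = 1. from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\nfs.create_training_view(\"id\", [\"id_1\", \"id_2\", \"id_3\"])  # hypothetical hold out ids\nprint(fs.get_holdout_ids(\"id\"))\n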
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete","title":"delete()
","text":"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def delete(self):\n \"\"\"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n\n # Delete the Feature Group and ensure that it gets deleted\n self.log.important(f\"Deleting FeatureSet {self.uuid}...\")\n remove_fg = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session)\n remove_fg.delete()\n self.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n self.data_source.delete()\n\n # Delete the training view\n self.delete_training_view()\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = self.feature_sets_s3_path + f\"/{self.uuid}/\"\n self.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"feature_set:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Force a refresh of the AWS Metadata (to make sure references to deleted artifacts are gone)\n self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=True)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete_training_view","title":"delete_training_view()
","text":"Delete the training view for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def delete_training_view(self):\n \"\"\"Delete the training view for this FeatureSet\"\"\"\n try:\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n self.log.info(f\"Deleting Training View {training_view_table} for {self.uuid}\")\n glue_client = self.boto_session.client(\"glue\")\n glue_client.delete_table(DatabaseName=self.athena_database, Name=training_view_table)\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, nothing to delete...\")\n pass\n else:\n raise error\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Get the descriptive stats for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the descriptive stats (default=False) Returns: dict: A dictionary of descriptive stats for the numeric columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.details","title":"details(recompute=False)
","text":"Additional Details about this FeatureSet Artifact
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the details (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of details about this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n # Check if we have cached version of the FeatureSet Details\n storage_key = f\"feature_set:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n
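A sketch of the caching behavior (FeatureSet name is hypothetical): from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\ninfo = fs.details()                  # served from the data_storage cache when present\nfresh = fs.details(recompute=True)   # bypasses the cache and re-stores the result\nprint(fresh[\"num_rows\"], fresh[\"num_columns\"])\n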
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_data_source","title":"get_data_source()
","text":"Return the underlying DataSource object
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_display_columns","title":"get_display_columns()
","text":"Get the display columns for this FeatureSet
Returns:
Type Descriptionlist[str]
list[str]: The display columns for this FeatureSet
NotesThis just pulls the display columns from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this FeatureSet\n\n Returns:\n list[str]: The display columns for this FeatureSet\n\n Notes:\n This just pulls the display columns from the underlying DataSource\n \"\"\"\n return self.data_source.get_display_columns()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_feature_store","title":"get_feature_store()
","text":"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage with create_dataset() such as Joins and time ranges and a host of other options See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n
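A heavily hedged sketch of dropping down to the SageMaker FeatureStore API for a custom dataset (the FeatureSet name and the output S3 path are placeholders; see the AWS docs linked above for the full create_dataset() options): from sagemaker.feature_store.feature_group import FeatureGroup\nfrom sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\nstore = fs.get_feature_store()\nfg = FeatureGroup(name=fs.uuid, sagemaker_session=fs.sm_session)\n# Returns a DatasetBuilder that supports joins, time ranges, etc. (see the AWS docs)\nbuilder = store.create_dataset(base=fg, output_path=\"s3://my-bucket/datasets\")  # placeholder path\n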
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_holdout_ids","title":"get_holdout_ids(id_column)
","text":"Get the hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredReturns:
Type Descriptionlist[str]
list[str]: The list of hold out ids.
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_holdout_ids(self, id_column: str) -> list[str]:\n \"\"\"Get the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n\n Returns:\n list[str]: The list of hold out ids.\n \"\"\"\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n query = f\"SELECT {id_column} FROM {training_view_table} WHERE training = 0\"\n holdout_ids = self.query(query)[id_column].tolist()\n return holdout_ids\n else:\n return []\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data","title":"get_training_data(limit=50000)
","text":"Get the training data for this FeatureSet
Parameters:
Name Type Description Defaultlimit
int
The number of rows to limit the query to (default: 50000)
50000
Returns: pd.DataFrame: The training data for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_data(self, limit=50000) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Args:\n limit (int): The number of rows to limit the query to (default: 1000)\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n\n # Get the training data query (put a limit on it for now)\n query = self.get_training_data_query() + f\" LIMIT {limit}\"\n\n # Make the query\n return self.query(query)\n
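A quick sketch (the FeatureSet name and limit are illustrative); when a training view exists, the returned DataFrame includes the training column: from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\ndf = fs.get_training_data(limit=1000)\nprint(df.shape)\nprint(df[\"training\"].value_counts())  # assumes the training view added a 'training' column\n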
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data_query","title":"get_training_data_query()
","text":"Get the training data query for this FeatureSet
Returns:
Name Type Descriptionstr
str
The training data query for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_data_query(self) -> str:\n \"\"\"Get the training data query for this FeatureSet\n\n Returns:\n str: The training data query for this FeatureSet\n \"\"\"\n\n # Do we have a training view?\n training_view = self.get_training_view_table()\n if training_view:\n self.log.important(f\"Pulling Data from Training View {training_view}...\")\n table_name = training_view\n else:\n self.log.warning(f\"No Training View found for {self.uuid}, using FeatureSet directly...\")\n table_name = self.athena_table\n\n # Make a query that gets all the data from the FeatureSet\n return f\"SELECT * FROM {table_name}\"\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_view_table","title":"get_training_view_table(create=True)
","text":"Get the name of the training view for this FeatureSet Args: create (bool): Create the training view if it doesn't exist (default=True) Returns: str: The name of the training view for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_view_table(self, create: bool = True) -> Union[str, None]:\n \"\"\"Get the name of the training view for this FeatureSet\n Args:\n create (bool): Create the training view if it doesn't exist (default=True)\n Returns:\n str: The name of the training view for this FeatureSet\n \"\"\"\n training_view_name = f\"{self.athena_table}_training\"\n glue_client = self.boto_session.client(\"glue\")\n try:\n glue_client.get_table(DatabaseName=self.athena_database, Name=training_view_name)\n return training_view_name\n except glue_client.exceptions.EntityNotFoundException:\n if not create:\n return None\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, creating one...\")\n self.create_default_training_view()\n time.sleep(1) # Give AWS a second to catch up\n return training_view_name\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.health_check","title":"health_check()
","text":"Perform a health check on this model
Returns:
Type Descriptionlist[str]
list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_columns","title":"num_columns()
","text":"Return the number of columns of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.column_names())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_rows","title":"num_rows()
","text":"Return the number of rows of the internal DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the FeatureSet (make it ready)
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n
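A sketch of adjusting the IQR multiplier (values are illustrative): from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\ndefault_outliers = fs.outliers()          # IQR * 1.5\nstrict_outliers = fs.outliers(scale=2.0)  # wider fences, fewer rows flagged\nprint(len(default_outliers), len(strict_outliers))\n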
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.query","title":"query(query, overwrite=True)
","text":"Query the internal DataSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the DataSource
requiredoverwrite
bool
Overwrite the table name in the query (default: True)
True
Returns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n
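A sketch of the table-name overwrite (FeatureSet name is hypothetical): the query can be written against the FeatureSet UUID and query() swaps in the underlying Athena table (note the replacement matches the UUID with a space on each side). from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\ndf = fs.query(f\"SELECT COUNT(*) AS n FROM {fs.uuid} LIMIT 1\")\nprint(df)\n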
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.ready","title":"ready()
","text":"Is the FeatureSet ready? Is initial setup complete and expected metadata populated? Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to check both to see if the FeatureSet is ready.
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.refresh_meta","title":"refresh_meta()
","text":"Internal: Refresh our internal AWS Feature Store metadata
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.sample","title":"sample(recompute=False)
","text":"Get a sample of the data from the underlying DataSource Args: recompute (bool): Recompute the sample (default=False) Returns: pd.DataFrame: A sample of the data from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_display_columns","title":"set_display_columns(display_columns)
","text":"Set the display columns for this FeatureSet
Parameters:
Name Type Description Defaultdisplay_columns
list[str]
The display columns for this FeatureSet
required NotesThis just sets the display columns for the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def set_display_columns(self, display_columns: list[str]):\n \"\"\"Set the display columns for this FeatureSet\n\n Args:\n display_columns (list[str]): The display columns for this FeatureSet\n\n Notes:\n This just sets the display columns for the underlying DataSource\n \"\"\"\n self.data_source.set_display_columns(display_columns)\n self.onboard()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_holdout_ids","title":"set_holdout_ids(id_column, holdout_ids)
","text":"Set the hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredholdout_ids
list[str]
The list of hold out ids.
required Source code insrc/sageworks/core/artifacts/feature_set_core.py
def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n self.create_training_view(id_column, holdout_ids)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.size","title":"size()
","text":"Return the size of the internal DataSource in MegaBytes
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.smart_sample","title":"smart_sample()
","text":"Get a SMART sample dataframe from this FeatureSet Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.snapshot_query","title":"snapshot_query(table_name=None)
","text":"An Athena query to get the latest snapshot of features
Parameters:
Name Type Description Defaulttable_name
str
The name of the table to query (default: None)
None
Returns:
Name Type Descriptionstr
str
The Athena query to get the latest snapshot of features
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.record_id} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n
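A sketch of generating the snapshot query (the table name argument is hypothetical; here we just pass the FeatureSet's own offline-store table): from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"abalone_features\")  # hypothetical name\nsql = fs.snapshot_query(table_name=fs.athena_table)\nprint(sql)  # latest row per record_id, with the FeatureGroup metadata columns filtered out\n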
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.value_counts","title":"value_counts(recompute=False)
","text":"Get the value counts for the string columns of the underlying DataSource Args: recompute (bool): Recompute the value counts (default=False) Returns: dict: A dictionary of value counts for the string columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n
"},{"location":"core_classes/artifacts/model_core/","title":"ModelCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Model API Class and voil\u00e0 it works the same.
ModelCore: SageWorks ModelCore Class
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore","title":"ModelCore
","text":" Bases: Artifact
ModelCore: SageWorks ModelCore Class
Common Usagemy_model = ModelCore(model_uuid)\nmy_model.summary()\nmy_model.details()\n
Source code in src/sageworks/core/artifacts/model_core.py
class ModelCore(Artifact):\n \"\"\"ModelCore: SageWorks ModelCore Class\n\n Common Usage:\n ```\n my_model = ModelCore(model_uuid)\n my_model.summary()\n my_model.details()\n ```\n \"\"\"\n\n def __init__(\n self, model_uuid: str, force_refresh: bool = False, model_type: ModelType = None, legacy: bool = False\n ):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the model name is valid\n if not legacy:\n self.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(model_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Models\n self.model_name = model_uuid\n aws_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=force_refresh)\n self.model_meta = aws_meta.get(self.model_name)\n if self.model_meta is None:\n self.log.important(f\"Could not find model {self.model_name} within current visibility scope\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n else:\n try:\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. Delete and recreate it!\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=True).get(self.model_name)\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n\n def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.debug(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n if self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n return health_issues\n\n def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker 
Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.model_image()\n )\n\n def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference run UUIDs\n \"\"\"\n if self.endpoint_inference_path is None:\n return [\"model_training\"] # Just the training run\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the training to the front of the list\n inference_runs.insert(0, \"model_training\")\n return inference_runs\n\n def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"training_holdout\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.sageworks_meta().get(\"sageworks_inference_cm\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n 
return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.latest_model\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n\n def group_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.latest_model[\"ModelPackageGroupArn\"]\n\n def model_package_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)\"\"\"\n return self.latest_model[\"ModelPackageArn\"]\n\n def model_container_info(self) -> dict:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"ModelPackageDetails\"][\"InferenceSpecification\"][\"Containers\"][0]\n\n def model_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.model_container_info()[\"Image\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.latest_model[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.latest_model[\"CreationTime\"]\n\n def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n\n def get_endpoint_inference_path(self) -> str:\n \"\"\"Get the S3 Path for the Inference Data\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n return newest_files(endpoint_inference_paths, self.sm_session)\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n\n def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n 
self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n\n def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n\n def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n\n def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n\n def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n\n def set_class_labels(self, labels: list[str]):\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n\n def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n\n # Check if we have cached version of the Model Details\n storage_key = f\"model:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(\"Recomputing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n aws_meta = self.aws_meta()\n details[\"description\"] = aws_meta.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = aws_meta[\"ModelPackageVersion\"]\n details[\"status\"] = aws_meta[\"ModelPackageStatus\"]\n details[\"approval_status\"] = aws_meta[\"ModelApprovalStatus\"]\n details[\"image\"] = self.model_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n package_details = aws_meta[\"ModelPackageDetails\"]\n inference_spec = package_details[\"InferenceSpecification\"]\n container_info = self.model_container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = 
self.confusion_matrix()\n details[\"predictions\"] = None\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details\n return details\n\n # Pipeline for this model\n def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n\n def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n\n def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n\n def _determine_model_type(self):\n \"\"\"Internal: Determine the Model Type\"\"\"\n model_type = input(\"Model Type? (classifier, regressor, quantile_regressor, unsupervised, transformer): \")\n if model_type == \"classifier\":\n self._set_model_type(ModelType.CLASSIFIER)\n elif model_type == \"regressor\":\n self._set_model_type(ModelType.REGRESSOR)\n elif model_type == \"quantile_regressor\":\n self._set_model_type(ModelType.QUANTILE_REGRESSOR)\n elif model_type == \"unsupervised\":\n self._set_model_type(ModelType.UNSUPERVISED)\n elif model_type == \"transformer\":\n self._set_model_type(ModelType.TRANSFORMER)\n else:\n self.log.warning(f\"Unknown Model Type {model_type}!\")\n self._set_model_type(ModelType.UNKNOWN)\n\n def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? 
(use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n\n def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n ) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n self._load_inference_cm()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n\n # If we don't have meta then the model probably doesn't exist\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} doesn't appear to exist...\")\n return\n\n # First delete the Model Packages within the Model Group\n for model in self.model_meta:\n self.log.info(f\"Deleting Model Package {model['ModelPackageArn']}...\")\n self.sm_client.delete_model_package(ModelPackageName=model[\"ModelPackageArn\"])\n\n # Delete the Model Package Group\n self.log.info(f\"Deleting Model Group {self.model_name}...\")\n self.sm_client.delete_model_package_group(ModelPackageGroupName=self.model_name)\n\n # Delete any training artifacts\n s3_delete_path = f\"{self.model_training_path}/\"\n self.log.info(f\"Deleting Training S3 Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"model:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n\n def _set_model_type(self, model_type: ModelType):\n \"\"\"Internal: Set the Model Type for this Model\"\"\"\n self.model_type = model_type\n self.upsert_sageworks_meta({\"sageworks_model_type\": self.model_type.value})\n self.remove_health_tag(\"model_type_unknown\")\n\n def _get_model_type(self) -> ModelType:\n \"\"\"Internal: Query the SageWorks Metadata to get the model type\n Returns:\n ModelType: The ModelType of this Model\n Notes:\n This is an internal method that should not be called directly\n Use the model_type 
attribute instead\n \"\"\"\n model_type = self.sageworks_meta().get(\"sageworks_model_type\")\n try:\n return ModelType(model_type)\n except ValueError:\n self.log.warning(f\"Could not determine model type for {self.model_name}!\")\n return ModelType.UNKNOWN\n\n def _load_training_metrics(self):\n \"\"\"Internal: Retrieve the training metrics and Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Notes:\n This may or may not exist based on whether we have access to TrainingJobAnalytics\n \"\"\"\n try:\n df = TrainingJobAnalytics(training_job_name=self.training_job_name).dataframe()\n if df.empty:\n self.log.warning(f\"No training job metrics found for {self.training_job_name}\")\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n if \"timestamp\" in df.columns:\n df = df.drop(columns=[\"timestamp\"])\n\n # We're going to pivot the DataFrame to get the desired structure\n reg_metrics_df = df.set_index(\"metric_name\").T\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": reg_metrics_df.to_dict(), \"sageworks_training_cm\": None}\n )\n return\n\n except (KeyError, botocore.exceptions.ClientError):\n self.log.warning(f\"No training job metrics found for {self.training_job_name}\")\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n\n # We need additional processing for classification metrics\n if self.model_type == ModelType.CLASSIFIER:\n metrics_df, cm_df = self._process_classification_metrics(df)\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": metrics_df.to_dict(), \"sageworks_training_cm\": cm_df.to_dict()}\n )\n\n def _load_inference_metrics(self, capture_uuid: str = \"training_holdout\"):\n \"\"\"Internal: Retrieve the inference model metrics for this model\n and load the data into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n inference_metrics = pull_s3_data(s3_path)\n\n # Store data into the SageWorks Metadata\n metrics_storage = None if inference_metrics is None else inference_metrics.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_metrics\": metrics_storage})\n\n def _load_inference_cm(self, capture_uuid: str = \"training_holdout\"):\n \"\"\"Internal: Pull the inference Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n pd.DataFrame: DataFrame of the inference Confusion Matrix (might be None)\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n inference_cm = pull_s3_data(s3_path, embedded_index=True)\n\n # Store data into the SageWorks Metadata\n cm_storage = None if inference_cm is None else inference_cm.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_cm\": cm_storage})\n\n def 
get_inference_metadata(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n\n def get_inference_predictions(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n\n def _get_validation_predictions(self) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Retrieve the captured prediction results for this model\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Validation Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing Validation Predictions for {self.model_name}...\")\n s3_path = f\"{self.model_training_path}/validation_predictions.csv\"\n df = pull_s3_data(s3_path)\n return df\n\n def _extract_training_job_name(self) -> Union[str, None]:\n \"\"\"Internal: Extract the training job name from the ModelDataUrl\"\"\"\n try:\n model_data_url = self.model_container_info()[\"ModelDataUrl\"]\n parsed_url = urllib.parse.urlparse(model_data_url)\n training_job_name = parsed_url.path.lstrip(\"/\").split(\"/\")[0]\n return training_job_name\n except KeyError:\n self.log.warning(f\"Could not extract training job name from {model_data_url}\")\n return None\n\n @staticmethod\n def _process_classification_metrics(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"Internal: Process classification metrics into a more reasonable format\n Args:\n df (pd.DataFrame): DataFrame of training metrics\n Returns:\n (pd.DataFrame, pd.DataFrame): Tuple of DataFrames. 
Metrics and confusion matrix\n \"\"\"\n # Split into two DataFrames based on 'metric_name'\n metrics_df = df[df[\"metric_name\"].str.startswith(\"Metrics:\")].copy()\n cm_df = df[df[\"metric_name\"].str.startswith(\"ConfusionMatrix:\")].copy()\n\n # Split the 'metric_name' into different parts\n metrics_df[\"class\"] = metrics_df[\"metric_name\"].str.split(\":\").str[1]\n metrics_df[\"metric_type\"] = metrics_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to get the desired structure\n metrics_df = metrics_df.pivot(index=\"class\", columns=\"metric_type\", values=\"value\").reset_index()\n metrics_df = metrics_df.rename_axis(None, axis=1)\n\n # Now process the confusion matrix\n cm_df[\"row_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[1]\n cm_df[\"col_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to create a form suitable for the heatmap\n cm_df = cm_df.pivot(index=\"row_class\", columns=\"col_class\", values=\"value\")\n\n # Convert the values in cm_df to integers\n cm_df = cm_df.astype(int)\n\n return metrics_df, cm_df\n\n def shapley_values(self, capture_uuid: str = \"training_holdout\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapely values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: Dataframe of the shapley values for the prediction dataframe\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.__init__","title":"__init__(model_uuid, force_refresh=False, model_type=None, legacy=False)
","text":"ModelCore Initialization Args: model_uuid (str): Name of Model in SageWorks. force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False. model_type (ModelType, optional): Set this for newly created Models. Defaults to None. legacy (bool, optional): Force load of legacy models. Defaults to False.
Source code insrc/sageworks/core/artifacts/model_core.py
def __init__(\n self, model_uuid: str, force_refresh: bool = False, model_type: ModelType = None, legacy: bool = False\n):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the model name is valid\n if not legacy:\n self.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(model_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Models\n self.model_name = model_uuid\n aws_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=force_refresh)\n self.model_meta = aws_meta.get(self.model_name)\n if self.model_meta is None:\n self.log.important(f\"Could not find model {self.model_name} within current visibility scope\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n else:\n try:\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. Delete and recreate it!\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n
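A minimal usage sketch (the import path is assumed from the source file location above; the model name \"abalone-regression\" is hypothetical):\nfrom sageworks.core.artifacts.model_core import ModelCore\n\n# Load an existing model by name, forcing a refresh of the AWS Broker metadata\nmodel = ModelCore(\"abalone-regression\", force_refresh=True)\nprint(model.exists())\n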
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code insrc/sageworks/core/artifacts/model_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/model_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.latest_model\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code insrc/sageworks/core/artifacts/model_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.class_labels","title":"class_labels()
","text":"Return the class labels for this Model (if it's a classifier)
Returns:
Type DescriptionUnion[list[str], None]
list[str]: List of class labels
Source code insrc/sageworks/core/artifacts/model_core.py
def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion_matrix for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid or \"model_training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Confusion Matrix (might be None)
Source code insrc/sageworks/core/artifacts/model_core.py
def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.sageworks_meta().get(\"sageworks_inference_cm\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n
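Usage sketch, assuming a classifier model named \"wine-classification\" (hypothetical) has already been onboarded:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"wine-classification\")\n\n# Latest confusion matrix: inference capture if available, otherwise training\ncm_df = model.confusion_matrix()\nif cm_df is not None:\n print(cm_df)\n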
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/model_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete","title":"delete()
","text":"Delete the Model Packages and the Model Group
Source code insrc/sageworks/core/artifacts/model_core.py
def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n\n # If we don't have meta then the model probably doesn't exist\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} doesn't appear to exist...\")\n return\n\n # First delete the Model Packages within the Model Group\n for model in self.model_meta:\n self.log.info(f\"Deleting Model Package {model['ModelPackageArn']}...\")\n self.sm_client.delete_model_package(ModelPackageName=model[\"ModelPackageArn\"])\n\n # Delete the Model Package Group\n self.log.info(f\"Deleting Model Group {self.model_name}...\")\n self.sm_client.delete_model_package_group(ModelPackageGroupName=self.model_name)\n\n # Delete any training artifacts\n s3_delete_path = f\"{self.model_training_path}/\"\n self.log.info(f\"Deleting Training S3 Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"model:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.details","title":"details(recompute=False)
","text":"Additional Details about this Model Args: recompute (bool, optional): Recompute the details (default: False) Returns: dict: Dictionary of details about this Model
Source code insrc/sageworks/core/artifacts/model_core.py
def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n\n # Check if we have cached version of the Model Details\n storage_key = f\"model:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(\"Recomputing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n aws_meta = self.aws_meta()\n details[\"description\"] = aws_meta.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = aws_meta[\"ModelPackageVersion\"]\n details[\"status\"] = aws_meta[\"ModelPackageStatus\"]\n details[\"approval_status\"] = aws_meta[\"ModelApprovalStatus\"]\n details[\"image\"] = self.model_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n package_details = aws_meta[\"ModelPackageDetails\"]\n inference_spec = package_details[\"InferenceSpecification\"]\n container_info = self.model_container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details\n return details\n
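Usage sketch (hypothetical model name); passing recompute=True bypasses the cached copy in data_storage:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\ndetails = model.details(recompute=True)\nprint(details[\"model_type\"], details[\"status\"], details[\"approval_status\"])\n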
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.endpoints","title":"endpoints()
","text":"Get the list of registered endpoints for this Model
Returns:
Type Descriptionlist[str]
list[str]: List of registered endpoints
Source code insrc/sageworks/core/artifacts/model_core.py
def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.exists","title":"exists()
","text":"Does the model metadata exist in the AWS Metadata?
Source code insrc/sageworks/core/artifacts/model_core.py
def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.debug(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Model when it's ready Returns: list[str]: List of expected metadata keys
Source code insrc/sageworks/core/artifacts/model_core.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.features","title":"features()
","text":"Return a list of features used for this Model
Returns:
Type DescriptionUnion[list[str], None]
list[str]: List of features used for this Model
Source code insrc/sageworks/core/artifacts/model_core.py
def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Get the S3 Path for the Inference Data
Source code insrc/sageworks/core/artifacts/model_core.py
def get_endpoint_inference_path(self) -> str:\n \"\"\"Get the S3 Path for the Inference Data\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n return newest_files(endpoint_inference_paths, self.sm_session)\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metadata","title":"get_inference_metadata(capture_uuid='training_holdout')
","text":"Retrieve the inference metadata for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
A specific capture_uuid (default: \"training_holdout\")
'training_holdout'
Returns:
Name Type Descriptiondict
Union[DataFrame, None]
Dictionary of the inference metadata (might be None)
Notes: Captures details from when Endpoint inference was run: the name of the dataset, its MD5 hash, etc.
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_metadata(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the inference performance metrics for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid or \"model_training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Model Metrics
Note: If a capture_uuid isn't specified, this falls back to the \"training_holdout\" capture and then to the training metrics
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"training_holdout\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n
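Usage sketch (hypothetical model name); the default \"latest\" falls back from the \"training_holdout\" capture to the training metrics:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\nmetrics_df = model.get_inference_metrics() # or pass a specific capture_uuid\nif metrics_df is not None:\n print(metrics_df)\n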
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_predictions","title":"get_inference_predictions(capture_uuid='training_holdout')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: training_holdout)
'training_holdout'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Captured Predictions (might be None)
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_predictions(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n
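Usage sketch (hypothetical model name), pulling the captured predictions for the default \"training_holdout\" capture:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\npreds_df = model.get_inference_predictions()\nif preds_df is not None:\n print(preds_df.head())\n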
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_pipeline","title":"get_pipeline()
","text":"Get the pipeline for this model
Source code insrc/sageworks/core/artifacts/model_core.py
def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.group_arn","title":"group_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code insrc/sageworks/core/artifacts/model_core.py
def group_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.latest_model[\"ModelPackageGroupArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.health_check","title":"health_check()
","text":"Perform a health check on this model Returns: list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/model_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n if self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n return health_issues\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.is_model_unknown","title":"is_model_unknown()
","text":"Is the Model Type unknown?
Source code insrc/sageworks/core/artifacts/model_core.py
def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.latest_model_object","title":"latest_model_object()
","text":"Return the latest AWS Sagemaker Model object for this SageWorks Model
Returns:
Type DescriptionModel
sagemaker.model.Model: AWS Sagemaker Model object
Source code insrc/sageworks/core/artifacts/model_core.py
def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.model_image()\n )\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.list_inference_runs","title":"list_inference_runs()
","text":"List the inference runs for this model
Returns:
Type Descriptionlist[str]
list[str]: List of inference run UUIDs
Source code insrc/sageworks/core/artifacts/model_core.py
def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference run UUIDs\n \"\"\"\n if self.endpoint_inference_path is None:\n return [\"model_training\"] # Just the training run\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the training to the front of the list\n inference_runs.insert(0, \"model_training\")\n return inference_runs\n
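Usage sketch (hypothetical model name); \"model_training\" is always the first entry in the returned list:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\nfor run_uuid in model.list_inference_runs():\n print(run_uuid)\n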
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_container_info","title":"model_container_info()
","text":"Container Info for the Latest Model Package
Source code insrc/sageworks/core/artifacts/model_core.py
def model_container_info(self) -> dict:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"ModelPackageDetails\"][\"InferenceSpecification\"][\"Containers\"][0]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_image","title":"model_image()
","text":"Container Image for the Latest Model Package
Source code insrc/sageworks/core/artifacts/model_core.py
def model_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.model_container_info()[\"Image\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_package_arn","title":"model_package_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)
Source code insrc/sageworks/core/artifacts/model_core.py
def model_package_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)\"\"\"\n return self.latest_model[\"ModelPackageArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/model_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard","title":"onboard(ask_everything=False)
","text":"This is an interactive method that will onboard the Model (make it ready)
Parameters:
Name Type Description Defaultask_everything
bool
Ask for all the details. Defaults to False.
False
Returns:
Name Type Descriptionbool
bool
True if the Model is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/model_core.py
def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? (use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n
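Usage sketch; onboard() prompts on the console, so run it from an interactive Python session (model name hypothetical):\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\nif model.onboard(ask_everything=True):\n print(\"Model onboarded\")\n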
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard_with_args","title":"onboard_with_args(model_type, target_column=None, feature_list=None, endpoints=None, owner=None)
","text":"Onboard the Model with the given arguments
Parameters:
Name Type Description Defaultmodel_type
ModelType
Model Type
requiredtarget_column
str
Target Column
None
feature_list
list
List of Feature Columns
None
endpoints
list
List of Endpoints. Defaults to None.
None
owner
str
Model Owner. Defaults to None.
None
Returns: bool: True if the Model is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/model_core.py
def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n self._load_inference_cm()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
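Non-interactive usage sketch; the model name, target, feature columns, endpoint, and owner below are hypothetical:\nfrom sageworks.core.artifacts.model_core import ModelCore, ModelType\n\nmodel = ModelCore(\"abalone-regression\")\nmodel.onboard_with_args(\n model_type=ModelType.REGRESSOR,\n target_column=\"class_number_of_rings\",\n feature_list=[\"length\", \"diameter\", \"height\", \"whole_weight\"],\n endpoints=[\"abalone-regression-end\"],\n owner=\"data-science-team\",\n)\n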
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/model_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=True).get(self.model_name)\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.register_endpoint","title":"register_endpoint(endpoint_name)
","text":"Add this endpoint to the set of registered endpoints for the model
Parameters:
Name Type Description Defaultendpoint_name
str
Name of the endpoint
required Source code insrc/sageworks/core/artifacts/model_core.py
def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n
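Usage sketch; the model and endpoint names are hypothetical:\nfrom sageworks.core.artifacts.model_core import ModelCore\n\nmodel = ModelCore(\"abalone-regression\")\nmodel.register_endpoint(\"abalone-regression-end\")\nprint(model.endpoints())\n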
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_class_labels","title":"set_class_labels(labels)
","text":"Return the class labels for this Model (if it's a classifier)
Parameters:
Name Type Description Defaultlabels
list[str]
List of class labels
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_class_labels(self, labels: list[str]):\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_features","title":"set_features(feature_columns)
","text":"Set the features for this Model
Parameters:
Name Type Description Defaultfeature_columns
list[str]
List of feature columns
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Defaultinput
str
Name of input for this artifact
requiredforce
bool
Force the input to be set (default: False)
False
Note: Manual override of the input is not allowed for Models (unless force=True)
Source code insrc/sageworks/core/artifacts/model_core.py
def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_pipeline","title":"set_pipeline(pipeline)
","text":"Set the pipeline for this model
Parameters:
Name Type Description Defaultpipeline
str
Pipeline that was used to create this model
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_target","title":"set_target(target_column)
","text":"Set the target for this Model
Parameters:
Name Type Description Defaulttarget_column
str
Target column for this Model
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.shapley_values","title":"shapley_values(capture_uuid='training_holdout')
","text":"Retrieve the Shapely values for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: training_holdout)
'training_holdout'
Returns:
Type DescriptionUnion[list[DataFrame], DataFrame, None]
pd.DataFrame: DataFrame of the Shapley values for the prediction DataFrame
Notes: These may or may not exist, depending on whether an Endpoint ran a Shapley analysis
Source code insrc/sageworks/core/artifacts/model_core.py
def shapley_values(self, capture_uuid: str = \"training_holdout\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapely values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: Dataframe of the shapley values for the prediction dataframe\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
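Because the return type depends on the model type, callers should branch on it; a sketch (the model name and capture_uuid are illustrative, and constructing ModelCore from a model name is assumed):

from sageworks.core.artifacts.model_core import ModelCore, ModelType

model = ModelCore("wine-classifier")  # hypothetical model name
shap = model.shapley_values(capture_uuid="training_holdout")
if shap is None:
    print("No Shapley values (no inference path, or Shapley was never run)")
elif model.model_type == ModelType.CLASSIFIER:
    for class_idx, df in enumerate(shap):  # one DataFrame per prediction class
        print(f"Class {class_idx}: {df.shape}")
else:
    print(f"Shapley values: {shap.shape}")  # single DataFrame for regressors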
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/model_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.target","title":"target()
","text":"Return the target for this Model (if supervised, else None)
Returns:
Name Type Descriptionstr
Union[str, None]
Target column for this Model (if supervised, else None)
Source code insrc/sageworks/core/artifacts/model_core.py
def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelType","title":"ModelType
","text":" Bases: Enum
Enumerated Types for SageWorks Model Types
Source code insrc/sageworks/core/artifacts/model_core.py
class ModelType(Enum):\n \"\"\"Enumerated Types for SageWorks Model Types\"\"\"\n\n CLASSIFIER = \"classifier\"\n REGRESSOR = \"regressor\"\n CLUSTERER = \"clusterer\"\n TRANSFORMER = \"transformer\"\n QUANTILE_REGRESSOR = \"quantile_regressor\"\n UNKNOWN = \"unknown\"\n
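Since ModelType is a plain Enum with string values, it is easy to branch on; for example:

from sageworks.core.artifacts.model_core import ModelType

model_type = ModelType("classifier")  # look up a member from its stored string value
if model_type == ModelType.CLASSIFIER:
    print("Expect one Shapley values CSV per prediction class")
elif model_type in (ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR):
    print("Expect a single Shapley values CSV")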
"},{"location":"core_classes/artifacts/monitor_core/","title":"MonitorCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the Monitor API Class and voil\u00e0, it works the same.
MonitorCore class for monitoring SageMaker endpoints
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore","title":"MonitorCore
","text":"Source code in src/sageworks/core/artifacts/monitor_core.py
class MonitorCore:\n def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"ExtractModelArtifact Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role = AWSAccountClamp().sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role, instance_type=self.instance_type)\n\n def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n\n def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n\n def 
details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n\n def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. 
Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n\n @staticmethod\n def process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n\n def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the 
baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n\n def _get_monitor_json_data(self, s3_path: str) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Convert the JSON monitoring data into a DataFrame\n Args:\n s3_path(str): The S3 path to the monitoring data\n Returns:\n pd.DataFrame: Monitoring data in DataFrame form (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=s3_path):\n self.log.warning(\"Monitoring data does not exist in S3.\")\n return None\n else:\n raw_json = read_s3_file(s3_path=s3_path)\n monitoring_data = json.loads(raw_json)\n monitoring_df = pd.json_normalize(monitoring_data[\"features\"])\n return monitoring_df\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. 
Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n\n def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n\n def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__init__","title":"__init__(endpoint_name, instance_type='ml.t3.large')
","text":"ExtractModelArtifact Class Args: endpoint_name (str): Name of the endpoint to set up monitoring for instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\". Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...
Source code insrc/sageworks/core/artifacts/monitor_core.py
def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"ExtractModelArtifact Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role = AWSAccountClamp().sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role, instance_type=self.instance_type)\n
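A construction sketch (the endpoint name is illustrative; the module path follows the source reference above):

from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")  # hypothetical endpoint name
print(mon)                                   # __repr__ prints the summary() fields
print(mon.summary())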
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__repr__","title":"__repr__()
","text":"String representation of this MonitorCore object
Returns:
Name Type Descriptionstr
str
String representation of this MonitorCore object
Source code insrc/sageworks/core/artifacts/monitor_core.py
def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for the SageMaker endpoint.
Parameters:
Name Type Description Defaultcapture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/core/artifacts/monitor_core.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.baseline_exists","title":"baseline_exists()
","text":"Check if baseline files exist in S3.
Returns:
Name Type Descriptionbool
bool
True if all files exist, False otherwise.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring Args: recreate (bool): If True, recreate the baseline even if it already exists Notes: This will create/write three files to the baseline_dir: - baseline.csv - constraints.json - statistics.json
Source code insrc/sageworks/core/artifacts/monitor_core.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint. Args: schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly). recreate (bool): If True, recreate the monitoring schedule even if it already exists.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n
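A typical setup flow for a realtime (non-serverless) endpoint, sketched with an illustrative endpoint name:

from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")  # hypothetical endpoint name

# 1. Capture 100% of endpoint traffic (this normally redeploys the endpoint)
mon.add_data_capture(capture_percentage=100)

# 2. Create the baseline files (baseline.csv, constraints.json, statistics.json)
mon.create_baseline()

# 3. Schedule hourly monitoring runs against the baseline
mon.create_monitoring_schedule(schedule="hourly")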
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.details","title":"details()
","text":"Return the details of the monitoring for the endpoint
Returns:
Name Type Descriptiondict
dict
The details of the monitoring for the endpoint
Source code insrc/sageworks/core/artifacts/monitor_core.py
def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture from S3.
Returns:
Name Type DescriptionDataFrame (input), DataFrame (output)
(DataFrame, DataFrame)
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n
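Once data capture is enabled and the endpoint has served some traffic, the most recent capture can be pulled back as DataFrames; a sketch with an illustrative endpoint name:

from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")  # hypothetical endpoint name
input_df, output_df = mon.get_latest_data_capture()
if input_df is not None:
    print(input_df.head())
    print(output_df.head())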
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n
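After create_baseline() has run, the three baseline artifacts can be read back as DataFrames (each getter returns None if the file does not exist yet); a sketch with an illustrative endpoint name:

from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")   # hypothetical endpoint name
baseline_df = mon.get_baseline()        # baseline.csv
constraints_df = mon.get_constraints()  # constraints.json
statistics_df = mon.get_statistics()    # statistics.json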
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.is_data_capture_configured","title":"is_data_capture_configured(capture_percentage)
","text":"Check if data capture is already configured on the endpoint. Args: capture_percentage (int): Expected data capture percentage. Returns: bool: True if data capture is already configured, False otherwise.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.last_run_details","title":"last_run_details()
","text":"Return the details of the last monitoring run for the endpoint
Returns:
Name Type Descriptiondict
Union[dict, None]
The details of the last monitoring run for the endpoint (None if no monitoring schedule)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.monitoring_schedule_exists","title":"monitoring_schedule_exists()
","text":"Code to figure out if a monitoring schedule already exists for this endpoint
Source code insrc/sageworks/core/artifacts/monitor_core.py
def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.process_captured_data","title":"process_captured_data(df)
staticmethod
","text":"Process the captured data DataFrame to extract and flatten the nested data.
Parameters:
Name Type Description Defaultdf
DataFrame
DataFrame with captured data.
requiredReturns:
Name Type DescriptionDataFrame (input), DataFrame (output)
(DataFrame, DataFrame)
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/core/artifacts/monitor_core.py
@staticmethod\ndef process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.setup_alerts","title":"setup_alerts()
","text":"Code to set up alerts based on monitoring results
Source code insrc/sageworks/core/artifacts/monitor_core.py
def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.summary","title":"summary()
","text":"Return the summary of information about the endpoint monitor
Returns:
Name Type Descriptiondict
dict
Summary of information about the endpoint monitor
Source code insrc/sageworks/core/artifacts/monitor_core.py
def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n
"},{"location":"core_classes/artifacts/overview/","title":"SageWorks Artifacts","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
"},{"location":"core_classes/artifacts/overview/#welcome-to-the-sageworks-core-artifact-classes","title":"Welcome to the SageWorks Core Artifact Classes","text":"These classes provide low-level APIs for the SageWorks package, they interact more directly with AWS Services and are therefore more complex with a fairly large number of methods.
These DataLoader Classes are intended to load larger datasets into AWS. For large data we need to use AWS Glue Jobs/Batch Jobs; in general the process is a bit more involved and offers fewer features.
If you have smaller data, please see DataLoaders Light
Welcome to the SageWorks DataLoaders Heavy Classes
These classes provide low-level APIs for loading larger data into AWS services
S3HeavyToDataSource
","text":"Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
class S3HeavyToDataSource:\n def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n\n @staticmethod\n def resolve_choice_fields(dyf):\n # Get schema fields\n schema_fields = dyf.schema().fields\n\n # Collect choice fields\n choice_fields = [(field.name, \"cast:long\") for field in schema_fields if field.dataType.typeName() == \"choice\"]\n print(f\"Choice Fields: {choice_fields}\")\n\n # If there are choice fields, resolve them\n if choice_fields:\n dyf = dyf.resolveChoice(specs=choice_fields)\n\n return dyf\n\n def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n\n @staticmethod\n def remove_periods_from_column_names(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n\n def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n ):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = 
self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_column_names(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", 
Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.__init__","title":"__init__(glue_context, input_uuid, output_uuid)
","text":"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultglue_context
GlueContext
GlueContext, AWS Glue Specific wrapper around SparkContext
requiredinput_uuid
str
The S3 Path to the files to be loaded
requiredoutput_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.remove_periods_from_column_names","title":"remove_periods_from_column_names(dyf)
staticmethod
","text":"Remove periods from column names in the DynamicFrame Args: dyf (DynamicFrame): The DynamicFrame to convert Returns: DynamicFrame: The converted DynamicFrame
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
@staticmethod\ndef remove_periods_from_column_names(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.timestamp_conversions","title":"timestamp_conversions(dyf, time_columns=[])
","text":"Convert columns in the DynamicFrame to the correct data types Args: dyf (DynamicFrame): The DynamicFrame to convert time_columns (list): A list of column names to convert to timestamp Returns: DynamicFrame: The converted DynamicFrame
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.transform","title":"transform(input_type='json', timestamp_columns=None, output_format='parquet')
","text":"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database Args: input_type (str): The type of input files, either 'csv' or 'json' timestamp_columns (list): A list of column names to convert to timestamp output_format (str): The format of the output files, either 'parquet' or 'orc'
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. 
This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_column_names(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
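A rough usage sketch for this transform, intended to run inside an AWS Glue job. The transform() arguments come from the docstring above, but the constructor arguments shown here (a GlueContext, an input S3 prefix, and an output uuid) are assumptions based on how the method uses self.glue_context and self.input_uuid, so check the class __init__ for the exact signature:
from sageworks.core.transforms.data_loaders.heavy import S3HeavyToDataSource  # import path assumed from the doc anchors\n\n# Sketch only: constructor arguments are assumed, not taken from the class docs\nheavy_loader = S3HeavyToDataSource(glue_context, \"s3://my-bucket/raw/events/\", \"heavy_events\")\nheavy_loader.transform(input_type=\"json\", timestamp_columns=[\"created_at\"], output_format=\"parquet\")\n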
"},{"location":"core_classes/transforms/data_loaders_light/","title":"DataLoaders Light","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
These DataLoader Classes are intended to load smaller datasets into AWS. If you have large data, please see DataLoaders Heavy
Welcome to the SageWorks DataLoaders Light Classes
These classes provide low-level APIs for loading smaller datasets into AWS services
CSVToDataSource
","text":" Bases: Transform
CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Common Usagecsv_to_data = CSVToDataSource(csv_file_path, data_uuid)\ncsv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\ncsv_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
class CSVToDataSource(Transform):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Common Usage:\n ```\n csv_to_data = CSVToDataSource(csv_file_path, data_uuid)\n csv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\n csv_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.__init__","title":"__init__(csv_file_path, data_uuid)
","text":"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultcsv_file_path
str
The path to the CSV file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource","title":"JSONToDataSource
","text":" Bases: Transform
JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Common Usagejson_to_data = JSONToDataSource(json_file_path, data_uuid)\njson_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\njson_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
class JSONToDataSource(Transform):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Common Usage:\n ```\n json_to_data = JSONToDataSource(json_file_path, data_uuid)\n json_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\n json_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.__init__","title":"__init__(json_file_path, data_uuid)
","text":"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultjson_file_path
str
The path to the JSON file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n
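Note that pd.read_json(..., lines=True) expects JSON Lines input (one JSON object per line), not a single JSON array. A small self-contained pandas sketch (the file name is hypothetical) that round-trips that format:
import pandas as pd\n\n# Write a tiny JSON Lines file: one JSON object per line\ndf = pd.DataFrame({\"id\": [1, 2], \"value\": [3.2, 4.7]})\ndf.to_json(\"sample.jsonl\", orient=\"records\", lines=True)\n\n# This is the same read that transform_impl performs on the input file\ndf2 = pd.read_json(\"sample.jsonl\", lines=True)\nprint(df2)\n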
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight","title":"S3ToDataSourceLight
","text":" Bases: Transform
S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource
Common Usages3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\ns3_to_data.set_output_tags([\"abalone\", \"whatever\"])\ns3_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
class S3ToDataSourceLight(Transform):\n \"\"\"S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource\n\n Common Usage:\n ```\n s3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\n s3_to_data.set_output_tags([\"abalone\", \"whatever\"])\n s3_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n\n def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.__init__","title":"__init__(s3_path, data_uuid, datatype='csv')
","text":"S3ToDataSourceLight Initialization
Parameters:
Name Type Description Defaults3_path
str
The S3 Path to the file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
requireddatatype
str
The datatype of the file to be transformed (defaults to \"csv\")
'csv'
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n
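A short usage sketch for the JSON case, since the Common Usage above only hints at the datatype choice (the S3 path and uuid are hypothetical; the import path follows the documentation anchors):
from sageworks.core.transforms.data_loaders.light import S3ToDataSourceLight  # import path assumed from the doc anchors\n\ns3_to_data = S3ToDataSourceLight(\"s3://my-bucket/incoming/events.jsonl\", \"events_data\", datatype=\"json\")\ns3_to_data.set_output_tags([\"events\", \"json\"])\ns3_to_data.transform()\n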
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.input_size_mb","title":"input_size_mb()
","text":"Get the size of the input S3 object in MBytes
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n
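The 100 MB guard above can also be applied up front when deciding which loader to use; a hedged sketch of that check with awswrangler (the bucket and object path are hypothetical):
import awswrangler as wr\n\ns3_path = \"s3://my-bucket/incoming/data.csv\"  # hypothetical object\nsize_mb = round(wr.s3.size_objects(s3_path)[s3_path] / 1_000_000)\nif size_mb > 100:\n    print(\"Object is too big for the light loader; use the heavy (Glue) data loaders instead\")\nelse:\n    print(f\"{size_mb} MB: OK for S3ToDataSourceLight\")\n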
"},{"location":"core_classes/transforms/data_to_features/","title":"Data To Features","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
MolecularDescriptors: Compute a Feature Set based on RDKit Descriptors
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight","title":"DataToFeaturesLight
","text":" Bases: Transform
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
Common Usageto_features = DataToFeaturesLight(data_uuid, feature_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_features.transform(target_column=\"target\"/None, id_column=\"id\"/None,\n event_time_column=\"date\"/None, query=str/None)\n
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
class DataToFeaturesLight(Transform):\n \"\"\"DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas\n\n Common Usage:\n ```\n to_features = DataToFeaturesLight(data_uuid, feature_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.transform(target_column=\"target\"/None, id_column=\"id\"/None,\n event_time_column=\"date\"/None, query=str/None)\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n\n def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n\n def post_transform(self, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n Args:\n target_column(str): The name of the target column in the output DataFrame (default: None)\n id_column(str): The name of the id column in the output DataFrame (default: None)\n event_time_column(str): The name of the event time column in the output DataFrame (default: None)\n auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid, auto_one_hot=auto_one_hot)\n output_features.set_input(\n self.output_df, target_column=target_column, id_column=id_column, event_time_column=event_time_column\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n\n # Create a default training_view for this FeatureSet\n fs = FeatureSetCore(self.output_uuid, force_refresh=True)\n fs.create_default_training_view()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"DataToFeaturesLight Initialization
Parameters:
Name Type Description Defaultdata_uuid
str
The UUID of the SageWorks DataSource to be transformed
requiredfeature_uuid
str
The UUID of the SageWorks FeatureSet to be created
required Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.post_transform","title":"post_transform(target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs)
","text":"At this point the output DataFrame should be populated, so publish it as a Feature Set Args: target_column(str): The name of the target column in the output DataFrame (default: None) id_column(str): The name of the id column in the output DataFrame (default: None) event_time_column(str): The name of the event time column in the output DataFrame (default: None) auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def post_transform(self, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n Args:\n target_column(str): The name of the target column in the output DataFrame (default: None)\n id_column(str): The name of the id column in the output DataFrame (default: None)\n event_time_column(str): The name of the event time column in the output DataFrame (default: None)\n auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid, auto_one_hot=auto_one_hot)\n output_features.set_input(\n self.output_df, target_column=target_column, id_column=id_column, event_time_column=event_time_column\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n\n # Create a default training_view for this FeatureSet\n fs = FeatureSetCore(self.output_uuid, force_refresh=True)\n fs.create_default_training_view()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.pre_transform","title":"pre_transform(query=None, **kwargs)
","text":"Pull the input DataSource into our Input Pandas DataFrame Args: query(str): Optional query to filter the input DataFrame
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n
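Because pre_transform accepts a query, the top-level transform() call can filter the DataSource before feature creation; a hedged sketch (uuids, column names, and the query string are hypothetical):
from sageworks.core.transforms.data_to_features.light.data_to_features_light import DataToFeaturesLight  # path per the doc anchor\n\nto_features = DataToFeaturesLight(\"solubility_data\", \"solubility_features\")\nto_features.set_output_tags([\"solubility\", \"filtered\"])\n# Only pull the rows we care about from the input DataSource\nto_features.transform(target_column=\"solubility\", id_column=\"id\", query=\"SELECT * FROM solubility_data WHERE solubility IS NOT NULL\")\n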
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.transform_impl","title":"transform_impl(**kwargs)
","text":"Transform the input DataFrame into a Feature Set
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n
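Since this transform_impl is just a pass-through, the intended pattern is to subclass DataToFeaturesLight and override it with your own Pandas feature logic; a minimal hedged sketch (uuids and column names are hypothetical):
from sageworks.core.transforms.data_to_features.light.data_to_features_light import DataToFeaturesLight  # path per the doc anchor\n\nclass MyFeatures(DataToFeaturesLight):\n    \"\"\"Example subclass: derive one extra feature column with Pandas\"\"\"\n\n    def transform_impl(self, **kwargs):\n        # self.input_df was populated by pre_transform(); write results to self.output_df\n        self.output_df = self.input_df.copy()\n        self.output_df[\"length_x_diameter\"] = self.output_df[\"length\"] * self.output_df[\"diameter\"]\n\nto_features = MyFeatures(\"abalone_data\", \"abalone_features\")\nto_features.set_output_tags([\"abalone\", \"example\"])\nto_features.transform(target_column=\"class_number_of_rings\", id_column=\"id\")\n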
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors","title":"MolecularDescriptors
","text":" Bases: DataToFeaturesLight
MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource
Common Usageto_features = MolecularDescriptors(data_uuid, feature_uuid)\nto_features.set_output_tags([\"aqsol\", \"whatever\"])\nto_features.transform()\n
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
class MolecularDescriptors(DataToFeaturesLight):\n \"\"\"MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource\n\n Common Usage:\n ```\n to_features = MolecularDescriptors(data_uuid, feature_uuid)\n to_features.set_output_tags([\"aqsol\", \"whatever\"])\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Turn off warnings for RDKIT (revisit this)\n RDLogger.DisableLog(\"rdApp.*\")\n\n def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Check the input DataFrame has the required columns\n if \"smiles\" not in self.input_df.columns:\n raise ValueError(\"Input DataFrame must have a 'smiles' column\")\n\n # There are certain smiles that cause Mordred to crash\n # We'll replace them with 'equivalent' smiles (these need to be verified)\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[O-]C([O-])=O.[NH4+]CCO.[NH4+]CCO\", \"[O]C([O])=O.[N]CCO.[N]CCO\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[NH4+]CCO.[NH4+]CCO.[O-]C([O-])=O\", \"[N]CCO.[N]CCO.[O]C([O])=O\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"O=S(=O)(Nn1c-nnc1)C1=CC=CC=C1\", \"O=S(=O)(NN(C=N1)C=N1)C(C=CC1)=CC=1\"\n )\n\n # Compute/add all the Molecular Descriptors\n self.output_df = self.compute_molecular_descriptors(self.input_df)\n\n # Get the columns that are descriptors\n desc_columns = set(self.output_df.columns) - set(self.input_df.columns)\n\n # Drop any NaNs (and INFs)\n current_rows = self.output_df.shape[0]\n self.output_df = pandas_utils.drop_nans(self.output_df, how=\"any\", subset=desc_columns)\n self.log.warning(f\"Dropped {current_rows - self.output_df.shape[0]} NaN rows\")\n\n def compute_molecular_descriptors(self, process_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute and add all the Molecular Descriptors\n Args:\n process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors\n Returns:\n pd.DataFrame: The input DataFrame with all the RDKit Descriptors added\n \"\"\"\n self.log.important(\"Computing Molecular Descriptors...\")\n\n # Conversion to Molecules\n molecules = [Chem.MolFromSmiles(smile) for smile in process_df[\"smiles\"]]\n\n # Now get all the RDKIT Descriptors\n all_descriptors = [x[0] for x in Descriptors._descList]\n\n # There's an overflow issue that happens with the IPC descriptor, so we'll remove it\n # See: https://github.com/rdkit/rdkit/issues/1527\n if \"Ipc\" in all_descriptors:\n all_descriptors.remove(\"Ipc\")\n\n # Make sure we don't have duplicates\n all_descriptors = list(set(all_descriptors))\n\n # Super useful Molecular Descriptor Calculator Class\n calc = MoleculeDescriptors.MolecularDescriptorCalculator(all_descriptors)\n column_names = calc.GetDescriptorNames()\n descriptor_values = [calc.CalcDescriptors(m) for m in molecules]\n rdkit_features_df = pd.DataFrame(descriptor_values, columns=column_names)\n\n # Now compute Mordred Features\n descriptor_choice = [AcidBase, Aromatic, Polarizability, RotatableBond]\n calc = Calculator()\n for des in descriptor_choice:\n calc.register(des)\n mordred_df = calc.pandas(molecules, nproc=1)\n\n # Return the DataFrame with the RDKit and Mordred 
Descriptors added\n return pd.concat([process_df, rdkit_features_df, mordred_df], axis=1)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"MolecularDescriptors Initialization
Parameters:
Name Type Description Defaultdata_uuid
str
The UUID of the SageWorks DataSource to be transformed
requiredfeature_uuid
str
The UUID of the SageWorks FeatureSet to be created
required Source code insrc/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Turn off warnings for RDKIT (revisit this)\n RDLogger.DisableLog(\"rdApp.*\")\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.compute_molecular_descriptors","title":"compute_molecular_descriptors(process_df)
","text":"Compute and add all the Molecular Descriptors Args: process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors Returns: pd.DataFrame: The input DataFrame with all the RDKit Descriptors added
Source code insrc/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def compute_molecular_descriptors(self, process_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute and add all the Molecular Descriptors\n Args:\n process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors\n Returns:\n pd.DataFrame: The input DataFrame with all the RDKit Descriptors added\n \"\"\"\n self.log.important(\"Computing Molecular Descriptors...\")\n\n # Conversion to Molecules\n molecules = [Chem.MolFromSmiles(smile) for smile in process_df[\"smiles\"]]\n\n # Now get all the RDKIT Descriptors\n all_descriptors = [x[0] for x in Descriptors._descList]\n\n # There's an overflow issue that happens with the IPC descriptor, so we'll remove it\n # See: https://github.com/rdkit/rdkit/issues/1527\n if \"Ipc\" in all_descriptors:\n all_descriptors.remove(\"Ipc\")\n\n # Make sure we don't have duplicates\n all_descriptors = list(set(all_descriptors))\n\n # Super useful Molecular Descriptor Calculator Class\n calc = MoleculeDescriptors.MolecularDescriptorCalculator(all_descriptors)\n column_names = calc.GetDescriptorNames()\n descriptor_values = [calc.CalcDescriptors(m) for m in molecules]\n rdkit_features_df = pd.DataFrame(descriptor_values, columns=column_names)\n\n # Now compute Mordred Features\n descriptor_choice = [AcidBase, Aromatic, Polarizability, RotatableBond]\n calc = Calculator()\n for des in descriptor_choice:\n calc.register(des)\n mordred_df = calc.pandas(molecules, nproc=1)\n\n # Return the DataFrame with the RDKit and Mordred Descriptors added\n return pd.concat([process_df, rdkit_features_df, mordred_df], axis=1)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.transform_impl","title":"transform_impl(**kwargs)
","text":"Compute a Feature Set based on RDKit Descriptors
Source code insrc/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Check the input DataFrame has the required columns\n if \"smiles\" not in self.input_df.columns:\n raise ValueError(\"Input DataFrame must have a 'smiles' column\")\n\n # There are certain smiles that cause Mordred to crash\n # We'll replace them with 'equivalent' smiles (these need to be verified)\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[O-]C([O-])=O.[NH4+]CCO.[NH4+]CCO\", \"[O]C([O])=O.[N]CCO.[N]CCO\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[NH4+]CCO.[NH4+]CCO.[O-]C([O-])=O\", \"[N]CCO.[N]CCO.[O]C([O])=O\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"O=S(=O)(Nn1c-nnc1)C1=CC=CC=C1\", \"O=S(=O)(NN(C=N1)C=N1)C(C=CC1)=CC=1\"\n )\n\n # Compute/add all the Molecular Descriptors\n self.output_df = self.compute_molecular_descriptors(self.input_df)\n\n # Get the columns that are descriptors\n desc_columns = set(self.output_df.columns) - set(self.input_df.columns)\n\n # Drop any NaNs (and INFs)\n current_rows = self.output_df.shape[0]\n self.output_df = pandas_utils.drop_nans(self.output_df, how=\"any\", subset=desc_columns)\n self.log.warning(f\"Dropped {current_rows - self.output_df.shape[0]} NaN rows\")\n
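The NaN/INF drop at the end relies on a SageWorks pandas utility; in plain pandas the equivalent idea looks roughly like the sketch below (the descriptor column names and values are hypothetical):
import numpy as np\nimport pandas as pd\n\ndf = pd.DataFrame({\"smiles\": [\"CCO\", \"CCC\"], \"desc_a\": [1.0, np.inf], \"desc_b\": [2.0, np.nan]})\ndesc_columns = [\"desc_a\", \"desc_b\"]\n\n# Treat INFs as NaNs, then drop any row with a NaN in a descriptor column\ndf = df.replace([np.inf, -np.inf], np.nan).dropna(how=\"any\", subset=desc_columns)\nprint(df)\n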
"},{"location":"core_classes/transforms/features_to_model/","title":"Features To Model","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
FeaturesToModel: Train/Create a Model from a Feature Set
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel","title":"FeaturesToModel
","text":" Bases: Transform
FeaturesToModel: Train/Create a Model from a FeatureSet
Common Usageto_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\nto_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_model.transform(target_column=\"class_number_of_rings\",\n input_feature_list=[feature_list])\n
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
class FeaturesToModel(Transform):\n \"\"\"FeaturesToModel: Train/Create a Model from a FeatureSet\n\n Common Usage:\n ```\n to_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\n to_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_model.transform(target_column=\"class_number_of_rings\",\n input_feature_list=[feature_list])\n ```\n \"\"\"\n\n def __init__(self, feature_uuid: str, model_uuid: str, model_type: ModelType = ModelType.UNKNOWN, model_class=None):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str): The class of the model (optional)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # If the model_type is UNKNOWN the model_class must be specified\n if model_type == ModelType.UNKNOWN:\n if model_class is None:\n msg = \"ModelType is UNKNOWN, must specify a model_class!\"\n self.log.critical(msg)\n raise ValueError(msg)\n else:\n self.log.info(\"ModelType is UNKNOWN, using model_class to determine the type...\")\n model_type = self._determine_model_type(model_class)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.estimator = None\n self.model_script_dir = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n\n def _determine_model_type(self, model_class: str) -> ModelType:\n \"\"\"Determine the ModelType from the model_class\n Args:\n model_class (str): The class of the model\n Returns:\n ModelType: The determined ModelType\n \"\"\"\n model_class_lower = model_class.lower()\n\n # Direct mapping for specific models\n specific_model_mapping = {\n \"logisticregression\": ModelType.CLASSIFIER,\n \"linearregression\": ModelType.REGRESSOR,\n \"ridge\": ModelType.REGRESSOR,\n \"lasso\": ModelType.REGRESSOR,\n \"elasticnet\": ModelType.REGRESSOR,\n \"bayesianridge\": ModelType.REGRESSOR,\n \"svc\": ModelType.CLASSIFIER,\n \"svr\": ModelType.REGRESSOR,\n \"gaussiannb\": ModelType.CLASSIFIER,\n \"kmeans\": ModelType.CLUSTERER,\n \"dbscan\": ModelType.CLUSTERER,\n \"meanshift\": ModelType.CLUSTERER,\n }\n\n if model_class_lower in specific_model_mapping:\n return specific_model_mapping[model_class_lower]\n\n # General pattern matching\n if \"regressor\" in model_class_lower:\n return ModelType.REGRESSOR\n elif \"classifier\" in model_class_lower:\n return ModelType.CLASSIFIER\n elif \"quantile\" in model_class_lower:\n return ModelType.QUANTILE_REGRESSOR\n elif \"cluster\" in model_class_lower:\n return ModelType.CLUSTERER\n elif \"transform\" in model_class_lower:\n return ModelType.TRANSFORMER\n else:\n self.log.critical(f\"Unknown ModelType for model_class: {model_class}\")\n return ModelType.UNKNOWN\n\n def generate_model_script(self, target_column: str, feature_list: list[str], train_all_data: bool) -> str:\n \"\"\"Fill in the model template with specific target and feature_list\n Args:\n target_column (str): Column name of the target variable\n feature_list (list[str]): A list of columns for the features\n 
train_all_data (bool): Train on ALL (100%) of the data\n Returns:\n str: The name of the generated model script\n \"\"\"\n\n # FIXME: Revisit all of this since it's a bit wonky\n # Did they specify a Scikit-Learn model class?\n if self.model_class:\n self.log.info(f\"Using Scikit-Learn model class: {self.model_class}\")\n script_name = \"generated_scikit_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_scikit_learn\")\n template_path = os.path.join(self.model_script_dir, \"scikit_learn.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n scikit_template = fp.read()\n\n # Template replacements\n aws_script = scikit_template.replace(\"{{model_class}}\", self.model_class)\n aws_script = aws_script.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.CLASSIFIER:\n script_name = \"generated_xgb_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_xgb_model\")\n template_path = os.path.join(self.model_script_dir, \"xgb_model.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n xgb_template = fp.read()\n\n # Template replacements\n aws_script = xgb_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.QUANTILE_REGRESSOR:\n script_name = \"generated_quantile_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_quant_regression\")\n template_path = os.path.join(self.model_script_dir, \"quant_regression.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n quant_template = fp.read()\n\n # Template replacements\n aws_script = quant_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n\n # Now write out the generated model script and return the name\n with open(output_path, \"w\") as fp:\n fp.write(aws_script)\n return script_name\n\n def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n ):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature 
Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n delete_model = ModelCore(self.output_uuid, force_refresh=True)\n delete_model.delete()\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.column_names()\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). 
A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n self.model_feature_list = [c for c in feature_list if c not in remove_columns]\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Generate our model script\n script_path = self.generate_model_script(self.target_column, self.model_feature_list, train_all_data)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.get_table_name()\n self.class_labels = feature_set.query(f\"select DISTINCT {self.target_column} FROM {table}\")[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.warning(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Create a Sagemaker Model with our script\n self.estimator = SKLearn(\n entry_point=script_path,\n source_dir=self.model_script_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.utcnow().strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n\n def post_transform(self, **kwargs):\n 
\"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type, force_refresh=True)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n\n def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.__init__","title":"__init__(feature_uuid, model_uuid, model_type=ModelType.UNKNOWN, model_class=None)
","text":"FeaturesToModel Initialization Args: feature_uuid (str): UUID of the FeatureSet to use as input model_uuid (str): UUID of the Model to create as output model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc. model_class (str): The class of the model (optional)
Source code insrc/sageworks/core/transforms/features_to_model/features_to_model.py
def __init__(self, feature_uuid: str, model_uuid: str, model_type: ModelType = ModelType.UNKNOWN, model_class=None):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str): The class of the model (optional)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # If the model_type is UNKNOWN the model_class must be specified\n if model_type == ModelType.UNKNOWN:\n if model_class is None:\n msg = \"ModelType is UNKNOWN, must specify a model_class!\"\n self.log.critical(msg)\n raise ValueError(msg)\n else:\n self.log.info(\"ModelType is UNKNOWN, using model_class to determine the type...\")\n model_type = self._determine_model_type(model_class)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.estimator = None\n self.model_script_dir = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n
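When a scikit-learn model_class is given, model_type can be left as UNKNOWN and will be inferred as described above; a hedged sketch (the FeatureSet/Model uuids and the feature columns are hypothetical):
from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel  # path per the doc anchor\n\nto_model = FeaturesToModel(\"abalone_features\", \"abalone-ridge\", model_class=\"Ridge\")  # Ridge maps to ModelType.REGRESSOR\nto_model.set_output_tags([\"abalone\", \"ridge\", \"example\"])\nto_model.transform(target_column=\"class_number_of_rings\", feature_list=[\"length\", \"diameter\", \"height\"])\n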
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.create_and_register_model","title":"create_and_register_model()
","text":"Create and Register the Model
Source code insrc/sageworks/core/transforms/features_to_model/features_to_model.py
def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.generate_model_script","title":"generate_model_script(target_column, feature_list, train_all_data)
","text":"Fill in the model template with specific target and feature_list Args: target_column (str): Column name of the target variable feature_list (list[str]): A list of columns for the features train_all_data (bool): Train on ALL (100%) of the data Returns: str: The name of the generated model script
Source code insrc/sageworks/core/transforms/features_to_model/features_to_model.py
def generate_model_script(self, target_column: str, feature_list: list[str], train_all_data: bool) -> str:\n \"\"\"Fill in the model template with specific target and feature_list\n Args:\n target_column (str): Column name of the target variable\n feature_list (list[str]): A list of columns for the features\n train_all_data (bool): Train on ALL (100%) of the data\n Returns:\n str: The name of the generated model script\n \"\"\"\n\n # FIXME: Revisit all of this since it's a bit wonky\n # Did they specify a Scikit-Learn model class?\n if self.model_class:\n self.log.info(f\"Using Scikit-Learn model class: {self.model_class}\")\n script_name = \"generated_scikit_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_scikit_learn\")\n template_path = os.path.join(self.model_script_dir, \"scikit_learn.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n scikit_template = fp.read()\n\n # Template replacements\n aws_script = scikit_template.replace(\"{{model_class}}\", self.model_class)\n aws_script = aws_script.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.CLASSIFIER:\n script_name = \"generated_xgb_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_xgb_model\")\n template_path = os.path.join(self.model_script_dir, \"xgb_model.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n xgb_template = fp.read()\n\n # Template replacements\n aws_script = xgb_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.QUANTILE_REGRESSOR:\n script_name = \"generated_quantile_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_quant_regression\")\n template_path = os.path.join(self.model_script_dir, \"quant_regression.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n quant_template = fp.read()\n\n # Template replacements\n aws_script = quant_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n\n # Now write out the generated model script and return the name\n with open(output_path, \"w\") as fp:\n 
fp.write(aws_script)\n return script_name\n
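The template handling above is plain string substitution. Here is a standalone sketch of the same pattern; the template text is invented for illustration (the real templates live next to the source file):
```
import json

# Toy stand-in for xgb_model.template / scikit_learn.template
template = (
    "TARGET = '{{target_column}}'\n"
    "FEATURES = {{feature_list}}\n"
    "TRAIN_ALL_DATA = '{{train_all_data}}'\n"
)

# Same replace() pattern used by generate_model_script()
script = template.replace("{{target_column}}", "class_number_of_rings")
script = script.replace("{{feature_list}}", json.dumps(["length", "diameter", "height"]))
script = script.replace("{{train_all_data}}", str(False))
print(script)
```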
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() on the Model
Source code insrc/sageworks/core/transforms/features_to_model/features_to_model.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type, force_refresh=True)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.transform_impl","title":"transform_impl(target_column, description=None, feature_list=None, train_all_data=False)
","text":"Generic Features to Model: Note you should create a new class and inherit from this one to include specific logic for your Feature Set/Model Args: target_column (str): Column name of the target variable description (str): Description of the model (optional) feature_list (list[str]): A list of columns for the features (default None, will try to guess) train_all_data (bool): Train on ALL (100%) of the data (default False)
Source code insrc/sageworks/core/transforms/features_to_model/features_to_model.py
def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n delete_model = ModelCore(self.output_uuid, force_refresh=True)\n delete_model.delete()\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.column_names()\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). 
A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n self.model_feature_list = [c for c in feature_list if c not in remove_columns]\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Generate our model script\n script_path = self.generate_model_script(self.target_column, self.model_feature_list, train_all_data)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.get_table_name()\n self.class_labels = feature_set.query(f\"select DISTINCT {self.target_column} FROM {table}\")[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.warning(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Create a Sagemaker Model with our script\n self.estimator = SKLearn(\n entry_point=script_path,\n source_dir=self.model_script_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.utcnow().strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n
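A small standalone sketch of the feature-guessing and type-filtering logic above; the column names and Feature Store type details are made up for illustration:
```
# Hypothetical FeatureSet metadata
all_columns = ["id", "event_time", "training", "length", "diameter", "sex", "class_number_of_rings"]
column_details = {
    "id": "Integral", "event_time": "String", "training": "Integral",
    "length": "Fractional", "diameter": "Fractional",
    "sex": "String", "class_number_of_rings": "Integral",
}
target_column = "class_number_of_rings"

# Drop bookkeeping/AWS columns and the target (same filter list as transform_impl)
filter_list = ["id", "__index_level_0__", "write_time", "api_invocation_time",
               "is_deleted", "event_time", "training", target_column]
feature_list = [c for c in all_columns if c not in filter_list]

# Keep only Integral/Fractional columns (the Feature Store numeric types)
model_feature_list = [c for c in feature_list if column_details[c] in ["Integral", "Fractional"]]
print(model_feature_list)  # ['length', 'diameter']
```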
"},{"location":"core_classes/transforms/model_to_endpoint/","title":"Model to Endpoint","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
ModelToEndpoint: Deploy an Endpoint for a Model
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint","title":"ModelToEndpoint
","text":" Bases: Transform
ModelToEndpoint: Deploy an Endpoint for a Model
Common Usageto_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\nto_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\nto_endpoint.transform()\n
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
class ModelToEndpoint(Transform):\n \"\"\"ModelToEndpoint: Deploy an Endpoint for a Model\n\n Common Usage:\n ```\n to_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\n to_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\n to_endpoint.transform()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n Artifact.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n\n def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n existing_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n if existing_endpoint.exists():\n existing_endpoint.delete()\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Will this be a Serverless Endpoint?\n if self.instance_type == \"serverless\":\n self._serverless_deploy(model_package_arn)\n else:\n self._realtime_deploy(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid, force_refresh=True)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n\n def _realtime_deploy(self, model_package_arn: str):\n \"\"\"Internal Method: Deploy the Realtime Endpoint\n\n Args:\n model_package_arn(str): The Model Package ARN used to deploy the Endpoint\n \"\"\"\n # Create a Model Package\n model_package = ModelPackage(role=self.sageworks_role_arn, model_package_arn=model_package_arn)\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Deploy a Realtime Endpoint\n model_package.deploy(\n initial_instance_count=1,\n instance_type=self.instance_type,\n endpoint_name=self.output_uuid,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n tags=aws_tags,\n )\n\n def _serverless_deploy(self, model_package_arn, mem_size=2048, max_concurrency=5, wait=True):\n \"\"\"Internal Method: Deploy a Serverless Endpoint\n\n Args:\n mem_size(int): Memory size in MB (default: 2048)\n max_concurrency(int): Max concurrency (default: 5)\n wait(bool): Wait for the Endpoint to be ready (default: True)\n \"\"\"\n model_name = self.input_uuid\n endpoint_name = self.output_uuid\n aws_tags = self.get_aws_tags()\n\n # Create Low Level Model Resource (Endpoint Config below references this Model Resource)\n # Note: Since model is internal to the endpoint we'll add a timestamp (just like SageMaker does)\n datetime_str = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S-%f\")[:-3]\n model_name = f\"{model_name}-{datetime_str}\"\n self.log.info(f\"Creating Low Level Model: {model_name}...\")\n self.sm_client.create_model(\n ModelName=model_name,\n PrimaryContainer={\n \"ModelPackageName\": model_package_arn,\n },\n ExecutionRoleArn=self.sageworks_role_arn,\n 
Tags=aws_tags,\n )\n\n # Create Endpoint Config\n self.log.info(f\"Creating Endpoint Config {endpoint_name}...\")\n try:\n self.sm_client.create_endpoint_config(\n EndpointConfigName=endpoint_name,\n ProductionVariants=[\n {\n \"ServerlessConfig\": {\"MemorySizeInMB\": mem_size, \"MaxConcurrency\": max_concurrency},\n \"ModelName\": model_name,\n \"VariantName\": \"AllTraffic\",\n }\n ],\n )\n except ClientError as e:\n # Already Exists: Check if ValidationException and existing endpoint configuration\n if (\n e.response[\"Error\"][\"Code\"] == \"ValidationException\"\n and \"already existing endpoint configuration\" in e.response[\"Error\"][\"Message\"]\n ):\n self.log.warning(\"Endpoint configuration already exists: Deleting and retrying...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)\n self.sm_client.create_endpoint_config(\n EndpointConfigName=endpoint_name,\n ProductionVariants=[\n {\n \"ServerlessConfig\": {\"MemorySizeInMB\": mem_size, \"MaxConcurrency\": max_concurrency},\n \"ModelName\": model_name,\n \"VariantName\": \"AllTraffic\",\n }\n ],\n )\n\n # Create Endpoint\n self.log.info(f\"Creating Serverless Endpoint {endpoint_name}...\")\n self.sm_client.create_endpoint(\n EndpointName=endpoint_name, EndpointConfigName=endpoint_name, Tags=self.get_aws_tags()\n )\n\n # Wait for Endpoint to be ready\n if not wait:\n self.log.important(f\"Endpoint {endpoint_name} is being created...\")\n else:\n self.log.important(f\"Waiting for Endpoint {endpoint_name} to be ready...\")\n describe_endpoint_response = self.sm_client.describe_endpoint(EndpointName=endpoint_name)\n while describe_endpoint_response[\"EndpointStatus\"] == \"Creating\":\n time.sleep(30)\n describe_endpoint_response = self.sm_client.describe_endpoint(EndpointName=endpoint_name)\n self.log.info(f\"Endpoint Status: {describe_endpoint_response['EndpointStatus']}\")\n status = describe_endpoint_response[\"EndpointStatus\"]\n if status != \"InService\":\n msg = f\"Endpoint {endpoint_name} failed to be created. Status: {status}\"\n details = describe_endpoint_response[\"FailureReason\"]\n self.log.critical(msg)\n self.log.critical(details)\n raise Exception(msg)\n self.log.important(f\"Endpoint {endpoint_name} is now {status}...\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n output_endpoint.onboard()\n
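A usage sketch of the serverless flag handled above (UUIDs are illustrative; the import path mirrors the source location shown): serverless=True, the default, goes through _serverless_deploy(), while serverless=False deploys a realtime ml.t2.medium endpoint.
```
# Import path assumed from the source file location shown above
from sageworks.core.transforms.model_to_endpoint.model_to_endpoint import ModelToEndpoint

# Default: serverless endpoint
to_endpoint = ModelToEndpoint("abalone-regression", "abalone-regression-end")
to_endpoint.set_output_tags(["abalone", "public"])
to_endpoint.transform()

# Realtime endpoint on an ml.t2.medium instance instead
realtime = ModelToEndpoint("abalone-regression", "abalone-regression-end-rt", serverless=False)
realtime.transform()
```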
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.__init__","title":"__init__(model_uuid, endpoint_uuid, serverless=True)
","text":"ModelToEndpoint Initialization Args: model_uuid(str): The UUID of the input Model endpoint_uuid(str): The UUID of the output Endpoint serverless(bool): Deploy the Endpoint in serverless mode (default: True)
Source code insrc/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n Artifact.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the Endpoint
Source code insrc/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n output_endpoint.onboard()\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.transform_impl","title":"transform_impl()
","text":"Deploy an Endpoint for a Model
Source code insrc/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n existing_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n if existing_endpoint.exists():\n existing_endpoint.delete()\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Will this be a Serverless Endpoint?\n if self.instance_type == \"serverless\":\n self._serverless_deploy(model_package_arn)\n else:\n self._realtime_deploy(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid, force_refresh=True)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n
"},{"location":"core_classes/transforms/overview/","title":"Transforms","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
SageWorks currently has a large set of Transforms that go from one Artifact type to another (e.g. DataSource to FeatureSet). The Transforms will often have light and heavy versions depending on the scale of data that needs to be transformed.
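As an illustrative sketch of chaining Transforms from one Artifact type to the next, two of the classes documented in this section take a FeatureSet to a Model and a Model to an Endpoint. The UUIDs and import paths are assumptions, and the keyword arguments assume the base Transform.transform() forwards them to transform_impl():
```
# Import paths assumed from the source file locations shown in these docs
from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel
from sageworks.core.transforms.model_to_endpoint.model_to_endpoint import ModelToEndpoint
from sageworks.core.artifacts.model_core import ModelType  # assumed location of ModelType

# FeatureSet -> Model
to_model = FeaturesToModel("abalone_features", "abalone-regression", model_type=ModelType.REGRESSOR)
to_model.set_output_tags(["abalone", "public"])
to_model.transform(target_column="class_number_of_rings", description="Abalone Regression Model")

# Model -> Endpoint
to_endpoint = ModelToEndpoint("abalone-regression", "abalone-regression-end")
to_endpoint.set_output_tags(["abalone", "public"])
to_endpoint.transform()
```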
"},{"location":"core_classes/transforms/overview/#transform-details","title":"Transform Details","text":"API Classes
The API Classes will often provide helpful methods that give you a DataFrame (data_source.query() for instance), so always check out the API Classes first.
These Transforms give you the ultimate in customization and flexibility when creating AWS Machine Learning Pipelines. Grab a Pandas DataFrame from a DataSource or FeatureSet, process it however your use case requires, and simply create another SageWorks DataSource or FeatureSet from the resulting DataFrame.
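A round-trip sketch of that workflow, assuming the classes below are exported from sageworks.core.transforms.pandas_transforms and using illustrative UUIDs and column names:
```
from sageworks.core.transforms.pandas_transforms import FeaturesToPandas, PandasToData

# FeatureSet -> Pandas DataFrame
features_to_df = FeaturesToPandas("abalone_features")
features_to_df.transform(max_rows=10000)
df = features_to_df.get_output()

# Any Pandas processing your use case needs (illustrative feature engineering)
df["length_to_diameter"] = df["length"] / df["diameter"]

# Pandas DataFrame -> new DataSource
df_to_data = PandasToData("abalone_ratio_data")
df_to_data.set_output_tags(["abalone", "derived"])
df_to_data.set_input(df)
df_to_data.transform()
```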
Lots of Options:
Not for Large Data
Pandas Transforms can't handle large datasets (> 4 gigabytes). For transforms on large data, see our Heavy Transforms.
Welcome to the SageWorks Pandas Transform Classes
These classes provide low-level APIs for using Pandas DataFrames
DataToPandas
","text":" Bases: Transform
DataToPandas: Class to transform a Data Source into a Pandas DataFrame
Common Usagedata_to_df = DataToPandas(data_source_uuid)\ndata_to_df.transform(query=<optional SQL query to filter/process data>)\ndata_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = data_to_df.get_output()\n\nNote: query is the best way to use this class, so use it :)\n
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
class DataToPandas(Transform):\n \"\"\"DataToPandas: Class to transform a Data Source into a Pandas DataFrame\n\n Common Usage:\n ```\n data_to_df = DataToPandas(data_source_uuid)\n data_to_df.transform(query=<optional SQL query to filter/process data>)\n data_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = data_to_df.get_output()\n\n Note: query is the best way to use this class, so use it :)\n ```\n \"\"\"\n\n def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n\n def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.__init__","title":"__init__(input_uuid)
","text":"DataToPandas Initialization
Source code insrc/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code insrc/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code insrc/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.transform_impl","title":"transform_impl(query=None, max_rows=100000)
","text":"Convert the DataSource into a Pandas DataFrame Args: query(str): The query to run against the DataSource (default: None) max_rows(int): The maximum number of rows to return (default: 100000)
Source code insrc/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n
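A usage sketch of the two code paths above (the DataSource UUID and query are illustrative): an explicit query bypasses the sampling logic entirely, otherwise a TABLESAMPLE BERNOULLI query is generated whenever the table has more rows than max_rows.
```
from sageworks.core.transforms.pandas_transforms import DataToPandas  # assumed export

data_to_df = DataToPandas("abalone_data")

# Path 1: explicit query (recommended), overrides the sampling logic
data_to_df.transform(query="SELECT length, diameter, height FROM abalone_data WHERE height > 0")
my_df = data_to_df.get_output()

# Path 2: no query; large tables get sampled down to roughly max_rows
data_to_df.transform(max_rows=50000)
sampled_df = data_to_df.get_output()
```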
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas","title":"FeaturesToPandas
","text":" Bases: Transform
FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame
Common Usagefeature_to_df = FeaturesToPandas(feature_set_uuid)\nfeature_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = feature_to_df.get_output()\n
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
class FeaturesToPandas(Transform):\n \"\"\"FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame\n\n Common Usage:\n ```\n feature_to_df = FeaturesToPandas(feature_set_uuid)\n feature_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = feature_to_df.get_output()\n ```\n \"\"\"\n\n def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n\n def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
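One behavior worth noting from the class above: get_output() checks transform_run and will run the transform for you (with its default max_rows) if it hasn't been run yet. The FeatureSet UUID is illustrative and the import path is assumed:
```
from sageworks.core.transforms.pandas_transforms import FeaturesToPandas  # assumed export

feature_to_df = FeaturesToPandas("abalone_features")

# No explicit transform() call needed; get_output() triggers it on first use
my_df = feature_to_df.get_output()
print(my_df.shape)
```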
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.__init__","title":"__init__(feature_set_name)
","text":"FeaturesToPandas Initialization
Source code insrc/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code insrc/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code insrc/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.transform_impl","title":"transform_impl(max_rows=100000)
","text":"Convert the FeatureSet into a Pandas DataFrame
Source code insrc/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData","title":"PandasToData
","text":" Bases: Transform
PandasToData: Class to publish a Pandas DataFrame as a DataSource
Common Usagedf_to_data = PandasToData(output_uuid)\ndf_to_data.set_output_tags([\"test\", \"small\"])\ndf_to_data.set_input(test_df)\ndf_to_data.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
class PandasToData(Transform):\n \"\"\"PandasToData: Class to publish a Pandas DataFrame as a DataSource\n\n Common Usage:\n ```\n df_to_data = PandasToData(output_uuid)\n df_to_data.set_output_tags([\"test\", \"small\"])\n df_to_data.set_input(test_df)\n df_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n\n def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n\n def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n\n def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n\n @staticmethod\n def convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n\n def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # 
Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works will for most uses cases. We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() fnr the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid, force_refresh=True)\n output_data_source.onboard()\n
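A short sketch of the output_format choice handled above (DataFrame contents and UUIDs are illustrative): Parquet is the default and the recommended format, and passing "jsonl" works but logs warnings steering you back toward Parquet.
```
import pandas as pd
from sageworks.core.transforms.pandas_transforms import PandasToData  # assumed export

df = pd.DataFrame({"ID": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # "ID" gets lowercased to "id"

# Default: Parquet (recommended)
to_parquet = PandasToData("toy_data")
to_parquet.set_input(df)
to_parquet.transform()

# JSONL: supported, but both __init__ and transform_impl() log warnings recommending Parquet
to_jsonl = PandasToData("toy_data_jsonl", output_format="jsonl")
to_jsonl.set_input(df)
to_jsonl.transform()
```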
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.__init__","title":"__init__(output_uuid, output_format='parquet')
","text":"PandasToData Initialization Args: output_uuid (str): The UUID of the DataSource to create output_format (str): The file format to store the S3 object data in (default: \"parquet\")
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_datetime_columns","title":"convert_datetime_columns(df)
staticmethod
","text":"Convert datetime columns to ISO-8601 string
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
@staticmethod\ndef convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n
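A quick illustration of the static conversion above on a tiny, made-up DataFrame; the exact string format comes from the module's datetime_to_iso8601 helper, which isn't shown here:
```
import pandas as pd
from sageworks.core.transforms.pandas_transforms import PandasToData  # assumed export

df = pd.DataFrame({
    "id": [1, 2],
    "created": pd.to_datetime(["2024-01-01 12:00", "2024-01-02 13:30"]),
})

# datetime64 column -> ISO-8601 strings (Athena/Feature Store friendly)
df = PandasToData.convert_datetime_columns(df)
print(df.dtypes)              # 'created' is now a pandas string dtype
print(df["created"].iloc[0])  # ISO-8601 formatted string
```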
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_datetime","title":"convert_object_to_datetime(df)
","text":"Try to automatically convert object columns to datetime or string columns
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_string","title":"convert_object_to_string(df)
","text":"Try to automatically convert object columns to string columns
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() fnr the DataSource
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() fnr the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid, force_refresh=True)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.set_input","title":"set_input(input_df)
","text":"Set the DataFrame Input for this Transform
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.transform_impl","title":"transform_impl(overwrite=True, **kwargs)
","text":"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Parameters: overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket (default: True)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works will for most uses cases. We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures","title":"PandasToFeatures
","text":" Bases: Transform
PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet
Common Usageto_features = PandasToFeatures(output_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_features.set_input(df, id_column=\"id\"/None, event_time_column=\"date\"/None)\nto_features.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
class PandasToFeatures(Transform):\n \"\"\"PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet\n\n Common Usage:\n ```\n to_features = PandasToFeatures(output_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.set_input(df, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, auto_one_hot=False):\n \"\"\"PandasToFeatures Initialization\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n auto_one_hot (bool): Should we automatically one-hot encode categorical columns?\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.target_column = None\n self.id_column = None\n self.event_time_column = None\n self.auto_one_hot = auto_one_hot\n self.categorical_dtypes = {}\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n\n # Delete the existing FeatureSet if it exists\n self.delete_existing()\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n\n def set_input(self, input_df: pd.DataFrame, target_column=None, id_column=None, event_time_column=None):\n \"\"\"Set the Input DataFrame for this Transform\n Args:\n input_df (pd.DataFrame): The input DataFrame\n target_column (str): The name of the target column (default: None)\n id_column (str): The name of the id column (default: None)\n event_time_column (str): The name of the event_time column (default: None)\n \"\"\"\n self.target_column = target_column\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n\n def delete_existing(self):\n # Delete the existing FeatureSet if it exists\n try:\n delete_fs = FeatureSetCore(self.output_uuid)\n if delete_fs.exists():\n self.log.info(f\"Deleting the {self.output_uuid} FeatureSet...\")\n delete_fs.delete()\n time.sleep(1)\n except ClientError as exc:\n self.log.info(f\"FeatureSet {self.output_uuid} doesn't exist...\")\n self.log.info(exc)\n\n def _ensure_id_column(self):\n \"\"\"Internal: AWS Feature Store requires an Id field for all data store\"\"\"\n if self.id_column is None or self.id_column not in self.output_df.columns:\n if \"id\" not in self.output_df.columns:\n self.log.info(\"Generating an id column before FeatureSet Creation...\")\n self.output_df[\"id\"] = self.output_df.index\n self.id_column = \"id\"\n\n def _ensure_event_time(self):\n \"\"\"Internal: AWS Feature Store requires an event_time field for all data stored\"\"\"\n if self.event_time_column is None or self.event_time_column not in self.output_df.columns:\n self.log.info(\"Generating an event_time column before FeatureSet Creation...\")\n self.event_time_column = \"event_time\"\n self.output_df[self.event_time_column] = pd.Timestamp(\"now\", tz=\"UTC\")\n\n # The event_time_column is defined, so we need to make sure it's in ISO-8601 string format\n # Note: AWS Feature Store only a particular ISO-8601 format not ALL ISO-8601 formats\n time_column = self.output_df[self.event_time_column]\n\n # Check if the event_time_column is of type object or string convert it to 
DateTime\n if time_column.dtypes == \"object\" or time_column.dtypes.name == \"string\":\n self.log.info(f\"Converting {self.event_time_column} to DateTime...\")\n time_column = pd.to_datetime(time_column)\n\n # Let's make sure it the right type for Feature Store\n if pd.api.types.is_datetime64_any_dtype(time_column):\n self.log.info(f\"Converting {self.event_time_column} to ISOFormat Date String before FeatureSet Creation...\")\n\n # Convert the datetime DType to ISO-8601 string\n # TableFormat=ICEBERG does not support alternate formats for event_time field, it only supports String type.\n time_column = time_column.map(datetime_to_iso8601)\n self.output_df[self.event_time_column] = time_column.astype(\"string\")\n\n def _convert_objs_to_string(self):\n \"\"\"Internal: AWS Feature Store doesn't know how to store object dtypes, so convert to String\"\"\"\n for col in self.output_df:\n if pd.api.types.is_object_dtype(self.output_df[col].dtype):\n self.output_df[col] = self.output_df[col].astype(pd.StringDtype())\n\n def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n\n def shorten_column_name(self, name, max_length=20):\n if len(name) <= max_length:\n return name\n\n # Start building the new name from the end\n parts = name.split(\"_\")[::-1]\n new_name = \"\"\n for part in parts:\n if len(new_name) + len(part) + 1 <= max_length: # +1 for the underscore\n new_name = f\"{part}_{new_name}\" if new_name else part\n else:\n break\n\n # If new_name is empty, just use the last part of the original name\n if not new_name:\n new_name = parts[0]\n\n self.log.info(f\"Shortening {name} to {new_name}\")\n return new_name\n\n def sanitize_column_name(self, name):\n # Remove all invalid characters\n sanitized = re.sub(\"[^a-zA-Z0-9-_]\", \"_\", name)\n sanitized = re.sub(\"_+\", \"_\", sanitized)\n sanitized = sanitized.strip(\"_\")\n\n # Log the change if the name was altered\n if sanitized != name:\n self.log.info(f\"Sanitizing {name} to {sanitized}\")\n\n return sanitized\n\n def one_hot_encoding(self, df, categorical_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\"\"\"\n\n # Now convert Categorical Types to One Hot Encoding\n current_columns = list(df.columns)\n df = pd.get_dummies(df, columns=categorical_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n\n # Helper Methods\n def auto_convert_columns_to_categorical(self):\n \"\"\"Convert object and string types to Categorical\"\"\"\n categorical_columns = []\n for feature, dtype in self.output_df.dtypes.items():\n if dtype in [\"object\", \"string\", \"category\"] and feature not in [\n 
self.event_time_column,\n self.id_column,\n self.target_column,\n ]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 6:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n categorical_columns.append(feature)\n\n # Now run one hot encoding on categorical columns\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n\n def manual_categorical_converter(self):\n \"\"\"Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n # Now convert Categorical Types to One Hot Encoding\n categorical_columns = list(self.categorical_dtypes.keys())\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n\n @staticmethod\n def convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Special case for datetime types\n for column in df.select_dtypes(include=[\"datetime\"]).columns:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n\n def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Convert object and string types to Categorical\n if self.auto_one_hot:\n self.auto_convert_columns_to_categorical()\n else:\n self.manual_categorical_converter()\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n 
my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group\"\"\"\n self.output_feature_group = self.create_feature_group()\n\n def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(\"Ingesting rows into Feature Group...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_processes=8, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows to be ingested: {self.expected_rows}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid, force_refresh=True)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n def ensure_feature_group_created(self, feature_group):\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n while status == \"Creating\":\n self.log.debug(\"FeatureSet being Created...\")\n time.sleep(5)\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n self.log.info(f\"FeatureSet {feature_group.name} successfully created\")\n\n def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = 
self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n not_all_rows_retry = 5\n while rows < expected_rows and not_all_rows_retry > 0:\n sleep_time = 5 if rows else 60\n not_all_rows_retry -= 1 if rows else 0\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n self.log.warning(\n f\"Did not reach expected rows ({rows}/{expected_rows}) but we're not sweating the small stuff...\"\n )\n
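As a concrete illustration of the Common Usage pattern at the top of this class, here is a hedged sketch of a tiny end-to-end run; the 'test_features' UUID and the toy DataFrame are made up for the example, and it assumes a working SageWorks/AWS configuration.

import pandas as pd
from sageworks.core.transforms.pandas_transforms import PandasToFeatures

# Tiny toy DataFrame with an id column and a couple of features
df = pd.DataFrame({"id": [1, 2, 3], "length": [0.4, 0.5, 0.6], "rings": [7, 9, 10]})

# Publish it as a FeatureSet named 'test_features'
to_features = PandasToFeatures("test_features")
to_features.set_output_tags(["abalone", "public"])
to_features.set_input(df, id_column="id")
to_features.transform()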
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.__init__","title":"__init__(output_uuid, auto_one_hot=False)
","text":"PandasToFeatures Initialization Args: output_uuid (str): The UUID of the FeatureSet to create auto_one_hot (bool): Should we automatically one-hot encode categorical columns?
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def __init__(self, output_uuid: str, auto_one_hot=False):\n \"\"\"PandasToFeatures Initialization\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n auto_one_hot (bool): Should we automatically one-hot encode categorical columns?\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.target_column = None\n self.id_column = None\n self.event_time_column = None\n self.auto_one_hot = auto_one_hot\n self.categorical_dtypes = {}\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n\n # Delete the existing FeatureSet if it exists\n self.delete_existing()\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.auto_convert_columns_to_categorical","title":"auto_convert_columns_to_categorical()
","text":"Convert object and string types to Categorical
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def auto_convert_columns_to_categorical(self):\n \"\"\"Convert object and string types to Categorical\"\"\"\n categorical_columns = []\n for feature, dtype in self.output_df.dtypes.items():\n if dtype in [\"object\", \"string\", \"category\"] and feature not in [\n self.event_time_column,\n self.id_column,\n self.target_column,\n ]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 6:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n categorical_columns.append(feature)\n\n # Now run one hot encoding on categorical columns\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_column_types","title":"convert_column_types(df)
staticmethod
","text":"Convert the types of the DataFrame to the correct types for the Feature Store
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
@staticmethod\ndef convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Special case for datetime types\n for column in df.select_dtypes(include=[\"datetime\"]).columns:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.create_feature_group","title":"create_feature_group()
","text":"Create a Feature Group, load our Feature Definitions, and wait for it to be ready
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.manual_categorical_converter","title":"manual_categorical_converter()
","text":"Convert object and string types to Categorical
Note: This method is used for streaming/chunking. You can set the categorical_dtypes attribute to a dictionary of column names and their respective categorical types.
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def manual_categorical_converter(self):\n \"\"\"Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n # Now convert Categorical Types to One Hot Encoding\n categorical_columns = list(self.categorical_dtypes.keys())\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.one_hot_encoding","title":"one_hot_encoding(df, categorical_columns)
","text":"One Hot Encoding for Categorical Columns with additional column name management
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def one_hot_encoding(self, df, categorical_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\"\"\"\n\n # Now convert Categorical Types to One Hot Encoding\n current_columns = list(df.columns)\n df = pd.get_dummies(df, columns=categorical_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Populating Offline Storage and onboard()
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid, force_refresh=True)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group\"\"\"\n self.output_feature_group = self.create_feature_group()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.prep_dataframe","title":"prep_dataframe()
","text":"Prep the DataFrame for Feature Store Creation
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Convert object and string types to Categorical\n if self.auto_one_hot:\n self.auto_convert_columns_to_categorical()\n else:\n self.manual_categorical_converter()\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.process_column_name","title":"process_column_name(column, shorten=False)
","text":"Call various methods to make sure the column is ready for Feature Store Args: column (str): The column name to process shorten (bool): Should we shorten the column name? (default: False)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.set_input","title":"set_input(input_df, target_column=None, id_column=None, event_time_column=None)
","text":"Set the Input DataFrame for this Transform Args: input_df (pd.DataFrame): The input DataFrame target_column (str): The name of the target column (default: None) id_column (str): The name of the id column (default: None) event_time_column (str): The name of the event_time column (default: None)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def set_input(self, input_df: pd.DataFrame, target_column=None, id_column=None, event_time_column=None):\n \"\"\"Set the Input DataFrame for this Transform\n Args:\n input_df (pd.DataFrame): The input DataFrame\n target_column (str): The name of the target column (default: None)\n id_column (str): The name of the id column (default: None)\n event_time_column (str): The name of the event_time column (default: None)\n \"\"\"\n self.target_column = target_column\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.transform_impl","title":"transform_impl()
","text":"Transform Implementation: Ingest the data into the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(\"Ingesting rows into Feature Group...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_processes=8, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows to be ingested: {self.expected_rows}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.wait_for_rows","title":"wait_for_rows(expected_rows)
","text":"Wait for AWS Feature Group to fully populate the Offline Storage
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n not_all_rows_retry = 5\n while rows < expected_rows and not_all_rows_retry > 0:\n sleep_time = 5 if rows else 60\n not_all_rows_retry -= 1 if rows else 0\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n self.log.warning(\n f\"Did not reach expected rows ({rows}/{expected_rows}) but we're not sweating the small stuff...\"\n )\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked","title":"PandasToFeaturesChunked
","text":" Bases: Transform
PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet
Common Usage:\nto_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\ncat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\nto_features.set_categorical_info(cat_column_info)\nto_features.add_chunk(df)\nto_features.add_chunk(df)\n...\nto_features.finalize()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
class PandasToFeaturesChunked(Transform):\n \"\"\"PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet\n\n Common Usage:\n ```\n to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n cat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\n to_features.set_categorical_info(cat_column_info)\n to_features.add_chunk(df)\n to_features.add_chunk(df)\n ...\n to_features.finalize()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid, auto_one_hot=False)\n\n def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n\n def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n\n def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.__init__","title":"__init__(output_uuid, id_column=None, event_time_column=None)
","text":"PandasToFeaturesChunked Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid, auto_one_hot=False)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.add_chunk","title":"add_chunk(chunk_df)
","text":"Add a Chunk of Data to the FeatureSet
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any Post Transform Steps
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group with Chunked Data
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.set_categorical_info","title":"set_categorical_info(cat_column_info)
","text":"Set the Categorical Columns Args: cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.transform_impl","title":"transform_impl()
","text":"Required implementation of the Transform interface
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n
"},{"location":"core_classes/transforms/transform/","title":"Transform","text":"API Classes
The API Classes use Transforms internally; for example, model.to_endpoint() uses the ModelToEndpoint() transform. If you need more control over a Transform, you can use the Core Classes directly.
The SageWorks Transform class is a base/abstract class that defines the API implemented by all the child classes (DataLoaders, DataSourceToFeatureSet, ModelToEndpoint, etc.).
Transform: Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method.
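To make the contract concrete, here is a minimal sketch of a hypothetical Transform subclass (illustration only, not a class that ships with SageWorks); it assumes a completed SageWorks configuration, since Transform's __init__() sets up the AWS sessions and roles. Calling transform() on an instance runs pre_transform(), transform_impl(), and post_transform() in order, as the transform() source further down shows.

from sageworks.core.transforms.transform import Transform, TransformInput, TransformOutput

class HelloTransform(Transform):
    """Hypothetical example: the smallest useful Transform subclass"""

    def __init__(self, input_uuid: str, output_uuid: str):
        super().__init__(input_uuid, output_uuid)
        self.input_type = TransformInput.PANDAS_DF
        self.output_type = TransformOutput.PANDAS_DF

    def transform_impl(self, **kwargs):
        # Required: do the actual Input -> Output work here
        self.log.info(f"Transforming {self.input_uuid} into {self.output_uuid}...")

    def post_transform(self, **kwargs):
        # Required: make sure the output Artifact is ready for use
        self.log.info("Output is ready...")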
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform","title":"Transform
","text":" Bases: ABC
Transform: Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
Source code in src/sageworks/core/transforms/transform.py
class Transform(ABC):\n \"\"\"Transform: Base Class for all transforms within SageWorks. Inherited Classes\n must implement the abstract transform_impl() method\"\"\"\n\n def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.sageworks_execution_role_arn()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n @abstractmethod\n def transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n\n def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n\n @abstractmethod\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n\n def set_output_tags(self, tags: list | str):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (list | str): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n\n def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n\n @staticmethod\n def convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n\n def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n\n @final\n def transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n 
self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n\n def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n\n def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n\n def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n\n def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.__init__","title":"__init__(input_uuid, output_uuid)
","text":"Transform Initialization
Source code in src/sageworks/core/transforms/transform.py
def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.sageworks_execution_role_arn()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.add_output_meta","title":"add_output_meta(meta)
","text":"Add additional metadata that will be associated with the output artifact Args: meta (dict): A dictionary of metadata
Source code in src/sageworks/core/transforms/transform.py
def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.convert_to_aws_tags","title":"convert_to_aws_tags(metadata)
staticmethod
","text":"Convert a dictionary to the AWS tag format (list of dicts) [ {Key: key_name, Value: value}, {..}, ...]
Source code in src/sageworks/core/transforms/transform.py
@staticmethod\ndef convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.get_aws_tags","title":"get_aws_tags()
","text":"Get the metadata/tags and convert them into AWS Tag Format
Source code in src/sageworks/core/transforms/transform.py
def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.input_type","title":"input_type()
","text":"What Input Type does this Transform Consume
Source code in src/sageworks/core/transforms/transform.py
def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.output_type","title":"output_type()
","text":"What Output Type does this Transform Produce
Source code in src/sageworks/core/transforms/transform.py
def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.post_transform","title":"post_transform(**kwargs)
abstractmethod
","text":"Post-Transform ensures that the output Artifact is ready for use
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.pre_transform","title":"pre_transform(**kwargs)
","text":"Perform any Pre-Transform operations
Source code in src/sageworks/core/transforms/transform.py
def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_input_uuid","title":"set_input_uuid(input_uuid)
","text":"Set the Input UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_tags","title":"set_output_tags(tags)
","text":"Set the tags that will be associated with the output object Args: tags (list | str): The list of tags or a '::' separated string of tags
Source code in src/sageworks/core/transforms/transform.py
def set_output_tags(self, tags: list | str):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (list | str): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_uuid","title":"set_output_uuid(output_uuid)
","text":"Set the Output UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform","title":"transform(**kwargs)
","text":"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations
Source code in src/sageworks/core/transforms/transform.py
@final\ndef transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform_impl","title":"transform_impl(**kwargs)
abstractmethod
","text":"Abstract Method: Implement the Transformation from Input to Output
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformInput","title":"TransformInput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Inputs
Source code in src/sageworks/core/transforms/transform.py
class TransformInput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Inputs\"\"\"\n\n LOCAL_FILE = auto()\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformOutput","title":"TransformOutput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Outputs
Source code in src/sageworks/core/transforms/transform.py
class TransformOutput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Outputs\"\"\"\n\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n ENDPOINT = auto()\n
"},{"location":"enterprise/","title":"SageWorks Enterprise","text":"The SageWorks API and User Interfaces cover a broad set of AWS Machine Learning services and provide easy to use abstractions and visualizations of your AWS ML data. We offer a wide range of options to best fit your companies needs.
Accelerate ML Pipeline development with an Enterprise License! Free Enterprise: Lite Enterprise: Standard Enterprise: Pro Python API \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 SageWorks REPL \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 AWS Onboarding \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard Plugins \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Custom Pages \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 Themes \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 ML Pipelines \u2796 \u2796 \u2796 \ud83d\udfe2 Project Branding \u2796 \u2796 \u2796 \ud83d\udfe2 Prioritized Feature Requests \u2796 \u2796 \u2796 \ud83d\udfe2 Pricing \u2796 $1500* $3000* $4000**USD per month, includes AWS setup, support, and training: Everything needed to accelerate your AWS ML Development team. Interested in Data Science/Engineering consulting? We have top notch Consultants with a depth and breadth of AWS ML/DS/Engineering expertise.
"},{"location":"enterprise/#try-sageworks","title":"Try SageWorks","text":"We encourage new users to try out the free version, first. We offer support in our Discord channel and our Documentation has instructions for how to get started with SageWorks. So try it out and when you're ready to accelerate your AWS ML Adventure with an Enterprise licence contact us at SageWorks Sales
"},{"location":"enterprise/#data-engineeringscience-consulting","title":"Data Engineering/Science Consulting","text":"Alongside our SageWorks Enterprise offerings, we provide comprehensive consulting services and domain expertise through our Partnerships. We specialize in AWS Machine Learning Systems and our extended team of Data Scientists and Engineers, have Masters and Ph.D. degrees in Computer Science, Chemistry, and Pharmacology. We also have a parntership with Nomic Networks to support our Network Security Clients.
Using AWS and SageWorks, our experts are equipped to deliver tailored solutions that are focused on your project needs and deliverables. For more information, please touch base and we'll set up a free initial consultation: SageWorks Consulting
"},{"location":"enterprise/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales
"},{"location":"enterprise/private_saas/","title":"Benefits of a Private SaaS Architecture","text":""},{"location":"enterprise/private_saas/#self-hosted-vs-private-saas-vs-public-saas","title":"Self Hosted vs Private SaaS vs Public SaaS?","text":"At the top level your team/project is making a decision about how they are going to build, expand, support, and maintain a machine learning pipeline.
Conceptual ML Pipeline
Data \u2b95 Features \u2b95 Models \u2b95 Deployment (end-user application)\n
Concrete/Real World Example
S3 \u2b95 Glue Job \u2b95 Data Catalog \u2b95 FeatureGroups \u2b95 Models \u2b95 Endpoints \u2b95 App\n
When building out a framework to support ML Pipelines, there are three main options: Self Hosted, Private SaaS, and Public SaaS.
The other choice, which we're not going to cover here, is whether you use AWS, Azure, GCP, or something else. SageWorks is architected and powered by a broad and rich set of AWS ML Pipeline services. We believe that AWS provides the best set of functionality and APIs for flexible, real-world ML architectures.
"},{"location":"enterprise/private_saas/#resources","title":"Resources","text":"See our full presentation on the SageWorks Private SaaS Architecture
"},{"location":"enterprise/project_branding/","title":"Project Branding","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Project Branding allows you to change page headers, titles, and logos to match your project. All user interfaces will reflect your project name and company logos.
"},{"location":"enterprise/project_branding/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"enterprise/themes/","title":"SageWorks Themes","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Themes allows you to customize the User Interfaces to suit your preferences, including completely customized color palettes and fonts. We offer a set of default 'dark' and 'light' themes, but we'll also customize the theme to match your company's color palette and logos.
"},{"location":"enterprise/themes/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"getting_started/","title":"Getting Started","text":"For the initial setup of SageWorks we'll be using the SageWorks REPL. When you start sageworks
it will recognize that it needs to complete the initial configuration and will guide you through that process.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"getting_started/#initial-setupconfig","title":"Initial Setup/Config","text":"Notes: Use the SageWorks REPL to setup your AWS connection for both API Usage (Data Scientists/Engineers) and AWS Initial Setup (AWS Folks). Also if you don't already have an AWS Profile or SSO Setup you'll need to do that first Developer SSO Setup
> pip install sageworks\n> sageworks <-- This starts the REPL\n\nWelcome to SageWorks!\nLooks like this is your first time using SageWorks...\nLet's get you set up...\nAWS_PROFILE: my_aws_profile\nSAGEWORKS_BUCKET: my-company-sageworks\n[optional] REDIS_HOST(localhost): my-redis.cache.amazon (or leave blank)\n[optional] REDIS_PORT(6379):\n[optional] REDIS_PASSWORD():\n[optional] SAGEWORKS_API_KEY(open_source): my_api_key (or leave blank)\n
That's It: You're now all set. This configuration only needs to be ONCE :)"},{"location":"getting_started/#data-scientistsengineers","title":"Data Scientists/Engineers","text":"For companies that are setting up SageWorks on an internal AWS Account: Company AWS Setup
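Once the configuration above is saved, a quick sanity check (a hedged sketch, not part of the official guide) is to create a DataSource from the public abalone CSV that the Glue examples below also use; the name 'abalone_data' is just an example.

from sageworks.api.data_source import DataSource

# Create a DataSource from a public S3 CSV (same file as the Glue example below)
source_path = "s3://sageworks-public-data/common/abalone.csv"
my_data = DataSource(source_path, name="abalone_data")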
"},{"location":"getting_started/#additional-resources","title":"Additional Resources","text":"AWS Glue Simplified
AWS Glue Jobs are a great way to automate ETL and data processing. SageWorks takes all the hassle out of creating and debugging Glue Jobs. Follow this guide and empower your Glue Jobs with SageWorks!
SageWorks makes creating, testing, and debugging AWS Glue Jobs easy. The exact same SageWorks API Classes are used in your Glue Jobs. Also, since SageWorks manages the roles for both the API and Glue Jobs, you'll be able to test new Glue Jobs locally, which minimizes surprises when deploying your Glue Job.
"},{"location":"glue/#glue-job-setup","title":"Glue Job Setup","text":"Setting up a AWS Glue Job that uses SageWorks is straight forward. SageWorks can be 'installed' on AWS Glue via the --additional-python-modules
parameter, and then you can use the SageWorks API just like normal.
Here are the settings and a screenshot to guide you. There are several ways to set up and run Glue Jobs, using either the SageWorks-ExecutionRole or the SageWorksAPIPolicy. Please feel free to contact SageWorks support if you need any help with setting up Glue Jobs.
Glue IAM Role Details
If your Glue Jobs already use an existing IAM Role, then you can add the SageWorksAPIPolicy
to that Role to enable the Glue Job to perform SageWorks API Tasks.
Anyone familiar with a typical Glue Job should be pleasantly surprised by how simple the example below is. Also, SageWorks allows you to test Glue Jobs locally using the same code that you use for scripts and Notebooks (see Glue Testing)
Glue Job Arguments
AWS Glue Jobs take arguments in the form of Job Parameters (see screenshot above). There's a SageWorks utility function glue_args_to_dict
that turns these Job Parameters into a nice dictionary for ease of use.
import sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import glue_args_to_dict\n\n# Convert Glue Job Args to a Dictionary\nglue_args = glue_args_to_dict(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"--sageworks-bucket\"])\n\n# Create a new Data Source from an S3 Path\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\nmy_data = DataSource(source_path, name=\"abalone_glue_test\")\n
"},{"location":"glue/#glue-example-2","title":"Glue Example 2","text":"This example takes two 'Job Parameters'
The example will convert all CSV files in an S3 bucket/prefix and load them up as DataSources in SageWorks.
examples/glue_load_s3_bucket.pyimport sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import glue_args_to_dict, list_s3_files\n\n# Convert Glue Job Args to a Dictionary\nglue_args = glue_args_to_dict(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"--sageworks-bucket\"])\n\n# List all the CSV files in the given S3 Path\ninput_s3_path = glue_args[\"--input-s3-path\"]\nfor input_file in list_s3_files(input_s3_path):\n\n # Note: If we don't specify a name, one will be 'auto-generated'\n my_data = DataSource(input_file, name=None)\n
"},{"location":"glue/#glue-job-local-testing","title":"Glue Job Local Testing","text":"Glue Power without the Pain. SageWorks manages the AWS Execution Role, so local API and Glue Jobs will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Glue Jobs a breeze.
export SAGEWORKS_CONFIG=<your config> # Only if not already set up\npython my_glue_job.py --sageworks-bucket <your bucket>\n
"},{"location":"glue/#additional-resources","title":"Additional Resources","text":"SageWorks Lambda Layers
AWS Lambda Jobs are a great way to spin up data processing jobs. Follow this guide and empower AWS Lambda with SageWorks!
SageWorks makes creating, testing, and debugging AWS Lambda Functions easy. The exact same SageWorks API Classes are used in your AWS Lambda Functions. Also, since SageWorks manages the access policies, you'll be able to test new Lambda Jobs locally, which minimizes surprises when deploying.
Work In Progress
The SageWorks Lambda Layers are a great way to use SageWorks, but they are still in 'beta' mode, so please let us know if you have any issues.
"},{"location":"lambda_layer/#lambda-job-setup","title":"Lambda Job Setup","text":"Setting up a AWS Lambda Job that uses SageWorks is straight forward. SageWorks can be 'installed' using a Lambda Layer and then you can use the Sageworks API just like normal.
Here are the ARNs for the current SageWorks Lambda Layers. Please note they are specified with region and Python version in the name, so if your Lambda is in us-east-1 with Python 3.12, pick the ARN with those values in it.
us-east-1
us-west-2
Note: If you're using lambdas on a different region or with a different Python version, just let us know and we'll publish some additional layers.
At the bottom of the Lambda page there's an 'Add Layer' button. You can click that button and specify the layer using the ARN above. Also, in the 'General Configuration', set these parameters:
Set the SAGEWORKS_BUCKET ENV: SageWorks will need to know what bucket to work out of, so go into Configuration...Environment Variables... and add one for the SageWorks bucket that you are using for your AWS Account (dev, prod, etc).
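If you prefer to script this step instead of using the console, here's a minimal sketch using boto3; the function name and bucket are placeholders, and the layer ARN should be one of the ARNs listed above. Note that update_function_configuration replaces the function's current layer list.
import boto3\n\n# Minimal sketch (not part of the SageWorks API): attach the SageWorks Lambda Layer and set\n# the SAGEWORKS_BUCKET environment variable. The function name and bucket are placeholders.\nlambda_client = boto3.client(\"lambda\")\nlambda_client.update_function_configuration(\n    FunctionName=\"my-sageworks-lambda\",\n    Layers=[\"<layer-arn-from-the-list-above>\"],\n    Environment={\"Variables\": {\"SAGEWORKS_BUCKET\": \"my-sageworks-bucket\"}},\n)\n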
Lambda Role Details
If your Lambda Function already uses an existing IAM Role, then you can add the SageWorks policies to that Role to enable the Lambda Job to perform SageWorks API Tasks. See SageWorks Access Controls
"},{"location":"lambda_layer/#sageworks-lambda-example","title":"SageWorks Lambda Example","text":"Here's a simple example of using SageWorks in your Lambda Function.
examples/lambda_hello_world.pyimport json\nfrom sageworks.utils.lambda_utils import load_lambda_layer\n\n# Load the SageWorks Lambda Layer\nload_lambda_layer()\n\n# Now we can use the normal SageWorks imports\nfrom sageworks.api import Meta, Model \n\ndef lambda_handler(event, context):\n\n # Create our Meta Class and get a list of our Models\n meta = Meta()\n models = meta.models()\n\n print(f\"Number of Models: {len(models)}\")\n print(models)\n\n # Get more details data on the Endpoints\n models_groups = meta.models_deep()\n for name, model_versions in models_groups.items():\n print(name)\n\n # Onboard a model\n model = Model(\"abalone-regression\")\n model.onboard()\n\n # Return success\n return {\n 'statusCode': 200,\n 'body': { \"incoming_event\": event}\n }\n
"},{"location":"lambda_layer/#lambda-function-local-testing","title":"Lambda Function Local Testing","text":"Lambda Power without the Pain. SageWorks manages the AWS Execution Role/Policies, so local API and Lambda Functions will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Lambda Functions a breeze.
python my_lambda_function.py --sageworks-bucket <your bucket>\n
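If you want the local run above to actually exercise the handler, one option is to add a small __main__ block at the bottom of the file. Here's a minimal, illustrative sketch; the test event contents are placeholders.
# Illustrative local-test harness at the bottom of my_lambda_function.py\nif __name__ == \"__main__\":\n    # Fake event/context for a quick local smoke test of the handler\n    test_event = {\"source\": \"local-test\"}\n    result = lambda_handler(test_event, None)\n    print(result)\n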
"},{"location":"lambda_layer/#additional-resources","title":"Additional Resources","text":"Using SageWorks for ML Pipelines: SageWorks API Classes
Consulting Available: SuperCowPowers LLC
Artifact and Column Naming?
You might have noticed that SageWorks has some unintuitive constraints when naming Artifacts and restrictions on column names. All of these restrictions come from AWS. SageWorks uses Glue, Athena, Feature Store, Models, and Endpoints; each of these services has its own constraints, and SageWorks simply 'reflects' those constraints.
"},{"location":"misc/faq/#naming-underscores-dashes-and-lower-case","title":"Naming: Underscores, Dashes, and Lower Case","text":"Data Sources and Feature Sets must adhere to AWS restrictions on table names and columns names (here is a snippet from the AWS documentation)
Database, table, and column names
When you create schema in AWS Glue to query in Athena, consider the following:
A database name cannot be longer than 255 characters. A table name cannot be longer than 255 characters. A column name cannot be longer than 255 characters.
The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more info see: Glue Best Practices
"},{"location":"misc/faq/#datasourcefeatureset-use-_-and-modelendpoint-use-","title":"DataSource/FeatureSet use '_' and Model/Endpoint use '-'","text":"You may notice that DataSource and FeatureSet uuid/name examples have underscores but the model and endpoints have dashes. Yes, it\u2019s super annoying to have one convention for DataSources and FeatureSets and another for Models and Endpoints but this is an AWS restriction and not something that SageWorks can control.
DataSources and FeatureSets: Underscores. You cannot use a dash because both classes use Athena for storage, and Athena table names cannot contain a dash.
Models and Endpoints: Dashes. You cannot use an underscore because AWS imposes a restriction on the naming.
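To make the two conventions concrete, here's a small illustrative snippet using the API classes documented on this site; the artifact names are just examples.
from sageworks.api.data_source import DataSource\nfrom sageworks.api.model import Model\n\n# DataSources and FeatureSets: lowercase with underscores (Athena table naming rules)\nmy_data = DataSource(\"s3://sageworks-public-data/common/abalone.csv\", name=\"abalone_data\")\n\n# Models and Endpoints: lowercase with dashes (SageMaker naming rules)\nmy_model = Model(\"abalone-regression\")\n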
"},{"location":"misc/faq/#additional-information-on-the-lower-case-issue","title":"Additional information on the lower case issue","text":"We\u2019ve tried to create a glue table with Mixed Case column names and haven\u2019t had any luck. We\u2019ve bypassed wrangler and used the boto3 low level calls directly. In all cases when it shows up in the Glue Table the columns have always been converted to lower case. We've also tried uses the Athena DDL directly, that also doesn't work. Here's the relevant AWS documentation and the two scripts that reproduce the issue.
AWS Docs
Scripts to Reproduce
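For reference, here's a minimal sketch of the kind of low-level boto3 call described above; it is not one of the linked scripts, and the database, bucket, and column names are placeholders.
import boto3\n\nglue = boto3.client(\"glue\")\n\n# Try to create a Glue table with a Mixed Case column name (SerDe/format settings omitted for brevity)\nglue.create_table(\n    DatabaseName=\"my_test_database\",\n    TableInput={\n        \"Name\": \"mixed_case_test\",\n        \"TableType\": \"EXTERNAL_TABLE\",\n        \"StorageDescriptor\": {\n            \"Columns\": [{\"Name\": \"MixedCase\", \"Type\": \"string\"}],\n            \"Location\": \"s3://my-test-bucket/mixed_case_test/\",\n        },\n    },\n)\n\n# Per the note above, the column shows up lower-cased (e.g. 'mixedcase') when the table is inspected\ncolumns = glue.get_table(DatabaseName=\"my_test_database\", Name=\"mixed_case_test\")[\"Table\"][\"StorageDescriptor\"][\"Columns\"]\nprint(columns)\n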
SageWorks is a medium granularity framework that manages and aggregates AWS\u00ae Services into classes and concepts. When you use SageWorks you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and managing a complex set of AWS Services. All the power and none of the pain so that your team can Do Science Faster!
"},{"location":"misc/general_info/#sageworks-documentation","title":"SageWorks Documentation","text":"See our Python API and AWS documentation here: SageWorks Documentation
"},{"location":"misc/general_info/#full-sageworks-overview","title":"Full SageWorks OverView","text":"SageWorks Architected FrameWork
"},{"location":"misc/general_info/#why-sageworks","title":"Why SageWorks?","text":"Visibility into the AWS Services that underpin the SageWorks Classes. We can see that SageWorks automatically tags and tracks the inputs of all artifacts providing 'data provenance' for all steps in the AWS modeling pipeline.
Image TBD
Clearly illustrated: SageWorks provides intuitive and transparent visibility into the full pipeline of your AWS SageMaker Deployments.
"},{"location":"misc/general_info/#getting-started","title":"Getting Started","text":"The SageWorks Classes are organized to work in concert with AWS Services. For more details on the current classes and class hierarchies see SageWorks Classes and Concepts.
"},{"location":"misc/general_info/#contributions","title":"Contributions","text":"If you'd like to contribute to the SageWorks project, you're more than welcome. All contributions will fall under the existing project license. If you are interested in contributing or have questions please feel free to contact us at sageworks@supercowpowers.com.
"},{"location":"misc/general_info/#sageworks-alpha-testers-wanted","title":"SageWorks Alpha Testers Wanted","text":"Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.
"},{"location":"misc/sageworks_classes_concepts/","title":"SageWorks Classes and Concepts","text":"A flexible, rapid, and customizable AWS\u00ae ML Sandbox. Here's some of the classes and concepts we use in the SageWorks system:
Endpoint
Transforms
Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
"},{"location":"misc/scp_consulting/#typical-engagements","title":"Typical Engagements","text":"SageWorks clients typically want a tailored web_interface that helps to drive business decisions and provides value for their organization.
Rapid Prototyping is typically done via these steps.
Quick Construction of Web Interface (tailored)
Go to Step 1
When the client is happy/excited about the prototype, we then bolt down the system, test the heavy paths, review AWS access and security, and ensure 'least privilege' roles and policies.
Contact us for a free initial consultation on how we can accelerate the use of AWS ML at your company sageworks@supercowpowers.com.
"},{"location":"plugins/","title":"OverView","text":"SageWorks Plugins
The SageWorks toolkit provides a flexible plugin architecture to expand, enhance, or even replace the Dashboard. Make custom UI components, views, and entire pages with the plugin classes described here.
The SageWorks Plugin system allows clients to customize how their AWS Machine Learning Pipeline is displayed, analyzed, and visualized. Our easy to use Python API enables developers to make new Dash/Plotly components, data views, and entirely new web pages focused on business use cases.
"},{"location":"plugins/#concept-docs","title":"Concept Docs","text":"Many classes in SageWorks need additional high-level material that covers class design and illustrates class usage. Here's the Concept Docs for Plugins:
Each plugin class inherits from the SageWorks PluginInterface class and needs to set two attributes and implement two methods. These requirements are set so that each Plugin will conform to the SageWorks infrastructure; if the required attributes and methods aren\u2019t included in the class definition, errors will be raised during tests and at runtime.
from sageworks.web_components.plugin_interface import PluginInterface, PluginPage, PluginInputType\n\nclass MyPlugin(PluginInterface):\n \"\"\"My Awesome Component\"\"\"\n\n # Initialize the required attributes\n plugin_page = PluginPage.MODEL\n plugin_input_type = PluginInputType.MODEL\n\n # Implement the two methods\n def create_component(self, component_id: str) -> ComponentTypes:\n < Function logic which creates a Dash Component >\n return dcc.Graph(id=component_id, figure=self.waiting_figure())\n\n def update_content(self, data_object: SageworksObject) -> ContentTypes:\n < Function logic which creates a figure (go.Figure) >\n return figure\n
"},{"location":"plugins/#required-attributes","title":"Required Attributes","text":"The class variable plugin_page determines what type of plugin the MyPlugin class is. This variable is inspected during plugin loading at runtime in order to load the plugin to the correct artifact page in the Sageworks dashboard. The PluginPage class can be DATA_SOURCE, FEATURE_SET, MODEL, or ENDPOINT.
"},{"location":"plugins/#s3-bucket-plugins-work-in-progress","title":"S3 Bucket Plugins (Work in Progress)","text":"Note: This functionality is coming soon
Offers the most flexibility and fast prototyping. Simply set your config/env for blah to an S3 Path, and SageWorks will load the plugins from S3 directly.
Helpful Tip
You can copy files from your local system up to S3 with this handy AWS CLI call
aws s3 cp . s3://my-sageworks/sageworks_plugins \\\n --recursive --exclude \"*\" --include \"*.py\"\n
"},{"location":"plugins/#additional-resources","title":"Additional Resources","text":"Need help with plugins? Want to develop a customized application tailored to your business needs?
There were quite a few API changes for Plugins between 0.4.43
and 0.5.0
versions of SageWorks.
General: Classes that inherit from component_interface
or plugin_interface
are now 'auto wrapped' with an exception container. This container not only catches errors/crashes so they don't crash the application but it also displays the error in the widget.
Specific Changes:
generate_component_figure
method is now update_contents
message_figure
method is now display_text
PluginType
was changed to PluginPage
(use CUSTOM to NOT autoload)PluginInputType.MODEL_DETAILS
changed to PluginInputType.MODEL
(since you're now getting a model object)FigureTypes
is now ContentTypes
The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"presentations/#sageworks-presentations_1","title":"SageWorks Presentations","text":"The SageWorks API documentation SageWorks API covers our in-depth Python API and contains code examples. The code examples are provided in the Github repo examples/
directory. For a full code listing of any example please visit our SageWorks Examples
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates
"},{"location":"repl/","title":"SageWorks REPL","text":"Visibility and Control
The SageWorks REPL provides AWS ML Pipeline visibility just like the SageWorks Dashboard but also provides control over the creation, modification, and deletion of artifacts through the Python API.
The SageWorks REPL is a customized iPython shell. It provides tailored functionality for easy interaction with SageWorks objects, and since it's based on iPython, developers will feel right at home using autocomplete, history, help, etc. Both easy and powerful, the SageWorks REPL puts control of AWS ML Pipelines at your fingertips.
"},{"location":"repl/#installation","title":"Installation","text":"pip install sageworks
Just type sageworks
at the command line and the SageWorks shell will spin up and provide a command view of your AWS Machine Learning Pipelines.
At startup, the SageWorks shell will connect to your AWS Account and create a summary of the Machine Learning artifacts currently residing on the account.
Available Commands:
All of the API Classes are auto-loaded, so drilling down on an individual artifact is easy. The same Python API is provided so if you want additional info on a model, for instance, simply create a model object and use any of the documented API methods.
m = Model(\"abalone-regression\")\nm.details()\n<shows info about the model>\n
"},{"location":"repl/#additional-resources","title":"Additional Resources","text":"The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap. It also dramatically improves both the usability and visibility across the entire spectrum of services: Glue Jobs, Athena, Feature Store, Models, and Endpoints. SageWorks makes it easy to build production ready, AWS powered, machine learning pipelines.
SageWorks Dashboard: AWS Pipelines in a Whole New Light!"},{"location":"#full-aws-overview","title":"Full AWS OverView","text":"Secure your Data, Empower your ML Pipelines
SageWorks is architected as a Private SaaS. This hybrid architecture is the ultimate solution for businesses that prioritize data control and security. SageWorks deploys as an AWS Stack within your own cloud environment, ensuring compliance with stringent corporate and regulatory standards. It offers the flexibility to tailor solutions to your specific business needs through our comprehensive plugin support, both components and full web interfaces. By using SageWorks, you maintain absolute control over your data while benefiting from the power, security, and scalability of AWS cloud services. SageWorks Private SaaS Architecture
"},{"location":"#dashboard-and-api","title":"Dashboard and API","text":"The SageWorks package has two main components, a Web Interface that provides visibility into AWS ML PIpelines and a Python API that makes creation and usage of the AWS ML Services easier than using/learning the services directly.
"},{"location":"#web-interfaces","title":"Web Interfaces","text":"The SageWorks Dashboard has a set of web interfaces that give visibility into the AWS Glue and SageMaker Services. There are currently 5 web interfaces available:
SageWorks API Documentation: SageWorks API Classes
The main functionality of the Python API is to encapsulate and manage a set of AWS services underneath a Python Object interface. The Python Classes are used to create and interact with Machine Learning Pipeline Artifacts.
"},{"location":"#getting-started","title":"Getting Started","text":"SageWorks will need some initial setup when you first start using it. See our Getting Started guide on how to connect SageWorks to your AWS Account.
"},{"location":"#additional-resources","title":"Additional Resources","text":"Notes and information on how to do the Docker Builds and Push to AWS ECR.
"},{"location":"admin/base_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"vi Dockerfile\n\n# Install latest Sageworks\nRUN pip install --no-cache-dir 'sageworks[ml-tool,chem]'==0.7.0\n
"},{"location":"admin/base_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Dockers 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_base:v0_7_0_amd64 --platform linux/amd64 .\n
"},{"location":"admin/base_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"You have a docker_local_base
alias in your ~/.zshrc
:)
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/base_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
"},{"location":"admin/base_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
"},{"location":"admin/base_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:stable\n
"},{"location":"admin/base_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_base
alias in your ~/.zshrc
:)
Notes and information on how to do the Dashboard Docker Builds and Push to AWS ECR.
"},{"location":"admin/dashboard_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"cd applications/aws_dashboard\nvi Dockerfile\n\n# Install Sageworks (changes often)\nRUN pip install --no-cache-dir sageworks==0.4.13 <-- change this\n
"},{"location":"admin/dashboard_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Dockers 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_dashboard:v0_4_13_amd64 --platform linux/amd64 .\n
Docker with Custom Plugins: If you're using custom plugins you may want to change the SAGEWORKS_PLUGINS directory to something like /app/sageworks_plugins
and then have the Dockerfile copy your plugins into that directory in the Docker image.
You have a docker_local_dashboard
alias in your ~/.zshrc
:)
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/dashboard_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
"},{"location":"admin/dashboard_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
"},{"location":"admin/dashboard_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_5_4_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
"},{"location":"admin/dashboard_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_dashboard
alias in your ~/.zshrc
:)
Notes and information on how to do the PyPI release for the SageWorks project. For full details on packaging, you can reference this page: Packaging
The following instructions should work, but things change :)
"},{"location":"admin/pypi_release/#package-requirements","title":"Package Requirements","text":"The easiest thing to do is setup a \\~/.pypirc file with the following contents
[distutils]\nindex-servers =\n pypi\n testpypi\n\n[pypi]\nusername = __token__\npassword = pypi-AgEIcH...\n\n[testpypi]\nusername = __token__\npassword = pypi-AgENdG...\n
"},{"location":"admin/pypi_release/#tox-background","title":"Tox Background","text":"Tox will install the SageMaker Sandbox package into a blank virtualenv and then execute all the tests against the newly installed package. So if everything goes okay, you know the pypi package installed fine and the tests (which puls from the installed sageworks
package) also ran okay.
$ cd sageworks\n$ tox \n
If ALL the test above pass...
"},{"location":"admin/pypi_release/#clean-any-previous-distribution-files","title":"Clean any previous distribution files","text":"make clean\n
"},{"location":"admin/pypi_release/#tag-the-new-version","title":"Tag the New Version","text":"git tag v0.1.8 (or whatever)\ngit push --tags\n
"},{"location":"admin/pypi_release/#create-the-test-pypi-release","title":"Create the TEST PyPI Release","text":"python -m build\ntwine upload dist/* -r testpypi\n
"},{"location":"admin/pypi_release/#install-the-test-pypi-release","title":"Install the TEST PyPI Release","text":"pip install --index-url https://test.pypi.org/simple sageworks\n
"},{"location":"admin/pypi_release/#create-the-real-pypi-release","title":"Create the REAL PyPI Release","text":"twine upload dist/* -r pypi\n
"},{"location":"admin/pypi_release/#push-any-possible-changes-to-github","title":"Push any possible changes to Github","text":"git push\n
"},{"location":"admin/sageworks_docker_for_lambdas/","title":"SageWorks Docker Image for Lambdas","text":"Using the SageWorks Docker Image for AWS Lambda Jobs allows your Lambda Jobs to use and create AWS ML Pipeline Artifacts with SageWorks.
AWS, for some reason, does not allow Public ECRs to be used for Lambda Docker images. So you'll have to copy the Docker image into your private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#creating-a-private-ecr","title":"Creating a Private ECR","text":"You only need to do this if you don't already have a private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console-to-create-private-ecr","title":"AWS Console to create Private ECR","text":"sageworks_base
.Create the ECR repository using the AWS CLI:
aws ecr create-repository --repository-name sageworks_base --region <region>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#pulling-docker-image-into-private-ecr","title":"Pulling Docker Image into Private ECR","text":"Note: You'll only need to do this when you want to update the SageWorks Docker image
Pull the SageWorks Public ECR Image
docker pull public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
Tag the image for your private ECR
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:latest \\\n<your-account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:latest\n
Push the image to your private ECR
aws ecr get-login-password --region <region> --profile <profile> | \\\ndocker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com\n\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#using-the-docker-image-for-your-lambdas","title":"Using the Docker Image for your Lambdas","text":"Okay, now that you have the SageWorks Docker image in your private ECR, here's how you use that image for your Lambda jobs.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console","title":"AWS Console","text":"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>
.Create the Lambda function using the AWS CLI:
aws lambda create-function \\\n --region <region> \\\n --function-name myLambdaFunction \\\n --package-type Image \\\n --code ImageUri=<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag> \\\n --role arn:aws:iam::<account-id>:role/<execution-role>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#python-cdk","title":"Python CDK","text":"Define the Lambda function in your CDK app:
from aws_cdk import (\n    aws_lambda as _lambda,\n    aws_iam as iam,\n    core\n)\n\nclass MyLambdaStack(core.Stack):\n    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:\n        super().__init__(scope, id, **kwargs)\n\n        _lambda.Function(self, \"MyLambdaFunction\",\n            code=_lambda.Code.from_ecr_image(\"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\"),\n            handler=_lambda.Handler.FROM_IMAGE,\n            runtime=_lambda.Runtime.FROM_IMAGE,\n            role=iam.Role.from_role_arn(self, \"LambdaRole\", \"arn:aws:iam::<account-id>:role/<execution-role>\"))\n\napp = core.App()\nMyLambdaStack(app, \"MyLambdaStack\")\napp.synth()\n
"},{"location":"admin/sageworks_docker_for_lambdas/#cloudformation","title":"Cloudformation","text":"Define the Lambda function in your CloudFormation template.
Resources:\n MyLambdaFunction:\n Type: AWS::Lambda::Function\n Properties:\n Code:\n ImageUri: <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n Role: arn:aws:iam::<account-id>:role/<execution-role>\n PackageType: Image\n
"},{"location":"api_classes/data_source/","title":"DataSource","text":"DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. S3 data, local files, and Pandas dataframes, DataSource can read data from many different sources.
DataSource: Manages AWS Data Catalog creation and management. DataSources are set up so that can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.) DataSources can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource","title":"DataSource
","text":" Bases: AthenaSource
DataSource: SageWorks DataSource API Class
Common Usagemy_data = DataSource(name_of_source)\nmy_data.details()\nmy_features = my_data.to_features()\n
Source code in src/sageworks/api/data_source.py
class DataSource(AthenaSource):\n \"\"\"DataSource: SageWorks DataSource API Class\n\n Common Usage:\n ```\n my_data = DataSource(name_of_source)\n my_data.details()\n my_features = my_data.to_features()\n ```\n \"\"\"\n\n def __init__(self, source, name: str = None, tags: list = None):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (str): The source of the data. This can be an S3 bucket, file path,\n DataFrame object, or an existing DataSource object.\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Make sure we have a name for when we use a DataFrame source\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Ensure the ds_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the model_name wasn't given generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name)\n\n def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().get_table_name()\n query = f\"SELECT * FROM {table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_features(\n self,\n name: str = None,\n tags: list = None,\n target_column: str = None,\n id_column: str = None,\n event_time_column: str = None,\n auto_one_hot: bool = False,\n ) -> FeatureSet:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase). If not specified, a name will be generated\n tags (list): Set the tags for the feature set. If not specified tags will be generated.\n target_column (str): Set the target column for the feature set. (Optional)\n id_column (str): Set the id column for the feature set. If not specified will be generated.\n event_time_column (str): Set the event time for the feature set. 
If not specified will be generated.\n auto_one_hot (bool): Automatically one-hot encode categorical fields (default: False)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the feature_set_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_data\", \"\") + \"_features\"\n name = Artifact.generate_valid_name(name)\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n target_column=target_column,\n id_column=id_column,\n event_time_column=event_time_column,\n auto_one_hot=auto_one_hot,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n\n def _load_source(self, source: str, name: str, tags: list):\n \"\"\"Load the source of the data\"\"\"\n self.log.info(f\"Loading source: {source}...\")\n\n # Pandas DataFrame Source\n if isinstance(source, pd.DataFrame):\n my_loader = PandasToData(name)\n my_loader.set_input(source)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # S3 Source\n source = source if isinstance(source, str) else str(source)\n if source.startswith(\"s3://\"):\n my_loader = S3ToDataSourceLight(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # File Source\n elif os.path.isfile(source):\n my_loader = CSVToDataSource(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.__init__","title":"__init__(source, name=None, tags=None)
","text":"Initializes a new DataSource object.
Parameters:
Name Type Description Defaultsource
str
The source of the data. This can be an S3 bucket, file path, DataFrame object, or an existing DataSource object.
requiredname
str
The name of the data source (must be lowercase). If not specified, a name will be generated
None
tags
list[str]
A list of tags associated with the data source. If not specified tags will be generated.
None
Source code in src/sageworks/api/data_source.py
def __init__(self, source, name: str = None, tags: list = None):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (str): The source of the data. This can be an S3 bucket, file path,\n DataFrame object, or an existing DataSource object.\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Make sure we have a name for when we use a DataFrame source\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Ensure the ds_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the model_name wasn't given generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.details","title":"details(**kwargs)
","text":"DataSource Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the DataSource
Source code insrc/sageworks/api/data_source.py
def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this DataSource
Parameters:
Name Type Description Defaultinclude_aws_columns
bool
Include the AWS columns in the DataFrame (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of ALL the data from this DataSource
NoteObviously this is not recommended for large datasets :)
Source code insrc/sageworks/api/data_source.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().get_table_name()\n query = f\"SELECT * FROM {table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the DataSource
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/api/data_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.to_features","title":"to_features(name=None, tags=None, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False)
","text":"Convert the DataSource to a FeatureSet
Parameters:
Name Type Description Defaultname
str
Set the name for feature set (must be lowercase). If not specified, a name will be generated
None
tags
list
Set the tags for the feature set. If not specified tags will be generated.
None
target_column
str
Set the target column for the feature set. (Optional)
None
id_column
str
Set the id column for the feature set. If not specified will be generated.
None
event_time_column
str
Set the event time for the feature set. If not specified will be generated.
None
auto_one_hot
bool
Automatically one-hot encode categorical fields (default: False)
False
Returns:
Name Type DescriptionFeatureSet
FeatureSet
The FeatureSet created from the DataSource
Source code insrc/sageworks/api/data_source.py
def to_features(\n self,\n name: str = None,\n tags: list = None,\n target_column: str = None,\n id_column: str = None,\n event_time_column: str = None,\n auto_one_hot: bool = False,\n) -> FeatureSet:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase). If not specified, a name will be generated\n tags (list): Set the tags for the feature set. If not specified tags will be generated.\n target_column (str): Set the target column for the feature set. (Optional)\n id_column (str): Set the id column for the feature set. If not specified will be generated.\n event_time_column (str): Set the event time for the feature set. If not specified will be generated.\n auto_one_hot (bool): Automatically one-hot encode categorical fields (default: False)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if name:\n Artifact.ensure_valid_name(name)\n\n # If the feature_set_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_data\", \"\") + \"_features\"\n name = Artifact.generate_valid_name(name)\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n target_column=target_column,\n id_column=id_column,\n event_time_column=event_time_column,\n auto_one_hot=auto_one_hot,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n
"},{"location":"api_classes/data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a DataSource from an S3 Path or File Path
datasource_from_s3.pyfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from an S3 Path (or a local file)\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\n# source_path = \"/full/path/to/local/file.csv\"\n\nmy_data = DataSource(source_path)\nprint(my_data.details())\n
Create a DataSource from a Pandas Dataframe
datasource_from_df.pyfrom sageworks.utils.test_data_generator import TestDataGenerator\nfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from a Pandas DataFrame\ngen_data = TestDataGenerator()\ndf = gen_data.person_data()\n\ntest_data = DataSource(df, name=\"test_data\")\nprint(test_data.details())\n
Query a DataSource
All SageWorks DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
datasource_query.pyfrom sageworks.api.data_source import DataSource\n\n# Grab a DataSource\nmy_data = DataSource(\"abalone_data\")\n\n# Make some queries using the Athena backend\ndf = my_data.query(\"select * from abalone_data where height > .3\")\nprint(df.head())\n\ndf = my_data.query(\"select * from abalone_data where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a FeatureSet from a DataSource
datasource_to_featureset.pyfrom sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features()\nprint(my_features.details())\n
"},{"location":"api_classes/data_source/#sageworks-ui","title":"SageWorks UI","text":"Whenever a DataSource is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: DataSourcesNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/endpoint/","title":"Endpoint","text":"Endpoint Examples
Examples of using the Endpoint class are listed at the bottom of this page Examples.
Endpoint: Manages AWS Endpoint creation and deployment. Endpoints are automatically set up and provisioned for deployment into AWS. Endpoints can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint","title":"Endpoint
","text":" Bases: EndpointCore
Endpoint: SageWorks Endpoint API Class
Common Usagemy_endpoint = Endpoint(name)\nmy_endpoint.details()\nmy_endpoint.inference(eval_df)\n
Source code in src/sageworks/api/endpoint.py
class Endpoint(EndpointCore):\n \"\"\"Endpoint: SageWorks Endpoint API Class\n\n Common Usage:\n ```\n my_endpoint = Endpoint(name)\n my_endpoint.details()\n my_endpoint.inference(eval_df)\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the Endpoint using the FeatureSet evaluation data
Parameters:
Name Type Description Defaultcapture
bool
Capture the inference results
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: The DataFrame with predictions
Source code insrc/sageworks/api/endpoint.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.details","title":"details(**kwargs)
","text":"Endpoint Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Endpoint
Source code insrc/sageworks/api/endpoint.py
def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
Name Type Description Defaulteval_df
DataFrame
The DataFrame to run predictions on
requiredcapture_uuid
str
The UUID of the capture to use (default: None)
None
id_column
str
The name of the column to use as the ID (default: None)
None
Returns:
Type DescriptionDataFrame
pd.DataFrame: The DataFrame with predictions
Source code insrc/sageworks/api/endpoint.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n
"},{"location":"api_classes/endpoint/#examples","title":"Examples","text":"Run Inference on an Endpoint
endpoint_inference.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model\nfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks has full ML Pipeline provenance, so we can backtrack the inputs,\n# get a DataFrame of data (not used for training) and run inference\nmodel = Model(endpoint.get_input())\nfs = FeatureSet(model.get_input())\nathena_table = fs.get_training_view_table()\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = 0\")\n\n# Run inference/predictions on the Endpoint\nresults_df = endpoint.inference(df)\n\n# Run inference/predictions and capture the results\nresults_df = endpoint.inference(df, capture=True)\n\n# Run inference/predictions using the FeatureSet evaluation data\nresults_df = endpoint.auto_inference(capture=True)\n
Output
Processing...\n class_number_of_rings prediction\n0 13 11.477922\n1 12 12.316887\n2 8 7.612847\n3 8 9.663341\n4 9 9.075263\n.. ... ...\n839 8 8.069856\n840 15 14.915502\n841 11 10.977605\n842 10 10.173433\n843 7 7.297976\n
Endpoint Details The details() method
The detail()
method on the Endpoint class provides a lot of useful information. All of the SageWorks classes have a details()
method try it out!
from sageworks.api.endpoint import Endpoint\nfrom pprint import pprint\n\n# Get Endpoint and print out it's details\nendpoint = Endpoint(\"abalone-regression-end\")\npprint(endpoint.details())\n
Output
{\n 'input': 'abalone-regression',\n 'instance': 'Serverless (2GB/5)',\n 'model_metrics': metric_name value\n 0 RMSE 2.190\n 1 MAE 1.544\n 2 R2 0.504,\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'modified': datetime.datetime(2023, 12, 29, 17, 48, 35, 115000, tzinfo=datetime.timezone.utc),\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n 'sageworks_tags': ['abalone', 'regression'],\n 'status': 'InService',\n 'uuid': 'abalone-regression-end',\n 'variant': 'AllTraffic'}\n
Endpoint Metrics
endpoint_metrics.pyfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks tracks both Model performance and Endpoint Metrics\nmodel_metrics = endpoint.details()[\"model_metrics\"]\nendpoint_metrics = endpoint.endpoint_metrics()\nprint(model_metrics)\nprint(endpoint_metrics)\n
Output
metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n Invocations ModelLatency OverheadLatency ModelSetupTime Invocation5XXErrors\n29 0.0 0.00 0.00 0.00 0.0\n30 1.0 1.11 23.73 23.34 0.0\n31 0.0 0.00 0.00 0.00 0.0\n48 0.0 0.00 0.00 0.00 0.0\n49 5.0 0.45 9.64 23.57 0.0\n50 2.0 0.57 0.08 0.00 0.0\n51 0.0 0.00 0.00 0.00 0.0\n60 4.0 0.33 5.80 22.65 0.0\n61 1.0 1.11 23.35 23.10 0.0\n62 0.0 0.00 0.00 0.00 0.0\n...\n
"},{"location":"api_classes/endpoint/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint. The Endpoint artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI. SageWorks will monitor the endpoint, plot invocations, latencies, and tracks error metrics.
SageWorks Dashboard: EndpointsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/feature_set/","title":"FeatureSet","text":"FeatureSet Examples
Examples of using the FeatureSet Class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually but the SageWorks FeatureSet makes it a breeze!
FeatureSet: Manages AWS Feature Store/Group creation and management. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.) FeatureSets can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet","title":"FeatureSet
","text":" Bases: FeatureSetCore
FeatureSet: SageWorks FeatureSet API Class
Common Usagemy_features = FeatureSet(name)\nmy_features.details()\nmy_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n)\n
Source code in src/sageworks/api/feature_set.py
class FeatureSet(FeatureSetCore):\n \"\"\"FeatureSet: SageWorks FeatureSet API Class\n\n Common Usage:\n ```\n my_features = FeatureSet(name)\n my_features.details()\n my_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n )\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n ) -> Model:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.details","title":"details(**kwargs)
","text":"FeatureSet Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the FeatureSet
Source code insrc/sageworks/api/feature_set.py
def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this FeatureSet
Parameters:
Name Type Description Defaultinclude_aws_columns
bool
Include the AWS columns in the DataFrame (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of ALL the data from this FeatureSet
NoteObviously this is not recommended for large datasets :)
Source code insrc/sageworks/api/feature_set.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.query","title":"query(query, **kwargs)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the FeatureSet
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/api/feature_set.py
def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.to_model","title":"to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)
","text":"Create a Model from the FeatureSet
Args:
model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\nmodel_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\nname (str): Set the name for the model. If not specified, a name will be generated\ntags (list): Set the tags for the model. If not specified tags will be generated.\ndescription (str): Set the description for the model. If not specified a description is generated.\nfeature_list (list): Set the feature list for the model. If not specified a feature list is generated.\ntarget_column (str): The target column for the model (use None for unsupervised model)\n
Returns:
Name Type DescriptionModel
Model
The Model created from the FeatureSet
Source code insrc/sageworks/api/feature_set.py
def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n) -> Model:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a FeatureSet from a Datasource
datasource_to_featureset.pyfrom sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features()\nprint(my_features.details())\n
FeatureSet EDA Statistics
featureset_eda.py
from sageworks.api.feature_set import FeatureSet\nimport pandas as pd\nfrom pprint import pprint\n\n# Grab a FeatureSet and pull some of the EDA Stats\nmy_features = FeatureSet('test_features')\n\n# Grab some of the EDA Stats\ncorr_data = my_features.correlations()\ncorr_df = pd.DataFrame(corr_data)\nprint(corr_df)\n\n# Get some outliers\noutliers = my_features.outliers()\npprint(outliers.head())\n\n# Full set of EDA Stats\neda_stats = my_features.column_stats()\npprint(eda_stats)\n
Output age food_pizza food_steak food_sushi food_tacos height id iq_score\nage NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513\nfood_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378\nfood_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477\nfood_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033\nfood_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364\nheight 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210\nid -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020\niq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN\n\n name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group\n0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low\n1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low\n2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low\n3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high\n\n<lots of EDA data and statistics>\n
Query a FeatureSet
All SageWorks FeatureSets have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.
featureset_query.pyfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Make some queries using the Athena backend\ndf = my_features.query(\"select * from abalone_features where height > .3\")\nprint(df.head())\n\ndf = my_features.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a Model from a FeatureSet
featureset_to_model.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet('test_features')\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR, \n# UNSUPERVISED, or TRANSFORMER\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
"},{"location":"api_classes/feature_set/#sageworks-ui","title":"SageWorks UI","text":"Whenever a FeatureSet is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: FeatureSetsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/meta/","title":"Meta","text":"Meta Examples
Examples of using the Meta class are listed at the bottom of this page Examples.
Meta: A class that provides high-level information and summaries of SageWorks/AWS Artifacts. The Meta class provides 'meta' information such as which AWS account you are in and the current SageWorks configuration. It also provides metadata for AWS Artifacts, such as Data Sources, Feature Sets, Models, and Endpoints.
Refresh
Setting refresh
to True
will lead to substantial performance issues, so don't do it :).
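For reference, the refresh argument on the Meta methods defaults to False, so the cached AWS metadata is used. Here is a minimal sketch (assuming a configured SageWorks AWS account) contrasting the fast cached call with an explicit, slower refresh:
from sageworks.api.meta import Meta\n\nmeta = Meta()\n\n# Default: uses the cached AWS metadata (fast)\nmodels = meta.models()\n\n# Forces a fresh pull of metadata from AWS (slow, avoid doing this in a loop)\nfresh_models = meta.models(refresh=True)\n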
Meta
","text":"Meta: A class that provides Metadata for a broad set of AWS Artifacts
Common Usage:
meta = Meta()\nmeta.account()\nmeta.config()\nmeta.data_sources()\n
Source code in src/sageworks/api/meta.py
class Meta:\n \"\"\"Meta: A class that provides Metadata for a broad set of AWS Artifacts\n\n Common Usage:\n ```\n meta = Meta()\n meta.account()\n meta.config()\n meta.data_sources()\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Meta Initialization\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Account and Service Brokers\n self.aws_account_clamp = AWSAccountClamp()\n self.aws_broker = AWSServiceBroker()\n self.cm = ConfigManager()\n\n # Pipeline Manager\n self.pipeline_manager = PipelineManager()\n\n def account(self) -> dict:\n \"\"\"Print out the AWS Account Info\n\n Returns:\n dict: The AWS Account Info\n \"\"\"\n return self.aws_account_clamp.get_aws_account_info()\n\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return self.cm.get_all_config()\n\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming-data S3 Bucket\n\n Returns:\n pd.DataFrame: A summary of the data in the incoming-data S3 Bucket\n \"\"\"\n data = self.incoming_data_deep()\n data_summary = []\n for name, info in data.items():\n # Get the name and the size of the S3 Storage Object(s)\n name = \"/\".join(name.split(\"/\")[-2:]).replace(\"incoming-data/\", \"\")\n info[\"Name\"] = name\n size = info.get(\"ContentLength\") / 1_000_000\n summary = {\n \"Name\": name,\n \"Size(MB)\": f\"{size:.2f}\",\n \"Modified\": datetime_string(info.get(\"LastModified\", \"-\")),\n \"ContentType\": str(info.get(\"ContentType\", \"-\")),\n \"ServerSideEncryption\": info.get(\"ServerSideEncryption\", \"-\"),\n \"Tags\": str(info.get(\"tags\", \"-\")),\n \"_aws_url\": aws_url(info, \"S3\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def incoming_data_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Incoming Data in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Incoming Data in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.INCOMING_DATA_S3, force_refresh=refresh)\n\n def glue_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about AWS Glue Jobs\"\"\"\n glue_meta = self.glue_jobs_deep()\n glue_summary = []\n\n # Get the information about each Glue Job\n for name, info in glue_meta.items():\n summary = {\n \"Name\": info[\"Name\"],\n \"GlueVersion\": info[\"GlueVersion\"],\n \"Workers\": info.get(\"NumberOfWorkers\", \"-\"),\n \"WorkerType\": info.get(\"WorkerType\", \"-\"),\n \"Modified\": datetime_string(info.get(\"LastModifiedOn\")),\n \"LastRun\": datetime_string(info[\"sageworks_meta\"][\"last_run\"]),\n \"Status\": info[\"sageworks_meta\"][\"status\"],\n \"_aws_url\": aws_url(info, \"GlueJob\", self.aws_account_clamp), # Hidden Column\n }\n glue_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(glue_summary)\n\n def glue_jobs_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Glue Jobs in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n dict: A summary of the Glue Jobs in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.GLUE_JOBS, force_refresh=refresh)\n\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources in AWS\n\n Returns:\n pd.DataFrame: A summary of the Data Sources in AWS\n \"\"\"\n data = self.data_sources_deep()\n data_summary = []\n\n # Pull in various bits of metadata for each data source\n for name, info in data.items():\n summary = {\n \"Name\": name,\n \"Modified\": datetime_string(info.get(\"UpdateTime\")),\n \"Num Columns\": num_columns_ds(info),\n \"Tags\": info.get(\"Parameters\", {}).get(\"sageworks_tags\", \"-\"),\n \"Input\": str(\n info.get(\"Parameters\", {}).get(\"sageworks_input\", \"-\"),\n ),\n \"_aws_url\": aws_url(info, \"DataSource\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def data_source_details(\n self, data_source_name: str, database: str = \"sageworks\", refresh: bool = False\n ) -> Union[dict, None]:\n \"\"\"Get detailed information about a specific data source in AWS\n\n Args:\n data_source_name (str): The name of the data source\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: Detailed information about the data source (or None if not found)\n \"\"\"\n data = self.data_sources_deep(database=database, refresh=refresh)\n return data.get(data_source_name)\n\n def data_sources_deep(self, database: str = \"sageworks\", refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Data Sources in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Data Sources in AWS\n \"\"\"\n data = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=refresh)\n\n # Data Sources are in two databases, 'sageworks' and 'sagemaker_featurestore'\n data = data[database]\n\n # Return the data\n return data\n\n def feature_sets(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets in AWS\n \"\"\"\n data = self.feature_sets_deep(refresh)\n data_summary = []\n\n # Pull in various bits of metadata for each feature set\n for name, group_info in data.items():\n sageworks_meta = group_info.get(\"sageworks_meta\", {})\n summary = {\n \"Feature Group\": group_info[\"FeatureGroupName\"],\n \"Created\": datetime_string(group_info.get(\"CreationTime\")),\n \"Num Columns\": num_columns_fs(group_info),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Online\": str(group_info.get(\"OnlineStoreConfig\", {}).get(\"EnableOnlineStore\", \"False\")),\n \"_aws_url\": aws_url(group_info, \"FeatureSet\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def feature_set_details(self, feature_set_name: str) -> dict:\n \"\"\"Get detailed information about a specific feature set in AWS\n\n Args:\n feature_set_name (str): The name of the feature set\n\n Returns:\n dict: Detailed information about the feature set\n \"\"\"\n data = self.feature_sets_deep()\n return data.get(feature_set_name, {})\n\n def feature_sets_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Feature Sets in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=refresh)\n\n def models(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models in AWS\n \"\"\"\n model_data = self.models_deep(refresh)\n model_summary = []\n for model_group_name, model_list in model_data.items():\n\n # Get Summary information for the 'latest' model in the model_list\n latest_model = model_list[0]\n sageworks_meta = latest_model.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the model is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Model Group\": latest_model[\"ModelPackageGroupName\"],\n \"Health\": health_tags,\n \"Owner\": sageworks_meta.get(\"sageworks_owner\", \"-\"),\n \"Model Type\": sageworks_meta.get(\"sageworks_model_type\"),\n \"Created\": datetime_string(latest_model.get(\"CreationTime\")),\n \"Ver\": latest_model[\"ModelPackageVersion\"],\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": latest_model[\"ModelPackageStatus\"],\n \"Description\": latest_model.get(\"ModelPackageDescription\", \"-\"),\n }\n model_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(model_summary)\n\n def model_details(self, model_group_name: str) -> dict:\n \"\"\"Get detailed information about a specific model group in AWS\n\n Args:\n model_group_name (str): The name of the model group\n\n Returns:\n dict: Detailed information about the model group\n \"\"\"\n data = self.models_deep()\n return data.get(model_group_name, {})\n\n def models_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Models in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=refresh)\n\n def endpoints(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in AWS\n \"\"\"\n data = self.endpoints_deep(refresh)\n data_summary = []\n\n # Get Summary information for each endpoint\n for endpoint, endpoint_info in data.items():\n # Get the SageWorks metadata for this Endpoint\n sageworks_meta = endpoint_info.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the endpoint is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Name\": endpoint_info[\"EndpointName\"],\n \"Health\": health_tags,\n \"Instance\": endpoint_info.get(\"InstanceType\", \"-\"),\n \"Created\": datetime_string(endpoint_info.get(\"CreationTime\")),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": endpoint_info[\"EndpointStatus\"],\n \"Variant\": endpoint_info.get(\"ProductionVariants\", [{}])[0].get(\"VariantName\", \"-\"),\n \"Capture\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"EnableCapture\", \"False\")),\n \"Samp(%)\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"CurrentSamplingPercentage\", \"-\")),\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n\n def endpoints_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Endpoints in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=refresh)\n\n def pipelines(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the SageWorks Pipelines\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the SageWorks Pipelines\n \"\"\"\n data = self.pipeline_manager.list_pipelines()\n\n # Return the pipelines summary as a DataFrame\n return pd.DataFrame(data)\n\n def _remove_sageworks_meta(self, data: dict) -> dict:\n \"\"\"Internal: Recursively remove any keys with 'sageworks_' in them\"\"\"\n\n # Recursively exclude any keys with 'sageworks_' in them\n summary_data = {}\n for key, value in data.items():\n if isinstance(value, dict):\n summary_data[key] = self._remove_sageworks_meta(value)\n elif not key.startswith(\"sageworks_\"):\n summary_data[key] = value\n return summary_data\n\n def refresh_all_aws_meta(self) -> None:\n \"\"\"Force a refresh of all the metadata\"\"\"\n self.aws_broker.get_all_metadata(force_refresh=True)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.__init__","title":"__init__()
","text":"Meta Initialization
Source code insrc/sageworks/api/meta.py
def __init__(self):\n \"\"\"Meta Initialization\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Account and Service Brokers\n self.aws_account_clamp = AWSAccountClamp()\n self.aws_broker = AWSServiceBroker()\n self.cm = ConfigManager()\n\n # Pipeline Manager\n self.pipeline_manager = PipelineManager()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.account","title":"account()
","text":"Print out the AWS Account Info
Returns:
Name Type Descriptiondict
dict
The AWS Account Info
Source code insrc/sageworks/api/meta.py
def account(self) -> dict:\n \"\"\"Print out the AWS Account Info\n\n Returns:\n dict: The AWS Account Info\n \"\"\"\n return self.aws_account_clamp.get_aws_account_info()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
Name Type Descriptiondict
dict
The current SageWorks Configuration
Source code insrc/sageworks/api/meta.py
def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return self.cm.get_all_config()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_source_details","title":"data_source_details(data_source_name, database='sageworks', refresh=False)
","text":"Get detailed information about a specific data source in AWS
Parameters:
Name Type Description Defaultdata_source_name
str
The name of the data source
requireddatabase
str
Glue database. Defaults to 'sageworks'.
'sageworks'
refresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
Union[dict, None]
Detailed information about the data source (or None if not found)
Source code insrc/sageworks/api/meta.py
def data_source_details(\n self, data_source_name: str, database: str = \"sageworks\", refresh: bool = False\n) -> Union[dict, None]:\n \"\"\"Get detailed information about a specific data source in AWS\n\n Args:\n data_source_name (str): The name of the data source\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: Detailed information about the data source (or None if not found)\n \"\"\"\n data = self.data_sources_deep(database=database, refresh=refresh)\n return data.get(data_source_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources in AWS
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Data Sources in AWS
Source code insrc/sageworks/api/meta.py
def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources in AWS\n\n Returns:\n pd.DataFrame: A summary of the Data Sources in AWS\n \"\"\"\n data = self.data_sources_deep()\n data_summary = []\n\n # Pull in various bits of metadata for each data source\n for name, info in data.items():\n summary = {\n \"Name\": name,\n \"Modified\": datetime_string(info.get(\"UpdateTime\")),\n \"Num Columns\": num_columns_ds(info),\n \"Tags\": info.get(\"Parameters\", {}).get(\"sageworks_tags\", \"-\"),\n \"Input\": str(\n info.get(\"Parameters\", {}).get(\"sageworks_input\", \"-\"),\n ),\n \"_aws_url\": aws_url(info, \"DataSource\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources_deep","title":"data_sources_deep(database='sageworks', refresh=False)
","text":"Get a deeper set of data for the Data Sources in AWS
Parameters:
Name Type Description Defaultdatabase
str
Glue database. Defaults to 'sageworks'.
'sageworks'
refresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Data Sources in AWS
Source code insrc/sageworks/api/meta.py
def data_sources_deep(self, database: str = \"sageworks\", refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Data Sources in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Data Sources in AWS\n \"\"\"\n data = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=refresh)\n\n # Data Sources are in two databases, 'sageworks' and 'sagemaker_featurestore'\n data = data[database]\n\n # Return the data\n return data\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints","title":"endpoints(refresh=False)
","text":"Get a summary of the Endpoints in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Endpoints in AWS
Source code insrc/sageworks/api/meta.py
def endpoints(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in AWS\n \"\"\"\n data = self.endpoints_deep(refresh)\n data_summary = []\n\n # Get Summary information for each endpoint\n for endpoint, endpoint_info in data.items():\n # Get the SageWorks metadata for this Endpoint\n sageworks_meta = endpoint_info.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the endpoint is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Name\": endpoint_info[\"EndpointName\"],\n \"Health\": health_tags,\n \"Instance\": endpoint_info.get(\"InstanceType\", \"-\"),\n \"Created\": datetime_string(endpoint_info.get(\"CreationTime\")),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": endpoint_info[\"EndpointStatus\"],\n \"Variant\": endpoint_info.get(\"ProductionVariants\", [{}])[0].get(\"VariantName\", \"-\"),\n \"Capture\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"EnableCapture\", \"False\")),\n \"Samp(%)\": str(endpoint_info.get(\"DataCaptureConfig\", {}).get(\"CurrentSamplingPercentage\", \"-\")),\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints_deep","title":"endpoints_deep(refresh=False)
","text":"Get a deeper set of data for Endpoints in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Endpoints in AWS
Source code insrc/sageworks/api/meta.py
def endpoints_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Endpoints in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Endpoints in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_set_details","title":"feature_set_details(feature_set_name)
","text":"Get detailed information about a specific feature set in AWS
Parameters:
Name Type Description Defaultfeature_set_name
str
The name of the feature set
requiredReturns:
Name Type Descriptiondict
dict
Detailed information about the feature set
Source code insrc/sageworks/api/meta.py
def feature_set_details(self, feature_set_name: str) -> dict:\n \"\"\"Get detailed information about a specific feature set in AWS\n\n Args:\n feature_set_name (str): The name of the feature set\n\n Returns:\n dict: Detailed information about the feature set\n \"\"\"\n data = self.feature_sets_deep()\n return data.get(feature_set_name, {})\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets","title":"feature_sets(refresh=False)
","text":"Get a summary of the Feature Sets in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Feature Sets in AWS
Source code insrc/sageworks/api/meta.py
def feature_sets(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets in AWS\n \"\"\"\n data = self.feature_sets_deep(refresh)\n data_summary = []\n\n # Pull in various bits of metadata for each feature set\n for name, group_info in data.items():\n sageworks_meta = group_info.get(\"sageworks_meta\", {})\n summary = {\n \"Feature Group\": group_info[\"FeatureGroupName\"],\n \"Created\": datetime_string(group_info.get(\"CreationTime\")),\n \"Num Columns\": num_columns_fs(group_info),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Online\": str(group_info.get(\"OnlineStoreConfig\", {}).get(\"EnableOnlineStore\", \"False\")),\n \"_aws_url\": aws_url(group_info, \"FeatureSet\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets_deep","title":"feature_sets_deep(refresh=False)
","text":"Get a deeper set of data for the Feature Sets in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Feature Sets in AWS
Source code insrc/sageworks/api/meta.py
def feature_sets_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Feature Sets in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Feature Sets in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_jobs","title":"glue_jobs()
","text":"Get summary data about AWS Glue Jobs
Source code insrc/sageworks/api/meta.py
def glue_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about AWS Glue Jobs\"\"\"\n glue_meta = self.glue_jobs_deep()\n glue_summary = []\n\n # Get the information about each Glue Job\n for name, info in glue_meta.items():\n summary = {\n \"Name\": info[\"Name\"],\n \"GlueVersion\": info[\"GlueVersion\"],\n \"Workers\": info.get(\"NumberOfWorkers\", \"-\"),\n \"WorkerType\": info.get(\"WorkerType\", \"-\"),\n \"Modified\": datetime_string(info.get(\"LastModifiedOn\")),\n \"LastRun\": datetime_string(info[\"sageworks_meta\"][\"last_run\"]),\n \"Status\": info[\"sageworks_meta\"][\"status\"],\n \"_aws_url\": aws_url(info, \"GlueJob\", self.aws_account_clamp), # Hidden Column\n }\n glue_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(glue_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_jobs_deep","title":"glue_jobs_deep(refresh=False)
","text":"Get a deeper set of data for the Glue Jobs in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Glue Jobs in AWS
Source code insrc/sageworks/api/meta.py
def glue_jobs_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Glue Jobs in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Glue Jobs in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.GLUE_JOBS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming-data S3 Bucket
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the data in the incoming-data S3 Bucket
Source code insrc/sageworks/api/meta.py
def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming-data S3 Bucket\n\n Returns:\n pd.DataFrame: A summary of the data in the incoming-data S3 Bucket\n \"\"\"\n data = self.incoming_data_deep()\n data_summary = []\n for name, info in data.items():\n # Get the name and the size of the S3 Storage Object(s)\n name = \"/\".join(name.split(\"/\")[-2:]).replace(\"incoming-data/\", \"\")\n info[\"Name\"] = name\n size = info.get(\"ContentLength\") / 1_000_000\n summary = {\n \"Name\": name,\n \"Size(MB)\": f\"{size:.2f}\",\n \"Modified\": datetime_string(info.get(\"LastModified\", \"-\")),\n \"ContentType\": str(info.get(\"ContentType\", \"-\")),\n \"ServerSideEncryption\": info.get(\"ServerSideEncryption\", \"-\"),\n \"Tags\": str(info.get(\"tags\", \"-\")),\n \"_aws_url\": aws_url(info, \"S3\", self.aws_account_clamp), # Hidden Column\n }\n data_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(data_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data_deep","title":"incoming_data_deep(refresh=False)
","text":"Get a deeper set of data for the Incoming Data in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Incoming Data in AWS
Source code insrc/sageworks/api/meta.py
def incoming_data_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for the Incoming Data in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Incoming Data in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.INCOMING_DATA_S3, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.model_details","title":"model_details(model_group_name)
","text":"Get detailed information about a specific model group in AWS
Parameters:
Name Type Description Defaultmodel_group_name
str
The name of the model group
requiredReturns:
Name Type Descriptiondict
dict
Detailed information about the model group
Source code insrc/sageworks/api/meta.py
def model_details(self, model_group_name: str) -> dict:\n \"\"\"Get detailed information about a specific model group in AWS\n\n Args:\n model_group_name (str): The name of the model group\n\n Returns:\n dict: Detailed information about the model group\n \"\"\"\n data = self.models_deep()\n return data.get(model_group_name, {})\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models","title":"models(refresh=False)
","text":"Get a summary of the Models in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Models in AWS
Source code insrc/sageworks/api/meta.py
def models(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models in AWS\n \"\"\"\n model_data = self.models_deep(refresh)\n model_summary = []\n for model_group_name, model_list in model_data.items():\n\n # Get Summary information for the 'latest' model in the model_list\n latest_model = model_list[0]\n sageworks_meta = latest_model.get(\"sageworks_meta\", {})\n\n # If the sageworks_health_tags have nothing in them, then the model is healthy\n health_tags = sageworks_meta.get(\"sageworks_health_tags\", \"-\")\n health_tags = health_tags if health_tags else \"healthy\"\n summary = {\n \"Model Group\": latest_model[\"ModelPackageGroupName\"],\n \"Health\": health_tags,\n \"Owner\": sageworks_meta.get(\"sageworks_owner\", \"-\"),\n \"Model Type\": sageworks_meta.get(\"sageworks_model_type\"),\n \"Created\": datetime_string(latest_model.get(\"CreationTime\")),\n \"Ver\": latest_model[\"ModelPackageVersion\"],\n \"Tags\": sageworks_meta.get(\"sageworks_tags\", \"-\"),\n \"Input\": sageworks_meta.get(\"sageworks_input\", \"-\"),\n \"Status\": latest_model[\"ModelPackageStatus\"],\n \"Description\": latest_model.get(\"ModelPackageDescription\", \"-\"),\n }\n model_summary.append(summary)\n\n # Return the summary\n return pd.DataFrame(model_summary)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models_deep","title":"models_deep(refresh=False)
","text":"Get a deeper set of data for Models in AWS
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Name Type Descriptiondict
dict
A summary of the Models in AWS
Source code insrc/sageworks/api/meta.py
def models_deep(self, refresh: bool = False) -> dict:\n \"\"\"Get a deeper set of data for Models in AWS\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n dict: A summary of the Models in AWS\n \"\"\"\n return self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=refresh)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.pipelines","title":"pipelines(refresh=False)
","text":"Get a summary of the SageWorks Pipelines
Parameters:
Name Type Description Defaultrefresh
bool
Force a refresh of the metadata. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the SageWorks Pipelines
Source code insrc/sageworks/api/meta.py
def pipelines(self, refresh: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the SageWorks Pipelines\n\n Args:\n refresh (bool, optional): Force a refresh of the metadata. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the SageWorks Pipelines\n \"\"\"\n data = self.pipeline_manager.list_pipelines()\n\n # Return the pipelines summary as a DataFrame\n return pd.DataFrame(data)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.refresh_all_aws_meta","title":"refresh_all_aws_meta()
","text":"Force a refresh of all the metadata
Source code insrc/sageworks/api/meta.py
def refresh_all_aws_meta(self) -> None:\n \"\"\"Force a refresh of all the metadata\"\"\"\n self.aws_broker.get_all_metadata(force_refresh=True)\n
"},{"location":"api_classes/meta/#examples","title":"Examples","text":"These example show how to use the Meta()
class to pull lists of artifacts from AWS. DataSources, FeatureSets, Models, Endpoints and more. If you're building a web interface plugin, the Meta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the Meta()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> meta = Meta()\n[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> model_info = meta.models()\n[\u25cf\u25cf\u25cf]SageWorks:scp_sandbox> model_info\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
meta_list_models.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodels = meta.models()\n\nprint(f\"Number of Models: {len(models)}\")\nprint(models)\n\n# Get more detailed data on the Models (model groups)\nmodels_groups = meta.models_deep()\nfor name, model_versions in models_groups.items():\n    print(name)\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
Getting Model Performance Metrics
meta_models.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class to get metadata about our Models\nmeta = Meta()\nmodel_info = meta.models_deep()\n\n# Print out the summary of our Models\nfor name, info in model_info.items():\n print(f\"{name}\")\n latest = info[0] # We get a list of models, so we only want the latest\n print(f\"\\tARN: {latest['ModelPackageGroupArn']}\")\n print(f\"\\tDescription: {latest['ModelPackageDescription']}\")\n print(f\"\\tTags: {latest['sageworks_meta']['sageworks_tags']}\")\n performance_metrics = latest[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n print(f\"\\tPerformance Metrics:\")\n print(f\"\\t\\t{performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
List the Endpoints in AWS
meta_list_endpoints.pyfrom sageworks.api.meta import Meta\n\n# Create our Meta Class and get a list of our Endpoints\nmeta = Meta()\nendpoints = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoints)}\")\nprint(endpoints)\n\n# Get more detailed data on the Endpoints\nendpoints_deep = meta.endpoints_deep()\nfor name, info in endpoints_deep.items():\n    print(name)\n    print(info.keys())\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\ndict_keys(['EndpointName', 'EndpointArn', 'EndpointConfigName', 'ProductionVariants', 'EndpointStatus', 'CreationTime', 'LastModifiedTime', 'ResponseMetadata', 'InstanceType', 'sageworks_meta'])\nabalone-regression-end\ndict_keys(['EndpointName', 'EndpointArn', 'EndpointConfigName', 'ProductionVariants', 'EndpointStatus', 'CreationTime', 'LastModifiedTime', 'ResponseMetadata', 'InstanceType', 'sageworks_meta'])\n
Not Finding some particular AWS Data?
The SageWorks Meta API Class also has _details()
methods, so make sure to check those out.
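For example, here is a minimal sketch (assuming the artifacts from the listings above exist in your account) of pulling detailed metadata for a single artifact with those _details() methods:
from sageworks.api.meta import Meta\n\nmeta = Meta()\n\n# Detailed metadata for a single model group (returns a dict)\nmodel_info = meta.model_details(\"abalone-regression\")\nprint(model_info.keys())\n\n# Detailed metadata for a single feature set\nfs_info = meta.feature_set_details(\"abalone_features\")\nprint(fs_info.keys())\n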
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually, but the SageWorks Model Class makes it a breeze!
Model: Manages AWS Model Package/Group creation and management.
Models are automatically set up and provisioned for deployment into AWS. Models can be viewed in the AWS SageMaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/model/#sageworks.api.model.Model","title":"Model
","text":" Bases: ModelCore
Model: SageWorks Model API Class.
Common Usagemy_features = Model(name)\nmy_features.details()\nmy_features.to_endpoint()\n
Source code in src/sageworks/api/model.py
class Model(ModelCore):\n \"\"\"Model: SageWorks Model API Class.\n\n Common Usage:\n ```\n my_features = Model(name)\n my_features.details()\n my_features.to_endpoint()\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n\n def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.details","title":"details(**kwargs)
","text":"Retrieve the Model Details.
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Model
Source code insrc/sageworks/api/model.py
def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.to_endpoint","title":"to_endpoint(name=None, tags=None, serverless=True)
","text":"Create an Endpoint from the Model.
Parameters:
Name Type Description Defaultname
str
Set the name for the endpoint. If not specified, an automatic name will be generated
None
tags
list
Set the tags for the endpoint. If not specified automatic tags will be generated.
None
serverless
bool
Set the endpoint to be serverless (default: True)
True
Returns:
Name Type DescriptionEndpoint
Endpoint
The Endpoint created from the Model
Source code insrc/sageworks/api/model.py
def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.ensure_valid_name(name, delimiter=\"-\")\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a Model from a FeatureSet
featureset_to_model.pyfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"test_features\")\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR (XGBoost is default)\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
Use a specific Scikit-Learn Model
featureset_to_knn.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Transform FeatureSet into KNN Regression Model\n# Note: model_class can be any scikit-learn model\n# \"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", etc\nmy_model = my_features.to_model(\n    model_class=\"KNeighborsRegressor\",\n    target_column=\"class_number_of_rings\",\n    name=\"abalone-knn-reg\",\n    description=\"Abalone KNN Regression\",\n    tags=[\"abalone\", \"knn\"],\n    train_all_data=True,\n)\npprint(my_model.details())\n
Another Scikit-Learn Example featureset_to_rfc.pyfrom sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"wine_features\")\n\n# Using a Scikit-Learn Model\n# Note: model_class can be any scikit-learn model (\"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", \"Ridge\", \"Lasso\", \"SVC\", \"SVR\", etc...)\nmy_model = my_features.to_model(\n    model_class=\"RandomForestClassifier\",\n    target_column=\"wine_class\",\n    name=\"wine-rfc-class\",\n    description=\"Wine RandomForest Classification\",\n    tags=[\"wine\", \"rfc\"]\n)\npprint(my_model.details())\n
Create an Endpoint from a Model
Endpoint Costs
Serverless endpoints are a great option; they incur no AWS charges when not running. A realtime endpoint has lower latency (no cold start), but AWS charges an hourly fee, which can add up quickly!
model_to_endpoint.pyfrom sageworks.api.model import Model\n\n# Grab the abalone regression Model\nmodel = Model(\"abalone-regression\")\n\n# By default, an Endpoint is serverless, you can\n# make a realtime endpoint with serverless=False\nmodel.to_endpoint(name=\"abalone-regression-end\",\n tags=[\"abalone\", \"regression\"],\n serverless=True)\n
Model Health Check and Metrics
model_metrics.pyfrom sageworks.api.model import Model\n\n# Grab the abalone-regression Model\nmodel = Model(\"abalone-regression\")\n\n# Perform a health check on the model\n# Note: The health_check() method returns 'issues' if there are any\n# problems, so if there are no issues, the model is healthy\nhealth_issues = model.health_check()\nif not health_issues:\n print(\"Model is Healthy\")\nelse:\n print(\"Model has issues\")\n print(health_issues)\n\n# Get the model metrics and regression predictions\nprint(model.model_metrics())\nprint(model.regression_predictions())\n
Output
Model is Healthy\n metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n
"},{"location":"api_classes/model/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates an AWS Model Package Group and an AWS Model Package. These model artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
SageWorks Dashboard: ModelsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/monitor/","title":"Monitor","text":"Monitor Examples
Examples of using the Monitor class are listed at the bottom of this page Examples.
Monitor: Manages AWS Endpoint Monitor creation and deployment. Endpoint Monitors are set up and provisioned for deployment into AWS. Monitors can be viewed in the AWS SageMaker interfaces or in the SageWorks Dashboard UI, which provides additional monitor details and performance metrics.
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor","title":"Monitor
","text":" Bases: MonitorCore
Monitor: SageWorks Monitor API Class
Common Usagemon = Endpoint(name).get_monitor() # Pull from endpoint OR\nmon = Monitor(name) # Create using Endpoint Name\nmon.summary()\nmon.details()\n\n# One time setup methods\nmon.add_data_capture()\nmon.create_baseline()\nmon.create_monitoring_schedule()\n\n# Pull information from the monitor\nbaseline_df = mon.get_baseline()\nconstraints_df = mon.get_constraints()\nstats_df = mon.get_statistics()\ninput_df, output_df = mon.get_latest_data_capture()\n
Source code in src/sageworks/api/monitor.py
class Monitor(MonitorCore):\n \"\"\"Monitor: SageWorks Monitor API Class\n\n Common Usage:\n ```\n mon = Endpoint(name).get_monitor() # Pull from endpoint OR\n mon = Monitor(name) # Create using Endpoint Name\n mon.summary()\n mon.details()\n\n # One time setup methods\n mon.add_data_capture()\n mon.create_baseline()\n mon.create_monitoring_schedule()\n\n # Pull information from the monitor\n baseline_df = mon.get_baseline()\n constraints_df = mon.get_constraints()\n stats_df = mon.get_statistics()\n input_df, output_df = mon.get_latest_data_capture()\n ```\n \"\"\"\n\n def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n\n def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for this Monitor/endpoint.
Parameters:
Name Type Description Defaultcapture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/api/monitor.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring
Parameters:
Name Type Description Defaultrecreate
bool
If True, recreate the baseline even if it already exists
False
Notes This will create/write three files to the baseline_dir: - baseline.csv - constraints.json - statistics.json
Source code insrc/sageworks/api/monitor.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint.
Parameters:
Name Type Description Defaultschedule
str
The schedule for the monitoring job (hourly or daily, defaults to hourly).
'hourly'
recreate
bool
If True, recreate the monitoring schedule even if it already exists.
False
Source code in src/sageworks/api/monitor.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.details","title":"details()
","text":"Monitor Details
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the Monitor
Source code insrc/sageworks/api/monitor.py
def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture input and output from S3.
Returns:
Name Type DescriptionDataFrame
input), DataFrame(output
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/api/monitor.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code insrc/sageworks/api/monitor.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.summary","title":"summary()
","text":"Monitor Summary
Returns:
Name Type Descriptiondict
dict
A dictionary of summary information about the Monitor
Source code insrc/sageworks/api/monitor.py
def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n
"},{"location":"api_classes/monitor/#examples","title":"Examples","text":"Initial Setup of the Endpoint Monitor
monitor_setup.pyfrom sageworks.api.monitor import Monitor\n\n# Create an Endpoint Monitor Class and perform initial Setup\nendpoint_name = \"abalone-regression-end-rt\"\nmon = Monitor(endpoint_name)\n\n# Add data capture to the endpoint\nmon.add_data_capture(capture_percentage=100)\n\n# Create a baseline for monitoring\nmon.create_baseline()\n\n# Set up the monitoring schedule\nmon.create_monitoring_schedule(schedule=\"hourly\")\n
Pulling Information from an Existing Monitor
monitor_usage.pyfrom sageworks.api.monitor import Monitor\nfrom sageworks.api.endpoint import Endpoint\n\n# Construct a Monitor Class in one of Two Ways\nmon = Endpoint(\"abalone-regression-end-rt\").get_monitor()\nmon = Monitor(\"abalone-regression-end-rt\")\n\n# Check the summary and details of the monitoring class\nmon.summary()\nmon.details()\n\n# Check the baseline outputs (baseline, constraints, statistics)\nbase_df = mon.get_baseline()\nbase_df.head()\n\nconstraints_df = mon.get_constraints()\nconstraints_df.head()\n\nstatistics_df = mon.get_statistics()\nstatistics_df.head()\n\n# Get the latest data capture (inputs and outputs)\ninput_df, output_df = mon.get_latest_data_capture()\ninput_df.head()\noutput_df.head()\n
"},{"location":"api_classes/monitor/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint Monitor. The Monitor status and outputs can be viewed in the Sagemaker Console interfaces or in the SageWorks Dashboard UI. SageWorks will use the monitor to track various metrics including Data Quality, Model Bias, etc...
SageWorks Dashboard: EndpointsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/overview/","title":"Overview","text":"Just Getting Started?
You're in the right place, the SageWorks API Classes are the best way to get started with SageWorks!
"},{"location":"api_classes/overview/#welcome-to-the-sageworks-api-classes","title":"Welcome to the SageWorks API Classes","text":"These classes provide high-level APIs for the SageWorks package, they enable your team to build full AWS Machine Learning Pipelines. They handle all the details around updating and managing a complex set of AWS Services. Each class provides an essential component of the overall ML Pipline. Simply combine the classes to build production ready, AWS powered, machine learning pipelines.
from sageworks.api.data_source import DataSource\nfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model, ModelType\nfrom sageworks.api.endpoint import Endpoint\n\n# Create the abalone_data DataSource\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\")\n\n# Now create a FeatureSet\nds.to_features(\"abalone_features\")\n\n# Create the abalone_regression Model\nfs = FeatureSet(\"abalone_features\")\nfs.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n tags=[\"abalone\", \"regression\"],\n description=\"Abalone Regression Model\",\n)\n\n# Create the abalone_regression Endpoint\nmodel = Model(\"abalone-regression\")\nmodel.to_endpoint(name=\"abalone-regression-end\", tags=[\"abalone\", \"regression\"])\n\n# Now we'll run inference on the endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Get a DataFrame of data (not used to train) and run predictions\nathena_table = fs.get_training_view_table()\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = 0\")\nresults = endpoint.predict(df)\nprint(results[[\"class_number_of_rings\", \"prediction\"]])\n
Output
Processing...\n class_number_of_rings prediction\n0 12 10.477794\n1 11 11.11835\n2 14 13.605763\n3 12 11.744759\n4 17 15.55189\n.. ... ...\n826 7 7.981503\n827 11 11.246113\n828 9 9.592911\n829 6 6.129388\n830 8 7.628252\n
Full AWS ML Pipeline Achievement Unlocked!
Bing! You just built and deployed a full AWS Machine Learning Pipeline. You can now use the SageWorks Dashboard web interface to inspect your AWS artifacts. A comprehensive set of Exploratory Data Analysis techniques and Model Performance Metrics are available for your entire team to review, inspect and interact with.
Examples
All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pipeline Examples
Examples of using the Pipeline classes are listed at the bottom of this page Examples.
Pipelines store sequences of SageWorks transforms. So if you have a nightly ML workflow you can capture that as a Pipeline. Here's an example pipeline:
nightly_sol_pipeline_v1.json{\n    \"data_source\": {\n        \"name\": \"nightly_data\",\n        \"tags\": [\"solubility\", \"foo\"],\n        \"s3_input\": \"s3://blah/blah.csv\"\n    },\n    \"feature_set\": {\n        \"name\": \"nightly_features\",\n        \"tags\": [\"blah\", \"blah\"],\n        \"input\": \"nightly_data\",\n        \"schema\": \"mol_descriptors_v1\"\n    },\n    \"model\": {\n        \"name\": \"nightly_model\",\n        \"tags\": [\"blah\", \"blah\"],\n        \"features\": [\"col1\", \"col2\"],\n        \"target\": \"sol\",\n        \"input\": \"nightly_features\"\n    },\n    \"endpoint\": {\n        ...\n} \n
PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Pipeline: Manages the details around a SageWorks Pipeline, including Execution
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager","title":"PipelineManager
","text":"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Common Usagemy_manager = PipelineManager()\nmy_manager.list_pipelines()\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\nmy_manager.save_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Source code in src/sageworks/api/pipeline_manager.py
class PipelineManager:\n \"\"\"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.\n\n Common Usage:\n ```\n my_manager = PipelineManager()\n my_manager.list_pipelines()\n abalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n my_manager.save_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto_session.client(\"s3\")\n\n def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.warning(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n\n # Create a new Pipeline from an Endpoint\n def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n\n # Publish a Pipeline to SageWorks\n def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n\n def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n 
name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n\n # Save a Pipeline to a local file\n def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n\n def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n\n def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.__init__","title":"__init__()
","text":"Pipeline Init Method
Source code insrc/sageworks/api/pipeline_manager.py
def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto_session.client(\"s3\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.create_from_endpoint","title":"create_from_endpoint(endpoint_name)
","text":"Create a Pipeline from an Endpoint
Parameters:
Name Type Description Defaultendpoint_name
str
The name of the Endpoint
requiredReturns:
Name Type Descriptiondict
dict
A dictionary of the Pipeline
Source code insrc/sageworks/api/pipeline_manager.py
def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.delete_pipeline","title":"delete_pipeline(name)
","text":"Delete a Pipeline from S3
Parameters:
Name Type Description Defaultname
str
The name of the Pipeline to delete
required Source code insrc/sageworks/api/pipeline_manager.py
def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.list_pipelines","title":"list_pipelines()
","text":"List all the Pipelines in the S3 Bucket
Returns:
Name Type Descriptionlist
list
A list of Pipeline names and details
Source code insrc/sageworks/api/pipeline_manager.py
def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.warning(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.load_pipeline_from_file","title":"load_pipeline_from_file(filepath)
","text":"Load a Pipeline from a local file
Parameters:
Name Type Description Defaultfilepath
str
The path of the Pipeline to load
requiredReturns:
Name Type Descriptiondict
dict
The Pipeline loaded from the file
Source code insrc/sageworks/api/pipeline_manager.py
def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline","title":"publish_pipeline(name, pipeline)
","text":"Save a Pipeline to S3
Parameters:
Name Type Description Defaultname
str
The name of the Pipeline
requiredpipeline
dict
The Pipeline to save
required Source code insrc/sageworks/api/pipeline_manager.py
def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline_from_file","title":"publish_pipeline_from_file(filepath)
","text":"Publish a Pipeline to SageWorks from a local file
Parameters:
Name Type Description Defaultfilepath
str
The path of the Pipeline to publish
required Source code insrc/sageworks/api/pipeline_manager.py
def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.save_pipeline_to_file","title":"save_pipeline_to_file(pipeline, filepath)
","text":"Save a Pipeline to a local file
Parameters:
Name Type Description Defaultpipeline
dict
The Pipeline to save
requiredfilepath
str
The path to save the Pipeline
required Source code insrc/sageworks/api/pipeline_manager.py
def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline","title":"Pipeline
","text":"Pipeline: SageWorks Pipeline API Class
Common Usagemy_pipeline = Pipeline(\"name\")\nmy_pipeline.details()\nmy_pipeline.execute() # Execute entire pipeline\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
Source code in src/sageworks/api/pipeline.py
class Pipeline:\n \"\"\"Pipeline: SageWorks Pipeline API Class\n\n Common Usage:\n ```\n my_pipeline = Pipeline(\"name\")\n my_pipeline.details()\n my_pipeline.execute() # Execute entire pipeline\n my_pipeline.execute_partial([\"data_source\", \"feature_set\"])\n my_pipeline.execute_partial([\"model\", \"endpoint\"])\n ```\n \"\"\"\n\n def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n self.s3_client = self.boto_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n # Data Storage Cache\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n\n def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n\n def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n\n def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n\n def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n\n def delete(self):\n 
\"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n self.data_storage.delete(f\"pipeline:{self.name}:details\")\n wr.s3.delete_objects(self.s3_path)\n\n def _get_pipeline(self) -> dict:\n \"\"\"Internal: Get the pipeline as a JSON object from the specified S3 bucket and key.\"\"\"\n response = self.s3_client.get_object(Bucket=self.bucket, Key=self.key)\n json_object = json.loads(response[\"Body\"].read())\n return json_object\n\n def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__init__","title":"__init__(name)
","text":"Pipeline Init Method
Source code insrc/sageworks/api/pipeline.py
def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto_session = AWSAccountClamp().boto_session()\n self.s3_client = self.boto_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n # Data Storage Cache\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__repr__","title":"__repr__()
","text":"String representation of this pipeline
Returns:
Name Type Descriptionstr
str
String representation of this pipeline
Source code insrc/sageworks/api/pipeline.py
def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.delete","title":"delete()
","text":"Pipeline Deletion
Source code insrc/sageworks/api/pipeline.py
def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n self.data_storage.delete(f\"pipeline:{self.name}:details\")\n wr.s3.delete_objects(self.s3_path)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute","title":"execute()
","text":"Execute the entire Pipeline
Raises:
Type DescriptionRunTimeException
If the pipeline execution fails in any way
Source code insrc/sageworks/api/pipeline.py
def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute_partial","title":"execute_partial(subset)
","text":"Execute a partial Pipeline
Parameters:
Name Type Description Defaultsubset
list
A subset of the pipeline to execute
requiredRaises:
Type DescriptionRunTimeException
If the pipeline execution fails in any way
Source code insrc/sageworks/api/pipeline.py
def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.report_settable_fields","title":"report_settable_fields(pipeline={}, path='')
","text":"Recursively finds and prints keys with settable fields in a JSON-like dictionary.
Args: pipeline (dict): pipeline (or sub pipeline) to process. path (str): Current path to the key, used for nested dictionaries.
Source code insrc/sageworks/api/pipeline.py
def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_holdout_ids","title":"set_holdout_ids(id_column, holdout_ids)
","text":"Set the input for the Pipeline
Parameters:
Name Type Description Defaultid_list
list
The list of hold out ids
required Source code insrc/sageworks/api/pipeline.py
def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_input","title":"set_input(input, artifact='data_source')
","text":"Set the input for the Pipeline
Parameters:
Name Type Description Defaultinput
Union[str, DataFrame]
The input for the Pipeline
requiredartifact
str
The artifact to set the input for (default: \"data_source\")
'data_source'
Source code in src/sageworks/api/pipeline.py
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n
"},{"location":"api_classes/pipelines/#examples","title":"Examples","text":"Make a Pipeline
Pipelines are just JSON files (see sageworks/examples/pipelines/
). You can copy one and make changes to fit your objects/use case, or if you have a set of SageWorks artifacts created you can 'backtrack' from the Endpoint and have it create the Pipeline for you.
from pprint import pprint\nfrom sageworks.api.pipeline_manager import PipelineManager\n\n# Create a PipelineManager\nmy_manager = PipelineManager()\n\n# List the Pipelines\npprint(my_manager.list_pipelines())\n\n# Create a Pipeline from an Endpoint\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Publish the Pipeline\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Output
Listing Pipelines...\n[{'last_modified': datetime.datetime(2024, 4, 16, 21, 10, 6, tzinfo=tzutc()),\n 'name': 'abalone_pipeline_v1',\n 'size': 445}]\n
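You can also round-trip a Pipeline through a local JSON file using the PipelineManager file helpers documented above. This is just a sketch; the local file path is an illustrative placeholder.
pipeline_file_roundtrip.pyfrom sageworks.api.pipeline_manager import PipelineManager\n\n# Create a PipelineManager\nmy_manager = PipelineManager()\n\n# Create a Pipeline from an existing Endpoint\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Save the Pipeline to a local JSON file (path is just an example)\nmy_manager.save_pipeline_to_file(abalone_pipeline, \"abalone_pipeline_v1.json\")\n\n# ...edit the local file if needed, then publish it back to SageWorks\nmy_manager.publish_pipeline_from_file(\"abalone_pipeline_v1.json\")\n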
Pipeline Details pipeline_details.pyfrom pprint import pprint\nfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\npprint(my_pipeline.details())\n
Output
{\n \"name\": \"abalone_pipeline_v1\",\n \"s3_path\": \"s3://sandbox/pipelines/abalone_pipeline_v1.json\",\n \"pipeline\": {\n \"data_source\": {\n \"name\": \"abalone_data\",\n \"tags\": [\n \"abalone_data\"\n ],\n \"input\": \"/Users/briford/work/sageworks/data/abalone.csv\"\n },\n \"feature_set\": {\n \"name\": \"abalone_features\",\n \"tags\": [\n \"abalone_features\"\n ],\n \"input\": \"abalone_data\"\n },\n \"model\": {\n \"name\": \"abalone-regression\",\n \"tags\": [\n \"abalone\",\n \"regression\"\n ],\n \"input\": \"abalone_features\"\n },\n ...\n }\n}\n
Pipeline Execution
Pipeline Execution
Executing the Pipeline is obviously the most important reason for creating one. It gives you a reproducible way to capture, inspect, and run the same ML pipeline on different data (nightly).
pipeline_execution.pyfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\n\n# Execute the Pipeline\nmy_pipeline.execute() # Full execution\n\n# Partial executions\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
"},{"location":"api_classes/pipelines/#pipelines-advanced","title":"Pipelines Advanced","text":"As part of the flexible architecture sometimes DataSources or FeatureSets can be created with a Pandas DataFrame. To support a DataFrame as input to a pipeline we can call the set_input()
method to the pipeline object. If you'd like to specify the set_hold_out_ids()
you can also provide a list of ids.
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_hold_out_ids(self, id_list: list):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_list (list): The list of hold out ids\n \"\"\"\n self.pipeline[\"feature_set\"][\"hold_out_ids\"] = id_list\n
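Below is a minimal usage sketch of these two methods. The pipeline name and the DataFrame contents are illustrative placeholders, and depending on your SageWorks version the hold out method may instead be set_holdout_ids(id_column, holdout_ids) as shown in the API docs above.
pipeline_df_input.pyfrom sageworks.api.pipeline import Pipeline\nimport pandas as pd\n\n# Retrieve an existing Pipeline (hypothetical name)\nmy_pipeline = Pipeline(\"nightly_sol_pipeline_v1\")\n\n# Use a DataFrame as the input for the data_source stage (toy data)\ndf = pd.DataFrame({\"id\": [1, 2, 3], \"feature\": [0.1, 0.2, 0.3], \"sol\": [1.1, 2.2, 3.3]})\nmy_pipeline.set_input(df)\n\n# Optionally provide a list of hold out ids\nmy_pipeline.set_hold_out_ids([1, 3])\n\n# Execute the Pipeline with the new input\nmy_pipeline.execute()\n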
Running a pipeline creates and deploys a set of SageWorks Artifacts, DataSource, FeatureSet, Model and Endpoint. These artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"aws_setup/aws_access_management/","title":"AWS Acesss Management","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page gives an overview of how SageWorks sets up roles and policies in a granular way that provides 'least privilege' and a unified framework for AWS access management.
"},{"location":"aws_setup/aws_access_management/#conceptual-slide-deck","title":"Conceptual Slide Deck","text":"SageWorks AWS Acesss Management
"},{"location":"aws_setup/aws_access_management/#aws-resources","title":"AWS Resources","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page tries to give helpful guidance when setting up AWS Accounts, Users, and Groups. In general AWS can be a bit tricky to set up the first time. Feel free to use any material in this guide but we're more than happy to help clients get their AWS Setup ready to go for FREE. Below are some guides for setting up a new AWS account for SageWorks and also setting up SSO Users and Groups within AWS.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-with-aws-organizations-easy","title":"New AWS Account (with AWS Organizations: easy)","text":"Email Trick
AWS will often not allow the same email to be used for different accounts. If you need a 'new' email just add a plus sign '+' at the end of your existing email (e.g. bob.smith+aws@gmail.com). This email will 'auto forward' to bob.smith@gmail.com.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-without-aws-organizations-a-bit-harder","title":"New AWS Account (without AWS Organizations: a bit harder)","text":"AWS SSO (Single Sign-On) is a cloud-based service that allows users to manage access to multiple AWS accounts and business applications using a single set of credentials. It simplifies the authentication process for users and provides centralized management of permissions and access control across various AWS resources. With AWS SSO, users can log in once and access all the applications and accounts they need, streamlining the user experience and increasing productivity. AWS SSO also enables IT administrators to manage access more efficiently by providing a single point of control for managing user access, permissions, and policies, reducing the risk of unauthorized access or security breaches.
"},{"location":"aws_setup/aws_tips_and_tricks/#setting-up-sso-users","title":"Setting up SSO Users","text":"The 'Add User' setup is fairly straight forward but here are some screen shots:
On the first panel you can fill in the users information.
"},{"location":"aws_setup/aws_tips_and_tricks/#groups","title":"Groups","text":"On the second panel we suggest that you have at LEAST two groups:
This allows you to put most of the users into the DataScientists group that has AWS policies based on their job role. AWS uses 'permission sets' and you assign AWS Policies. This approach makes it easy to give a group of users a set of relevant policies for their tasks.
Our standard setup is to have two permission sets with the following policies:
Add Policy: arn:aws:iam::aws:policy/job-function/DataScientist
IAM Identity Center --> Permission sets --> AdministratorAccess
See: Permission Sets for more details and instructions.
Another benefit of creating groups is that you can include that group in 'Trust Policy (assume_role)' for the SageWorks-ExecutionRole (this gets deployed as part of the SageWorks AWS Stack). This means that the management of what SageWorks can do/see/read/write is completely done through the SageWorks-ExecutionRole.
"},{"location":"aws_setup/aws_tips_and_tricks/#back-to-adding-user","title":"Back to Adding User","text":"Okay now that we have our groups set up we can go back to our original goal of adding a user. So here's the second panel with the groups and now we can hit 'Next'
On the third panel just review the details and hit the 'Add User' button at the bottom. The user will get an email giving them instructions on how to log on to their AWS account.
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-console","title":"AWS Console","text":"Now when the user logs onto the AWS Console they should see something like this:
"},{"location":"aws_setup/aws_tips_and_tricks/#sso-setup-for-command-linepython-usage","title":"SSO Setup for Command Line/Python Usage","text":"Please see our SSO Setup
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-resources","title":"AWS Resources","text":"Welcome to the SageWorks AWS Setup Guide. SageWorks is deployed as an AWS Stack following the well architected system practices of AWS.
AWS Setup can be a bit complex
Setting up SageWorks with AWS can be a bit complex, but this only needs to be done ONCE for your entire company. The install uses standard CDK --> AWS Stacks and SageWorks tries to make it straightforward. If you have any troubles at all feel free to contact us at sageworks@supercowpowers.com or on Discord and we're happy to help you with AWS for FREE.
"},{"location":"aws_setup/core_stack/#two-main-options-when-using-sageworks","title":"Two main options when using SageWorks","text":"Either of these options are fully supported, but we highly suggest a NEW account as it gives the following benefits:
If your AWS Account already has users and groups set up you can skip this, but here are our recommendations on setting up SSO Users and Groups
"},{"location":"aws_setup/core_stack/#onboarding-sageworks-to-your-aws-account","title":"Onboarding SageWorks to your AWS Account","text":"Pulling down the SageWorks Repo
git clone https://github.com/SuperCowPowers/sageworks.git\n
"},{"location":"aws_setup/core_stack/#sageworks-uses-aws-python-cdk-for-deployments","title":"SageWorks uses AWS Python CDK for Deployments","text":"If you don't have AWS CDK already installed you can do these steps:
Mac
brew install node \nnpm install -g aws-cdk\n
Linux sudo apt install nodejs\nsudo npm install -g aws-cdk\n
For more information on Linux installs see Digital Ocean NodeJS"},{"location":"aws_setup/core_stack/#create-an-s3-bucket-for-sageworks","title":"Create an S3 Bucket for SageWorks","text":"SageWorks pushes and pulls data from AWS and will use this S3 Bucket for storage and processing. You should create a NEW S3 Bucket; we suggest a name like <company_name>-sageworks
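If you prefer the command line, the bucket can also be created with the AWS CLI (the bucket name and region below are just examples; use your own):
aws s3 mb s3://<company_name>-sageworks --region us-west-2\n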
Do the initial setup/config here: Getting Started. After you've done that come back to this section. For Stack Deployment additional things need to be added to your config file. The config file will be located in your home directory ~/.sageworks/sageworks_config.json
. Edit this file and add additional settings for the deployment. Specifically there are two additional fields to be added (both optional)
\"SAGEWORKS_SSO_GROUP\": DataScientist (or whatever)\n\"SAGEWORKS_ADDITIONAL_BUCKETS\": \"bucket1, bucket2\n
These are optional but are set/used by most SageWorks users. AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_core\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/core_stack/#aws-account-setup-check","title":"AWS Account Setup Check","text":"After setting up SageWorks config/AWS Account you can run this test/checking script. If the results ends with INFO AWS Account Clamp: AOK!
you're in good shape. If not feel free to contact us on Discord and we'll get it straightened out for you :)
pip install sageworks (if not already installed)\ncd sageworks/aws_setup\npython aws_account_check.py\n<lot of print outs for various checks>\n2023-04-12 11:17:09 (aws_account_check.py:48) INFO AWS Account Clamp: AOK!\n
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/dashboard_stack/","title":"Deploy the SageWorks Dashboard Stack","text":"Deploying the Dashboard Stack is reasonably straight forward, it's the same approach as the Core Stack that you've already deployed.
Please review the Stack Details section to understand all the AWS components that are included and utilized in the SageWorks Dashboard Stack.
"},{"location":"aws_setup/dashboard_stack/#deploying-the-dashboard-stack","title":"Deploying the Dashboard Stack","text":"AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_dashboard_full\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/dashboard_stack/#stack-details","title":"Stack Details","text":"AWS Questions?
There's quite a bit to unpack when deploying an AWS powered Web Service. We're happy to help walk you through the details and options. Contact us anytime for a free consultation.
AWS Costs
Deploying the SageWorks Dashboard does incur some monthly AWS costs. If you're on a tight budget you can deploy the 'lite' version of the Dashboard Stack.
cd sageworks/aws_setup/sageworks_dashboard_lite\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/domain_cert_setup/","title":"AWS Domain and Certificate Instructions","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page tries to give helpful guidance when setting up a new domain and SSL Certificate in your AWS Account.
"},{"location":"aws_setup/domain_cert_setup/#new-domain","title":"New Domain","text":"You'll want the SageWorks Dashboard to have a domain for your companies internal use. Customers will typically use a domain like <company_name>-ml-dashboard.com
but you are free to choose any domain you'd like.
Domains are tied to AWS Accounts
When you create a new domain in AWS Route 53, that domain is tied to that AWS Account. You can do a cross account setup for domains but it's a bit more tricky. We recommend that each account where SageWorks gets deployed owns the domain for that Dashboard.
"},{"location":"aws_setup/domain_cert_setup/#multiple-aws-accounts","title":"Multiple AWS Accounts","text":"Many customers will have a dev/stage/prod set of AWS accounts, if that the case then the best practice is to make a domain specific to each account. So for instance:
<company_name>-ml-dashboard-dev.com
<company_name>-ml-dashboard-prod.com
This means that when you go to that Dashboard it's super obvious which environment you're on.
"},{"location":"aws_setup/domain_cert_setup/#register-the-domain","title":"Register the Domain","text":"Open Route 53 Console Route 53 Console
Register your New Domain
Open ACM Console: AWS Certificate Manager (ACM) Console
Request a Certificate:
Add Domain Names:
yourdomain.com
).www.yourdomain.com
).Validation Method:
Add Tags (Optional):
Review and Request:
To complete the domain validation process for your SSL/TLS certificate, you need to add the CNAME records provided by AWS Certificate Manager (ACM) to your Route 53 hosted zone. This step ensures that you own the domain and allows ACM to issue the certificate.
"},{"location":"aws_setup/domain_cert_setup/#finding-cname-record-names-and-values","title":"Finding CNAME Record Names and Values","text":"You can find the CNAME record names and values in the AWS Certificate Manager (ACM) console:
Open ACM Console: AWS Certificate Manager (ACM) Console
Select Your Certificate:
View Domains Section:
Open Route 53 Console: Route 53 Console
Select Your Hosted Zone:
yourdomain.com
).Add the First CNAME Record:
_3e8623442477e9eeec.your-domain.com
).CNAME
._0908c89646d92.sdgjtdhdhz.acm-validations.aws.
) (include the trailing dot).Add the Second CNAME Record:
_75cd9364c643caa.www.your-domain.com
).CNAME
._f72f8cff4fb20f4.sdgjhdhz.acm-validations.aws.
) (include the trailing dot).DNS Propagation and Cert Validation
After adding the CNAME records, these DNS records will propagate through the DNS system and ACM will automatically detect the validation records and validate the domain. This process can take a few minutes or up to an hour.
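If you'd rather script these records than click through the console, the same CNAME records can be added with the AWS CLI. This is just a sketch; the hosted zone id, record name, and record value are placeholders copied from your own ACM certificate.
aws route53 change-resource-record-sets --hosted-zone-id <your-hosted-zone-id> --change-batch '{\"Changes\": [{\"Action\": \"UPSERT\", \"ResourceRecordSet\": {\"Name\": \"<cname-record-name-from-acm>\", \"Type\": \"CNAME\", \"TTL\": 300, \"ResourceRecords\": [{\"Value\": \"<cname-record-value-from-acm>\"}]}}]}'\n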
"},{"location":"aws_setup/domain_cert_setup/#certificate-states","title":"Certificate States","text":"After requesting a certificate, it will go through the following states:
Pending Validation: The initial state after you request a certificate and before you complete the validation process. ACM is waiting for you to prove domain ownership by adding the CNAME records.
Issued: This state indicates that the certificate has been successfully validated and issued. You can now use this certificate with your AWS resources.
Validation Timed Out: If you do not complete the validation process within a specified period (usually 72 hours), the certificate request times out and enters this state.
Revoked: This state indicates that the certificate has been revoked and is no longer valid.
Failed: If the validation process fails for any reason, the certificate enters this state.
Inactive: This state indicates that the certificate is not currently in use.
The certificate status should obviously be in the Issued state, if not please contact SageWorks Support Team.
"},{"location":"aws_setup/domain_cert_setup/#retrieving-the-certificate-arn","title":"Retrieving the Certificate ARN","text":"Open ACM Console:
Check the Status:
Copy the Certificate ARN:
You now have the ARN for your certificate, which you can use in your AWS resources such as API Gateway, CloudFront, etc.
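As an alternative to copying the ARN from the console, you can also list issued certificates from the command line (the region shown is just an example; use the region where the certificate was requested):
aws acm list-certificates --certificate-statuses ISSUED --region us-east-1\n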
"},{"location":"aws_setup/domain_cert_setup/#aws-resources","title":"AWS Resources","text":"Now that the core Sageworks AWS Stack has been deployed. Let's test out SageWorks by building a full entire AWS ML Pipeline from start to finish. The script build_ml_pipeline.py
uses the SageWorks API to quickly and easily build an AWS Modeling Pipeline.
Taste the Awesome
The SageWorks \"hello world\" builds a full AWS ML Pipeline. From S3 to deployed model and endpoint. If you have any troubles at all feel free to contact us at sageworks email or on Discord and we're happy to help you for FREE.
This script will take a LONG TIME to run; most of the time is waiting on AWS to finalize FeatureGroups, train Models, or deploy Endpoints.
\u276f python build_ml_pipeline.py\n<lot of building ML pipeline outputs>\n
After the script completes you will see that it's built out an AWS ML Pipeline and testing artifacts."},{"location":"aws_setup/full_pipeline/#run-the-sageworks-dashboard-local","title":"Run the SageWorks Dashboard (Local)","text":"Dashboard AWS Stack
Deploying the Dashboard Stack is straight-forward and provides a robust AWS Web Server with Load Balancer, Elastic Container Service, VPC Networks, etc. (see AWS Dashboard Stack)
For testing it's nice to run the Dashboard locally, but for long-term use the SageWorks Dashboard should be deployed as an AWS Stack. The deployed Stack allows everyone in the company to use, view, and interact with the AWS Machine Learning Artifacts created with SageWorks.
cd sageworks/application/aws_dashboard\n./dashboard\n
This will open a browser to http://localhost:8000 SageWorks Dashboard: AWS Pipelines in a Whole New Light!
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/sso_setup/","title":"AWS SSO Setup","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"aws_setup/sso_setup/#get-some-information","title":"Get some information","text":"If you're connecting to the SCP AWS Account you can use these values
Start URL: https://supercowpowers.awsapps.com/start \nRegion: us-west-2\n
"},{"location":"aws_setup/sso_setup/#install-aws-cli","title":"Install AWS CLI","text":"Mac brew install awscli
Linux pip install awscli
Windows
Download the MSI installer (top right corner on this page) https://aws.amazon.com/cli/ and follow the installation instructions.
"},{"location":"aws_setup/sso_setup/#running-the-sso-configuration","title":"Running the SSO Configuration","text":"Note: You only need to do this once!
aws configure sso --profile <whatever you want> (e.g. aws_sso)\nSSO session name (Recommended): sso-session\nSSO start URL []: <the Start URL from info above>\nSSO region []: <the Region from info above>\nSSO registration scopes [sso:account:access]: <just hit return>\n
You will get a browser open/redirect at this point and get a list of available accounts.. something like below, just pick the correct account
There are 2 AWS accounts available to you.\n> SCP_Sandbox, briford+sandbox@supercowpowers.com (XXXX40646YYY)\n SCP_Main, briford@supercowpowers.com (XXX576391YYY)\n
Now pick the role that you're going to use
There are 2 roles available to you.\n> DataScientist\n AdministratorAccess\n\nCLI default client Region [None]: <same region as above>\nCLI default output format [None]: json\n
"},{"location":"aws_setup/sso_setup/#setting-up-some-aliases-for-bashzsh","title":"Setting up some aliases for bash/zsh","text":"Edit your favorite ~/.bashrc ~/.zshrc and add these nice aliases/helper
# AWS Aliases\nalias bob_sso='export AWS_PROFILE=bob_sso'\n\n# Default AWS Profile\nexport AWS_PROFILE=bob_sso\n
"},{"location":"aws_setup/sso_setup/#testing-your-new-aws-profile","title":"Testing your new AWS Profile","text":"Make sure your profile is active/set
env | grep AWS\nAWS_PROFILE=<bob_sso or whatever>\n
Now you can list the S3 buckets in the AWS Account aws s3 ls\n
If you get some message like this... The SSO session associated with this profile has\nexpired or is otherwise invalid. To refresh this SSO\nsession run aws sso login with the corresponding\nprofile.\n
This is fine/good; run aws sso login --profile <your profile> and a browser will open up so you can refresh your SSO Token.
After that you should get a listing of the S3 buckets without needing to refresh your token.
\u276f aws s3 ls\n2023-03-20 20:06:53 aws-athena-query-results-XXXYYY-us-west-2\n2023-03-30 13:22:28 sagemaker-studio-XXXYYY-dbgyvq8ruka\n2023-03-24 22:05:55 sagemaker-us-west-2-XXXYYY\n2023-04-30 13:43:29 scp-sageworks-artifacts\n
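You can also sanity-check the profile from Python (SageWorks uses boto3 under the hood). Here's a minimal sketch; the profile name bob_sso is just the example alias from above:
import boto3

# Resolve credentials from the SSO profile (or rely on AWS_PROFILE being set)
session = boto3.Session(profile_name="bob_sso")

# STS tells you which account and role the credentials map to
identity = session.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])

# Python equivalent of `aws s3 ls`
for bucket in session.client("s3").list_buckets()["Buckets"]:
    print(bucket["Name"])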
"},{"location":"aws_setup/sso_setup/#back-to-initial-setup","title":"Back to Initial Setup","text":"If you're doing the initial setup of SageWorks you should now go back and finish that process: Getting Started
"},{"location":"aws_setup/sso_setup/#aws-resources","title":"AWS Resources","text":"Just Getting Started?
The SageWorks Blogs are a great way to see what's possible with SageWorks. Also, if you're ready to jump in, the API Classes will give you details on the SageWorks ML Pipeline Classes.
"},{"location":"blogs_research/#blogs","title":"Blogs","text":"Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples
SageWorks EDA
The SageWorks toolkit provides a set of plots that show EDA results; it also has a flexible plugin architecture to expand, enhance, or even replace the current set of web components in the Dashboard.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created, that data is run through a full set of EDA techniques: descriptive stats, correlations, value counts, column stats, outlier detection, and sample rows (see the sketch below).
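Here's a minimal sketch of pulling those EDA results once a DataSource exists, using the AthenaSource methods documented later in this section; the table name abalone_data is hypothetical, and the import path is assumed from the source location shown below:
from sageworks.core.artifacts.athena_source import AthenaSource

# Hypothetical DataSource/table name -- use one that exists in your account
ds = AthenaSource("abalone_data", database="sageworks")

# EDA results are computed when the DataSource is created and cached in its metadata
stats = ds.descriptive_stats()   # min/q1/median/q3/max for each numeric column
corrs = ds.correlations()        # column-to-column correlation values
counts = ds.value_counts()       # value counts for each string column
outlier_df = ds.outliers()       # DataFrame of outlier rows
print(stats)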
One of the latest EDA techniques we've added is a concept called High Target Gradients
[G_{ij} = \\frac{|y_i - y_j|}{d(x_i, x_j)}]
where (d(x_i, x_j)) is the distance between (x_i) and (x_j) in the feature space. This equation gives you the rate of change of the target value with respect to the change in features, similar to a slope in a two-dimensional space.
[G_{i}^{max} = \\max_{j \\neq i} G_{ij}]
This gives you a scalar value for each point in your training data that represents the maximum rate of change of the target value in its local neighborhood.
Usage: You can use (G_{i}^{max}) to identify and filter areas in the feature space that have high target gradients, which may indicate potential issues with data quality or feature representation.
Visualization: Plotting the distribution of (G_{i}^{max}) values or visualizing them in the context of the feature space can help you identify regions or specific points that warrant further investigation.
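A minimal numpy/scipy sketch of this computation (not the SageWorks implementation), assuming X is a feature matrix and y the corresponding target vector:
import numpy as np
from scipy.spatial.distance import cdist

def max_target_gradients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Compute G_i^max = max_{j != i} |y_i - y_j| / d(x_i, x_j) for each observation."""
    d = cdist(X, X)                       # pairwise distances in feature space
    np.fill_diagonal(d, np.inf)           # exclude j == i
    d[d == 0] = np.finfo(float).eps       # duplicate feature rows: treat as near-zero distance
    g = np.abs(y[:, None] - y[None, :]) / d
    return g.max(axis=1)

# Two nearly identical feature rows with very different targets -> large gradient
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
y = np.array([1.0, 10.0, 1.0])
print(max_target_gradients(X, y))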
Overview and Definition Residual analysis involves examining the differences between observed and predicted values, known as residuals, to assess the performance of a predictive model. It is a critical step in model evaluation as it helps identify patterns of errors, diagnose potential problems, and improve model performance. By understanding where and why a model's predictions deviate from actual values, we can make informed adjustments to the model or the data to enhance accuracy and robustness.
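A minimal sketch of the basic computation, assuming you already have observed values y_true and model predictions y_pred:
import numpy as np
import pandas as pd

def residual_report(y_true: np.ndarray, y_pred: np.ndarray, top_n: int = 10) -> pd.DataFrame:
    """Return the top_n observations with the largest absolute residuals."""
    df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
    df["residual"] = df["y_true"] - df["y_pred"]
    df["abs_residual"] = df["residual"].abs()
    return df.sort_values("abs_residual", ascending=False).head(top_n)

# Hypothetical values for illustration
print(residual_report(np.array([3.1, 2.8, 5.0, 7.2]), np.array([3.0, 2.9, 6.5, 7.1]), top_n=2))
The common causes of large residuals break down roughly as follows: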
Sparse Data Regions The observation is in a part of feature space with few or no nearby training observations, leading to poor generalization in these regions and resulting in high prediction errors.
Noisy/Inconsistent Data and Preprocessing Issues The observation is in a part of feature space where the training data is noisy, incorrect, or has high variance in the target variable. Additionally, missing values or incorrect data transformations can introduce errors, leading to unreliable predictions and high residuals.
Feature Resolution The current feature set may not fully resolve the compounds, leading to \u2018collisions\u2019 where different compounds are assigned identical features; these unaccounted structural or chemical nuances cause high residuals.
Activity Cliffs Structurally similar compounds exhibit significantly different activities, making accurate prediction challenging due to steep changes in activity with minor structural modifications.
Feature Engineering Issues Irrelevant or redundant features and poor feature scaling can negatively impact the model's performance and accuracy, resulting in higher residuals.
Model Overfitting or Underfitting Overfitting occurs when the model is too complex and captures noise, while underfitting happens when the model is too simple and misses underlying patterns, both leading to inaccurate predictions.
"},{"location":"concepts/model_monitoring/","title":"Model Monitoring","text":"Amazon SageMaker Model Monitor currently provides the following types of monitoring:
SageWorks Core Classes
These classes interact with many of the AWS service details and are therefore more complex. They provide additional control and refinement over the AWS ML Pipeline. For most use cases, the API Classes should be used
Welcome to the SageWorks Core Classes
The Core Classes provide low-level APIs for the SageWorks package; these classes directly interface with the AWS SageMaker Pipeline interfaces and have a large number of methods with reasonable complexity.
The API Classes have method pass-through, so just call the method on the API Class and voil\u00e0, it works the same.
"},{"location":"core_classes/overview/#artifacts","title":"Artifacts","text":"Transforms are a set of classes that transform one type of Artifact
to another type. For instance DataToFeatureSet
takes a DataSource
artifact and creates a FeatureSet
artifact.
API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on any class that inherits from the Artifact Class and voil\u00e0, it works the same.
The SageWorks Artifact class is a base/abstract class that defines the API implemented by all the child classes (DataSource, FeatureSet, Model, Endpoint).
Artifact: Abstract Base Class for all Artifact classes in SageWorks. Artifacts simply reflect and aggregate one or more AWS Services
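As a quick sketch of what that looks like in practice (assuming a DataSource named abalone_data already exists in your account, and with the import path inferred from the source location shown below), the methods documented here are available on any concrete artifact:
from sageworks.core.artifacts.athena_source import AthenaSource

# Any concrete artifact (here a DataSource backed by Athena) inherits the Artifact API
ds = AthenaSource("abalone_data")

print(ds.summary())               # generic info: arn, size, created, modified, tags, ...
print(ds.get_tags())              # user tags stored as AWS tags
ds.set_owner("data_science")      # stored in the sageworks_owner metadata
print(ds.health_check())          # e.g. ["needs_onboard"] if the artifact isn't ready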
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact","title":"Artifact
","text":" Bases: ABC
Artifact: Abstract Base Class for all Artifact classes in SageWorks
Source code insrc/sageworks/core/artifacts/artifact.py
class Artifact(ABC):\n \"\"\"Artifact: Abstract Base Class for all Artifact classes in SageWorks\"\"\"\n\n log = logging.getLogger(\"sageworks\")\n\n def __init__(self, uuid: str):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n \"\"\"\n self.uuid = uuid\n\n # Set up our Boto3 and SageMaker Session and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n self.aws_region = self.aws_account_clamp.region\n\n # The Meta() class pulls and collects metadata from a bunch of AWS Services\n self.aws_broker = AWSServiceBroker()\n from sageworks.api.meta import Meta\n\n self.meta_broker = Meta()\n\n # Config Manager Checks\n self.cm = ConfigManager()\n if not self.cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # Grab our SageWorks Bucket from Config\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Setup Bucket Paths\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Data Cache for Artifacts\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n self.temp_storage = SageWorksCache(prefix=\"temp_storage\", expire=300) # 5 minutes\n self.ephemeral_storage = SageWorksCache(prefix=\"ephemeral_storage\", expire=1) # 1 second\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? 
(very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n\n @classmethod\n def ensure_valid_name(cls, name: str, delimiter: str = \"_\"):\n \"\"\"Check if the ID adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Raises:\n ValueError: If the name/id is not valid.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter)\n if name != valid_name:\n error_msg = f\"{name} doesn't conform and should be converted to: {valid_name}\"\n cls.log.error(error_msg)\n raise ValueError(error_msg)\n\n @staticmethod\n def generate_valid_name(name: str, delimiter: str = \"_\") -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string\n\n Args:\n name (str): The name/id string to check\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Returns:\n str: A generated valid name/id\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"]).lower()\n valid_name = valid_name.replace(\"_\", delimiter)\n valid_name = valid_name.replace(\"-\", delimiter)\n return valid_name\n\n @abstractmethod\n def exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources/Graphs. DataSource/Graph classes need to override this method.\n \"\"\"\n # First, check our cache\n meta_data_key = f\"{self.uuid}_sageworks_meta\"\n meta_data = self.ephemeral_storage.get(meta_data_key)\n if meta_data is not None:\n return meta_data\n\n # Otherwise, fetch the metadata from AWS, store it in the cache, and return it\n meta_data = list_tags_with_throttle(self.arn(), self.sm_session)\n self.ephemeral_storage.set(meta_data_key, meta_data)\n return meta_data\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n\n @abstractmethod\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n\n def ready(self) -> bool:\n \"\"\"Is the Artifact ready? 
Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n\n @abstractmethod\n def onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n\n @abstractmethod\n def details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n\n @abstractmethod\n def size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n\n @abstractmethod\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n\n @abstractmethod\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n\n @abstractmethod\n def arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n\n @abstractmethod\n def delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Upserting SageWorks Metadata for Artifact: {aws_arn}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n\n def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. 
The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n\n def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n\n def set_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_tags\": self.tag_delimiter.join(tags)})\n\n def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n # Syntactic sugar for health tags\n def get_health_tags(self):\n return self.get_tags(tag_type=\"health\")\n\n def set_health_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_health_tags\": self.tag_delimiter.join(tags)})\n\n def add_health_tag(self, tag):\n self.add_tag(tag, tag_type=\"health\")\n\n def remove_health_tag(self, tag):\n self.remove_sageworks_tag(tag, tag_type=\"health\")\n\n # Owner of this artifact\n def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n\n def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n\n def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n\n def set_input(self, input_data: str):\n \"\"\"Set the input data for this 
artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n\n def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n\n def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n if \"unknown\" in self.aws_url():\n health_issues.append(\"aws_url_unknown\")\n return health_issues\n\n def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n\n def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__init__","title":"__init__(uuid)
","text":"Initialize the Artifact Base Class
Parameters:
Name Type Description Defaultuuid
str
The UUID of this artifact
required Source code insrc/sageworks/core/artifacts/artifact.py
def __init__(self, uuid: str):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n \"\"\"\n self.uuid = uuid\n\n # Set up our Boto3 and SageMaker Session and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n self.aws_region = self.aws_account_clamp.region\n\n # The Meta() class pulls and collects metadata from a bunch of AWS Services\n self.aws_broker = AWSServiceBroker()\n from sageworks.api.meta import Meta\n\n self.meta_broker = Meta()\n\n # Config Manager Checks\n self.cm = ConfigManager()\n if not self.cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # Grab our SageWorks Bucket from Config\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Setup Bucket Paths\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Data Cache for Artifacts\n self.data_storage = SageWorksCache(prefix=\"data_storage\")\n self.temp_storage = SageWorksCache(prefix=\"temp_storage\", expire=300) # 5 minutes\n self.ephemeral_storage = SageWorksCache(prefix=\"ephemeral_storage\", expire=1) # 1 second\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__post_init__","title":"__post_init__()
","text":"Artifact Post Initialization
Source code insrc/sageworks/core/artifacts/artifact.py
def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? (very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__repr__","title":"__repr__()
","text":"String representation of this artifact
Returns:
Name Type Descriptionstr
str
String representation of this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.add_tag","title":"add_tag(tag, tag_type='user')
","text":"Add a tag for this artifact, ensuring no duplicates and maintaining order. Args: tag (str): Tag to add for this artifact tag_type (str): Type of tag to add (user or health)
Source code insrc/sageworks/core/artifacts/artifact.py
def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.arn","title":"arn()
abstractmethod
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_meta","title":"aws_meta()
abstractmethod
","text":"Get the full AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_url","title":"aws_url()
abstractmethod
","text":"AWS console/web interface for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.created","title":"created()
abstractmethod
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete","title":"delete()
abstractmethod
","text":"Delete this artifact including all related AWS objects
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete_metadata","title":"delete_metadata(key_to_delete)
","text":"Delete specific metadata from this artifact Args: key_to_delete (str): Metadata key to delete
Source code insrc/sageworks/core/artifacts/artifact.py
def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.details","title":"details()
abstractmethod
","text":"Additional Details about this Artifact
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ensure_valid_name","title":"ensure_valid_name(name, delimiter='_')
classmethod
","text":"Check if the ID adheres to the naming conventions for this Artifact.
Parameters:
Name Type Description Defaultname
str
The name/id to check.
requireddelimiter
str
The delimiter to use in the name/id string (default: \"_\")
'_'
Raises:
Type DescriptionValueError
If the name/id is not valid.
Source code insrc/sageworks/core/artifacts/artifact.py
@classmethod\ndef ensure_valid_name(cls, name: str, delimiter: str = \"_\"):\n \"\"\"Check if the ID adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Raises:\n ValueError: If the name/id is not valid.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter)\n if name != valid_name:\n error_msg = f\"{name} doesn't conform and should be converted to: {valid_name}\"\n cls.log.error(error_msg)\n raise ValueError(error_msg)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.exists","title":"exists()
abstractmethod
","text":"Does the Artifact exist? Can we connect to it?
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Artifact when it's ready Returns: list[str]: List of expected metadata keys
Source code insrc/sageworks/core/artifacts/artifact.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.generate_valid_name","title":"generate_valid_name(name, delimiter='_')
staticmethod
","text":"Only allow letters and the specified delimiter, also lowercase the string
Parameters:
Name Type Description Defaultname
str
The name/id string to check
requireddelimiter
str
The delimiter to use in the name/id string (default: \"_\")
'_'
Returns:
Name Type Descriptionstr
str
A generated valid name/id
Source code insrc/sageworks/core/artifacts/artifact.py
@staticmethod\ndef generate_valid_name(name: str, delimiter: str = \"_\") -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string\n\n Args:\n name (str): The name/id string to check\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n\n Returns:\n str: A generated valid name/id\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"]).lower()\n valid_name = valid_name.replace(\"_\", delimiter)\n valid_name = valid_name.replace(\"-\", delimiter)\n return valid_name\n
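For example, the helper keeps only letters, digits, underscores, and dashes, lowercases the result, and then normalizes to the chosen delimiter; a small sketch with a made-up name:
from sageworks.core.artifacts.artifact import Artifact

# "Abalone-Data 2024!" -> drops the space and "!", lowercases, converts "-" to "_"
print(Artifact.generate_valid_name("Abalone-Data 2024!"))   # -> "abalone_data2024"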
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_input","title":"get_input()
","text":"Get the input data for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_owner","title":"get_owner()
","text":"Get the owner of this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_status","title":"get_status()
","text":"Get the status for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_tags","title":"get_tags(tag_type='user')
","text":"Get the tags for this artifact Args: tag_type (str): Type of tags to return (user or health) Returns: list[str]: List of tags for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.health_check","title":"health_check()
","text":"Perform a health check on this artifact Returns: list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/artifact.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n if \"unknown\" in self.aws_url():\n health_issues.append(\"aws_url_unknown\")\n return health_issues\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.modified","title":"modified()
abstractmethod
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.onboard","title":"onboard()
abstractmethod
","text":"Onboard this Artifact into SageWorks Returns: bool: True if the Artifact was successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ready","title":"ready()
","text":"Is the Artifact ready? Is initial setup complete and expected metadata populated?
Source code insrc/sageworks/core/artifacts/artifact.py
def ready(self) -> bool:\n \"\"\"Is the Artifact ready? Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.refresh_meta","title":"refresh_meta()
abstractmethod
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_meta","title":"remove_sageworks_meta(key_to_remove)
","text":"Remove SageWorks specific metadata from this Artifact Args: key_to_remove (str): The metadata key to remove Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_tag","title":"remove_sageworks_tag(tag, tag_type='user')
","text":"Remove a tag from this artifact if it exists. Args: tag (str): Tag to remove from this artifact tag_type (str): Type of tag to remove (user or health)
Source code insrc/sageworks/core/artifacts/artifact.py
def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources/Graphs. DataSource/Graph classes need to override this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources/Graphs. DataSource/Graph classes need to override this method.\n \"\"\"\n # First, check our cache\n meta_data_key = f\"{self.uuid}_sageworks_meta\"\n meta_data = self.ephemeral_storage.get(meta_data_key)\n if meta_data is not None:\n return meta_data\n\n # Otherwise, fetch the metadata from AWS, store it in the cache, and return it\n meta_data = list_tags_with_throttle(self.arn(), self.sm_session)\n self.ephemeral_storage.set(meta_data_key, meta_data)\n return meta_data\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_input","title":"set_input(input_data)
","text":"Set the input data for this artifact
Parameters:
Name Type Description Defaultinput_data
str
Name of input data for this artifact
requiredNote: This breaks the official provenance of the artifact, so use with caution.
Source code insrc/sageworks/core/artifacts/artifact.py
def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_owner","title":"set_owner(owner)
","text":"Set the owner of this artifact
Parameters:
Name Type Description Defaultowner
str
Owner to set for this artifact
required Source code insrc/sageworks/core/artifacts/artifact.py
def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_status","title":"set_status(status)
","text":"Set the status for this artifact Args: status (str): Status to set for this artifact
Source code insrc/sageworks/core/artifacts/artifact.py
def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.size","title":"size()
abstractmethod
","text":"Return the size of this artifact in MegaBytes
Source code insrc/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.summary","title":"summary()
","text":"This is generic summary information for all Artifacts. If you want to get more detailed information, call the details() method which is implemented by the specific Artifact class
Source code insrc/sageworks/core/artifacts/artifact.py
def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact Args: new_meta (dict): Dictionary of NEW metadata to add Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code insrc/sageworks/core/artifacts/artifact.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Upserting SageWorks Metadata for Artifact: {aws_arn}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n
"},{"location":"core_classes/artifacts/athena_source/","title":"AthenaSource","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the DataSource API Class and voil\u00e0 it works the same.
AthenaSource: SageWorks Data Source accessible through Athena
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource","title":"AthenaSource
","text":" Bases: DataSourceAbstract
AthenaSource: SageWorks Data Source accessible through Athena
Common Usagemy_data = AthenaSource(data_uuid, database=\"sageworks\")\nmy_data.summary()\nmy_data.details()\ndf = my_data.query(f\"select * from {data_uuid} limit 5\")\n
Source code in src/sageworks/core/artifacts/athena_source.py
class AthenaSource(DataSourceAbstract):\n \"\"\"AthenaSource: SageWorks Data Source accessible through Athena\n\n Common Usage:\n ```\n my_data = AthenaSource(data_uuid, database=\"sageworks\")\n my_data.summary()\n my_data.details()\n df = my_data.query(f\"select * from {data_uuid} limit 5\")\n ```\n \"\"\"\n\n def __init__(self, data_uuid, database=\"sageworks\", force_refresh: bool = False):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n force_refresh (bool): Force refresh of AWS Metadata (default: False)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.ensure_valid_name(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database)\n\n # Flag for metadata cache refresh logic\n self.metadata_refresh_needed = False\n\n # Setup our AWS Metadata Broker\n self.catalog_table_meta = self.meta_broker.data_source_details(\n data_uuid, self.get_database(), refresh=force_refresh\n )\n if self.catalog_table_meta is None:\n self.log.important(f\"Unable to find {self.get_database()}:{self.get_table_name()} in Glue Catalogs...\")\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {self.get_database()}.{self.get_table_name()}\")\n\n def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=True)\n self.catalog_table_meta = _catalog_meta[self.get_database()].get(self.get_table_name())\n self.metadata_refresh_needed = False\n\n def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # We're we able to pull AWS Metadata for this table_name?\"\"\"\n if self.catalog_table_meta is None:\n self.log.debug(f\"AthenaSource {self.get_table_name()} not found in SageWorks Metadata...\")\n return False\n return True\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.get_database()}/{self.get_table_name()}\"\n return arn\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n self.log.info(f\"Retrieving SageWorks Metadata for Artifact: {self.uuid}...\")\n if self.catalog_table_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.get_table_name()}\")\n self.log.critical(\"Malformed Artifact! 
Delete this Artifact and recreate it!\")\n return {}\n\n # Check if we need to refresh our metadata\n if self.metadata_refresh_needed:\n self.refresh_meta()\n\n # Get the SageWorks Metadata from the Catalog Table Metadata\n return sageworks_meta_from_catalog_table_meta(self.catalog_table_meta)\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n self.metadata_refresh_needed = True\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.get_table_name()}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n else:\n raise e\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n\n def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.catalog_table_meta\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.catalog_table_meta[\"CreateTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.catalog_table_meta[\"UpdateTime\"]\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(\n f'select count(*) AS sageworks_count from \"{self.get_database()}\".\"{self.get_table_name()}\"'\n )\n return count_df[\"sageworks_count\"][0]\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.column_names())\n\n def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the 
AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n self.log.info(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n\n def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement in Athena.\"\"\"\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.get_database(),\n boto3_session=self.boto_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n except Exception as e:\n self.log.error(f\"Failed to execute statement: {e}\")\n raise\n\n def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.catalog_table_meta[\"StorageDescriptor\"][\"Location\"]\n\n def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f\"select count(*) as sageworks_count from {self.get_table_name()}\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n\n def sample_impl(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sample_rows.sample_rows(self)\n\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict_json = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict_json and not recompute:\n return stat_dict_json\n\n # Call the SQL function to compute descriptive stats\n stat_dict = descriptive_stats.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n\n def outliers_impl(self, scale: float = 1.5, use_stddev=False, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n recompute (bool): Recompute the outliers (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n 
\"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Note:\n smart = sample data + outliers for the DataSource\"\"\"\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n return all_rows\n\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = correlations.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = column_stats.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n 
value_count_dict = value_counts.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n\n # Check if we have cached version of the DataSource Details\n storage_key = f\"data_source:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f\"select * from {self.get_database()}.{self.get_table_name()} limit 10\"\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.get_database(), boto3_session=self.boto_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the Feature Group exists\n if not self.exists():\n self.log.warning(f\"Trying to delete a AthenaSource that doesn't exist: {self.get_table_name()}\")\n\n # Delete Data Catalog Table\n self.log.info(f\"Deleting DataCatalog Table: {self.get_database()}.{self.get_table_name()}...\")\n wr.catalog.delete_table_if_exists(self.get_database(), self.get_table_name(), boto3_session=self.boto_session)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make sure we add the trailing slash\n s3_path = self.s3_storage_location()\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n\n self.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=self.boto_session)\n except TypeError:\n self.log.warning(\"Malformed Artifact... good thing it's being deleted...\")\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"data_source:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.__init__","title":"__init__(data_uuid, database='sageworks', force_refresh=False)
","text":"AthenaSource Initialization
Parameters:
Name Type Description Defaultdata_uuid
str
Name of Athena Table
requireddatabase
str
Athena Database Name (default: sageworks)
'sageworks'
force_refresh
bool
Force refresh of AWS Metadata (default: False)
False
Source code in src/sageworks/core/artifacts/athena_source.py
def __init__(self, data_uuid, database=\"sageworks\", force_refresh: bool = False):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n force_refresh (bool): Force refresh of AWS Metadata (default: False)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.ensure_valid_name(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database)\n\n # Flag for metadata cache refresh logic\n self.metadata_refresh_needed = False\n\n # Setup our AWS Metadata Broker\n self.catalog_table_meta = self.meta_broker.data_source_details(\n data_uuid, self.get_database(), refresh=force_refresh\n )\n if self.catalog_table_meta is None:\n self.log.important(f\"Unable to find {self.get_database()}:{self.get_table_name()} in Glue Catalogs...\")\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {self.get_database()}.{self.get_table_name()}\")\n
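A minimal usage sketch (not from the source; the table name and import path are assumptions): given a table already registered in the sageworks Glue database, construction looks like this:
from sageworks.core.artifacts.athena_source import AthenaSource\n\n# \"abalone_data\" is a placeholder -- substitute a table from your own Glue catalog\nds = AthenaSource(\"abalone_data\")\nprint(ds.exists())\nprint(ds.num_rows(), ds.num_columns())\n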
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.get_database()}/{self.get_table_name()}\"\n return arn\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.athena_test_query","title":"athena_test_query()
","text":"Validate that Athena Queries are working
Source code insrc/sageworks/core/artifacts/athena_source.py
def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f\"select count(*) as sageworks_count from {self.get_table_name()}\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_meta","title":"aws_meta()
","text":"Get the FULL AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.catalog_table_meta\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code insrc/sageworks/core/artifacts/athena_source.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_names","title":"column_names()
","text":"Return the column names for this Athena Table
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in a DataSource
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the column stats (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of stats for each column in this format
NB
dict[dict]
String columns will NOT have num_zeros, descriptive_stats or correlation data {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}, 'correlations': {...}}, ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = column_stats.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n
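For example (a sketch, assuming an existing AthenaSource `ds`), the returned dictionary can be scanned for columns with missing values:
stats = ds.column_stats()\nfor column, info in stats.items():\n    if info[\"nulls\"] > 0:\n        print(f\"{column} ({info['dtype']}): {info['nulls']} nulls\")\n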
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_types","title":"column_types()
","text":"Return the column types of the internal AthenaSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.catalog_table_meta[\"StorageDescriptor\"][\"Columns\"]]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.correlations","title":"correlations(recompute=False)
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the column stats (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code insrc/sageworks/core/artifacts/athena_source.py
def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = correlations.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n
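A small sketch (assuming an existing AthenaSource `ds`): walk the nested dictionary to list strongly correlated column pairs.
corr = ds.correlations()\nfor col_a, partners in corr.items():\n    for col_b, value in partners.items():\n        if col_a < col_b and abs(value) > 0.8:  # report each pair only once\n            print(f\"{col_a} <-> {col_b}: {value:.2f}\")\n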
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/athena_source.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.catalog_table_meta[\"CreateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete","title":"delete()
","text":"Delete the AWS Data Catalog Table and S3 Storage Objects
Source code insrc/sageworks/core/artifacts/athena_source.py
def delete(self):\n \"\"\"Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the Feature Group exists\n if not self.exists():\n self.log.warning(f\"Trying to delete a AthenaSource that doesn't exist: {self.get_table_name()}\")\n\n # Delete Data Catalog Table\n self.log.info(f\"Deleting DataCatalog Table: {self.get_database()}.{self.get_table_name()}...\")\n wr.catalog.delete_table_if_exists(self.get_database(), self.get_table_name(), boto3_session=self.boto_session)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make sure we add the trailing slash\n s3_path = self.s3_storage_location()\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n\n self.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=self.boto_session)\n except TypeError:\n self.log.warning(\"Malformed Artifact... good thing it's being deleted...\")\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"data_source:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the descriptive stats (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict_json = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict_json and not recompute:\n return stat_dict_json\n\n # Call the SQL function to compute descriptive stats\n stat_dict = descriptive_stats.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n
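For example (a sketch, assuming an existing AthenaSource `ds`), the quartile summary can be printed per numeric column:
for column, stats in ds.descriptive_stats().items():\n    print(f\"{column}: min={stats['min']} median={stats['median']} max={stats['max']}\")\n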
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.details","title":"details(recompute=False)
","text":"Additional Details about this AthenaSource Artifact
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the details (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of details about this AthenaSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n\n # Check if we have cached version of the DataSource Details\n storage_key = f\"data_source:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f\"select * from {self.get_database()}.{self.get_table_name()} limit 10\"\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.get_database(), boto3_session=self.boto_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.execute_statement","title":"execute_statement(query)
","text":"Execute a non-returning SQL statement in Athena.
Source code insrc/sageworks/core/artifacts/athena_source.py
def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement in Athena.\"\"\"\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.get_database(),\n boto3_session=self.boto_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n except Exception as e:\n self.log.error(f\"Failed to execute statement: {e}\")\n raise\n
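A hedged sketch (the table name is hypothetical): use this for DDL/DML statements where no DataFrame result is expected.
# Drop a hypothetical scratch table; no DataFrame is returned\nds.execute_statement(\"DROP TABLE IF EXISTS scratch_results\")\n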
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.exists","title":"exists()
","text":"Validation Checks for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # Were we able to pull AWS Metadata for this table_name?\n if self.catalog_table_meta is None:\n self.log.debug(f\"AthenaSource {self.get_table_name()} not found in SageWorks Metadata...\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/athena_source.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.catalog_table_meta[\"UpdateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_columns","title":"num_columns()
","text":"Return the number of columns for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.column_names())\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_rows","title":"num_rows()
","text":"Return the number of rows for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(\n f'select count(*) AS sageworks_count from \"{self.get_database()}\".\"{self.get_table_name()}\"'\n )\n return count_df[\"sageworks_count\"][0]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.outliers_impl","title":"outliers_impl(scale=1.5, use_stddev=False, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource
Parameters:
Name Type Description Defaultscale
float
The scale to use for the IQR (default: 1.5)
1.5
use_stddev
bool
Use Standard Deviation instead of IQR (default: False)
False
recompute
bool
Recompute the outliers (default: False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame of outliers from this DataSource
NotesUses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma) The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/athena_source.py
def outliers_impl(self, scale: float = 1.5, use_stddev=False, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n recompute (bool): Recompute the outliers (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the AthenaSource
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/core/artifacts/athena_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.get_database(),\n ctas_approach=False,\n boto3_session=self.boto_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n self.log.info(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n
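Example (a sketch, assuming an existing AthenaSource `ds`): run an arbitrary read query and get a pandas DataFrame back.
df = ds.query(f'SELECT * FROM \"{ds.get_database()}\".\"{ds.get_table_name()}\" LIMIT 5')\nprint(df.head())\n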
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.refresh_meta","title":"refresh_meta()
","text":"Refresh our internal AWS Broker catalog metadata
Source code insrc/sageworks/core/artifacts/athena_source.py
def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.DATA_CATALOG, force_refresh=True)\n self.catalog_table_meta = _catalog_meta[self.get_database()].get(self.get_table_name())\n self.metadata_refresh_needed = False\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.s3_storage_location","title":"s3_storage_location()
","text":"Get the S3 Storage Location for this Data Source
Source code insrc/sageworks/core/artifacts/athena_source.py
def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.catalog_table_meta[\"StorageDescriptor\"][\"Location\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Source code insrc/sageworks/core/artifacts/athena_source.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n self.log.info(f\"Retrieving SageWorks Metadata for Artifact: {self.uuid}...\")\n if self.catalog_table_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.get_table_name()}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Check if we need to refresh our metadata\n if self.metadata_refresh_needed:\n self.refresh_meta()\n\n # Get the SageWorks Metadata from the Catalog Table Metadata\n return sageworks_meta_from_catalog_table_meta(self.catalog_table_meta)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sample_impl","title":"sample_impl()
","text":"Pull a sample of rows from the DataSource
Returns:
Type DescriptionDataFrame
pd.DataFrame: A sample DataFrame for an Athena DataSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def sample_impl(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sample_rows.sample_rows(self)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/athena_source.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.smart_sample","title":"smart_sample()
","text":"Get a smart sample dataframe for this DataSource
Notesmart = sample data + outliers for the DataSource
Source code insrc/sageworks/core/artifacts/athena_source.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Note:\n smart = sample data + outliers for the DataSource\"\"\"\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n return all_rows\n
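Sketch (assuming an existing AthenaSource `ds`): the combined DataFrame keeps the outlier_group column, so you can see how many rows came from the sample versus the outliers.
df = ds.smart_sample()\nprint(df[\"outlier_group\"].value_counts())\n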
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact
Parameters:
Name Type Description Defaultnew_meta
dict
Dictionary of new metadata to add
required Source code insrc/sageworks/core/artifacts/athena_source.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n self.metadata_refresh_needed = True\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.get_table_name()}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.get_database(),\n table=self.get_table_name(),\n boto3_session=self.boto_session,\n )\n else:\n raise e\n
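Sketch (the keys and values are hypothetical): keys should carry the sageworks_ prefix, and non-string values are serialized to JSON automatically.
# Hypothetical keys -- note the sageworks_ prefix; the list value is JSON-encoded automatically\nds.upsert_sageworks_meta({\"sageworks_owner\": \"data_team\", \"sageworks_custom_tags\": [\"demo\", \"athena\"]})\n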
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.value_counts","title":"value_counts(recompute=False)
","text":"Compute 'value_counts' for all the string columns in a DataSource
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the value counts (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of value counts for each column in the form {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/athena_source.py
def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = value_counts.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n
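Sketch (assuming an existing AthenaSource `ds`): print the top three values for each string column.
vc = ds.value_counts()\nfor column, counts in vc.items():\n    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]\n    print(column, top)\n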
"},{"location":"core_classes/artifacts/data_source_abstract/","title":"DataSource Abstract","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the DataSource API Class and voil\u00e0, it works the same.
The DataSource Abstract class is a base/abstract class that defines the API implemented by all the child classes (currently just AthenaSource; later RDSSource, FutureThing, etc.).
DataSourceAbstract: Abstract Base Class for all data sources (S3: CSV, JSONL, Parquet, RDS, etc)
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract","title":"DataSourceAbstract
","text":" Bases: Artifact
src/sageworks/core/artifacts/data_source_abstract.py
class DataSourceAbstract(Artifact):\n def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n self._display_columns = None\n\n def __post_init__(self):\n # Call superclass post_init\n super().__post_init__()\n\n def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n def get_table_name(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n\n @abstractmethod\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def column_names(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n\n def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details for this Data Source\n Args:\n view (str): The view to get column details for (default: \"all\")\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n names = self.column_names()\n types = self.column_types()\n if view == \"display\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_display_columns()}\n elif view == \"computation\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_computation_columns()}\n elif view == \"all\":\n return {name: type_ for name, type_ in zip(names, types)} # Return the full column details\n else:\n raise ValueError(f\"Unknown column details view: {view}\")\n\n def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this Data Source\n Returns:\n list[str]: The display columns for this Data Source\n \"\"\"\n # Check if we have the display columns in our metadata\n if self._display_columns is None:\n self._display_columns = self.sageworks_meta().get(\"sageworks_display_columns\")\n\n # If we still don't have display columns, try to set them\n if self._display_columns is None:\n # Exclude these automatically generated columns\n exclude_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"id\"]\n\n # We're going to remove any excluded columns from the display columns and limit to 30 total columns\n self._display_columns = [col for col in self.column_names() if col not in exclude_columns][:30]\n\n # Add the outlier_group column if it exists and isn't already in the display columns\n if \"outlier_group\" in self.column_names():\n self._display_columns = list(set(self._display_columns) + set([\"outlier_group\"]))\n\n # Set the display columns in the metadata\n self.set_display_columns(self._display_columns, onboard=False)\n\n # Return the display columns\n return self._display_columns\n\n def set_display_columns(self, display_columns: list[str], onboard: bool = True):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n display_columns (list[str]): The display columns for this Data Source\n onboard (bool): Onboard the Data Source after setting the 
display columns (default: True)\n \"\"\"\n self.log.important(f\"Setting Display Columns...{display_columns}\")\n self._display_columns = display_columns\n self.upsert_sageworks_meta({\"sageworks_display_columns\": self._display_columns})\n if onboard:\n self.onboard()\n\n def num_display_columns(self) -> int:\n \"\"\"Return the number of display columns for this Data Source\"\"\"\n return len(self._display_columns) if self._display_columns else 0\n\n def get_computation_columns(self) -> list[str]:\n return self.get_display_columns()\n\n def set_computation_columns(self, computation_columns: list[str]):\n self.set_display_columns(computation_columns)\n\n def num_computation_columns(self) -> int:\n return self.num_display_columns()\n\n @abstractmethod\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n @abstractmethod\n def execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSource\n Args:\n recompute (bool): Recompute the sample (default: False)\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n\n # Check if we have a cached sample of rows\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute a sample of data\n self.log.info(f\"Sampling {self.uuid}...\")\n df = self.sample_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n\n @abstractmethod\n def sample_impl(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n\n @abstractmethod\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Check if we have cached outliers\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute the outliers\n self.log.info(f\"Computing Outliers {self.uuid}...\")\n df = self.outliers_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n\n @abstractmethod\n def outliers_impl(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute 
(bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n\n @abstractmethod\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n\n @abstractmethod\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n\n def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"num_display_columns\"] = self.num_display_columns()\n details[\"column_details\"] = self.column_details()\n return details\n\n def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n\n def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # Check if the samples and outliers have been computed\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have sample() calling it...\")\n self.sample()\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have outliers() calling it...\")\n try:\n self.outliers()\n except KeyError:\n self.log.error(\"DataSource outliers() failed...recomputing columns stats and trying again...\")\n self.column_stats(recompute=True)\n self.refresh_meta()\n self.outliers()\n\n # Okay so we have the samples and outliers, so we are ready\n return True\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n self.sample(recompute=True)\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and 
value_counts\n self.outliers(recompute=True)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.__init__","title":"__init__(data_uuid, database='sageworks')
","text":"DataSourceAbstract: Abstract Base Class for all data sources Args: data_uuid(str): The UUID for this Data Source database(str): The database to use for this Data Source (default: sageworks)
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n self._display_columns = None\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_details","title":"column_details(view='all')
","text":"Return the column details for this Data Source Args: view (str): The view to get column details for (default: \"all\") Returns: dict: The column details for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details for this Data Source\n Args:\n view (str): The view to get column details for (default: \"all\")\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n names = self.column_names()\n types = self.column_types()\n if view == \"display\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_display_columns()}\n elif view == \"computation\":\n return {name: type_ for name, type_ in zip(names, types) if name in self.get_computation_columns()}\n elif view == \"all\":\n return {name: type_ for name, type_ in zip(names, types)} # Return the full column details\n else:\n raise ValueError(f\"Unknown column details view: {view}\")\n
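Sketch (assuming a concrete DataSource such as an AthenaSource `ds`): the view argument selects which subset of columns is described.
print(ds.column_details())                 # all columns\nprint(ds.column_details(view=\"display\"))  # only the display columns\n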
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_names","title":"column_names()
abstractmethod
","text":"Return the column names for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_names(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_stats","title":"column_stats(recompute=False)
abstractmethod
","text":"Compute Column Stats for all the columns in a DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_types","title":"column_types()
abstractmethod
","text":"Return the column types for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.descriptive_stats","title":"descriptive_stats(recompute=False)
abstractmethod
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: recompute (bool): Recompute the descriptive stats (default: False) Returns: dict(dict): A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.details","title":"details()
","text":"Additional Details about this DataSourceAbstract Artifact
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"num_display_columns\"] = self.num_display_columns()\n details[\"column_details\"] = self.column_details()\n return details\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.execute_statement","title":"execute_statement(query)
abstractmethod
","text":"Execute a non-returning SQL statement Args: query(str): The SQL query to execute
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef execute_statement(self, query: str):\n \"\"\"Execute a non-returning SQL statement\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.expected_meta","title":"expected_meta()
","text":"DataSources have quite a bit of expected Metadata for EDA displays
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_database","title":"get_database()
","text":"Get the database for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_display_columns","title":"get_display_columns()
","text":"Get the display columns for this Data Source Returns: list[str]: The display columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this Data Source\n Returns:\n list[str]: The display columns for this Data Source\n \"\"\"\n # Check if we have the display columns in our metadata\n if self._display_columns is None:\n self._display_columns = self.sageworks_meta().get(\"sageworks_display_columns\")\n\n # If we still don't have display columns, try to set them\n if self._display_columns is None:\n # Exclude these automatically generated columns\n exclude_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"id\"]\n\n # We're going to remove any excluded columns from the display columns and limit to 30 total columns\n self._display_columns = [col for col in self.column_names() if col not in exclude_columns][:30]\n\n # Add the outlier_group column if it exists and isn't already in the display columns\n if \"outlier_group\" in self.column_names():\n self._display_columns = list(set(self._display_columns) + set([\"outlier_group\"]))\n\n # Set the display columns in the metadata\n self.set_display_columns(self._display_columns, onboard=False)\n\n # Return the display columns\n return self._display_columns\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_table_name","title":"get_table_name()
","text":"Get the base table name for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def get_table_name(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_columns","title":"num_columns()
abstractmethod
","text":"Return the number of columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_display_columns","title":"num_display_columns()
","text":"Return the number of display columns for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def num_display_columns(self) -> int:\n \"\"\"Return the number of display columns for this Data Source\"\"\"\n return len(self._display_columns) if self._display_columns else 0\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_rows","title":"num_rows()
abstractmethod
","text":"Return the number of rows for this Data Source
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the data source (make it ready)
Returns:
Name Type Descriptionbool
bool
True if the DataSource was onboarded successfully
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n self.sample(recompute=True)\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.outliers(recompute=True)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Return a DataFrame of outliers from this DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Check if we have cached outliers\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute the outliers\n self.log.info(f\"Computing Outliers {self.uuid}...\")\n df = self.outliers_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers_impl","title":"outliers_impl(scale=1.5, recompute=False)
abstractmethod
","text":"Return a DataFrame of outliers from this DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef outliers_impl(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.query","title":"query(query)
abstractmethod
","text":"Query the DataSourceAbstract Args: query(str): The SQL query to execute
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.ready","title":"ready()
","text":"Is the DataSource ready?
Source code insrc/sageworks/core/artifacts/data_source_abstract.py
def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # Check if the samples and outliers have been computed\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have sample() calling it...\")\n self.sample()\n storage_key = f\"data_source:{self.uuid}:outliers\"\n if not self.data_storage.get(storage_key):\n self.log.important(f\"DataSource {self.uuid} doesn't have outliers() calling it...\")\n try:\n self.outliers()\n except KeyError:\n self.log.error(\"DataSource outliers() failed...recomputing columns stats and trying again...\")\n self.column_stats(recompute=True)\n self.refresh_meta()\n self.outliers()\n\n # Okay so we have the samples and outliers, so we are ready\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample","title":"sample(recompute=False)
","text":"Return a sample DataFrame from this DataSource Args: recompute (bool): Recompute the sample (default: False) Returns: pd.DataFrame: A sample DataFrame from this DataSource
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSource\n Args:\n recompute (bool): Recompute the sample (default: False)\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n\n # Check if we have a cached sample of rows\n storage_key = f\"data_source:{self.uuid}:sample\"\n if not recompute and self.data_storage.get(storage_key):\n return pd.read_json(StringIO(self.data_storage.get(storage_key)))\n\n # No Cache, so we have to compute a sample of data\n self.log.info(f\"Sampling {self.uuid}...\")\n df = self.sample_impl()\n self.data_storage.set(storage_key, df.to_json())\n return df\n
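A minimal sketch of using the cached sample (the data source name and import path are assumed):
from sageworks.api.data_source import DataSource  # import path assumed

ds = DataSource("abalone_data")        # hypothetical data source name
sample_df = ds.sample()                # returns the cached sample when available
fresh_df = ds.sample(recompute=True)   # forces sample_impl() to run again
print(sample_df.shape, fresh_df.shape)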
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample_impl","title":"sample_impl()
abstractmethod
","text":"Return a sample DataFrame from this DataSourceAbstract Returns: pd.DataFrame: A sample DataFrame from this DataSource
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef sample_impl(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_display_columns","title":"set_display_columns(display_columns, onboard=True)
","text":"Set the display columns for this Data Source
Parameters:
Name Type Description Default
display_columns list[str] The display columns for this Data Source required
onboard bool Onboard the Data Source after setting the display columns (default: True) True
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_display_columns(self, display_columns: list[str], onboard: bool = True):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n display_columns (list[str]): The display columns for this Data Source\n onboard (bool): Onboard the Data Source after setting the display columns (default: True)\n \"\"\"\n self.log.important(f\"Setting Display Columns...{display_columns}\")\n self._display_columns = display_columns\n self.upsert_sageworks_meta({\"sageworks_display_columns\": self._display_columns})\n if onboard:\n self.onboard()\n
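A usage sketch (data source name, columns, and import path are assumed; onboard=False skips the re-onboarding step shown in the source above):
from sageworks.api.data_source import DataSource  # import path assumed

ds = DataSource("abalone_data")  # hypothetical data source name
ds.set_display_columns(["length", "diameter", "rings"], onboard=False)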
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.smart_sample","title":"smart_sample()
abstractmethod
","text":"Get a SMART sample dataframe from this DataSource Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.value_counts","title":"value_counts(recompute=False)
abstractmethod
","text":"Compute 'value_counts' for all the string columns in a DataSource Args: recompute (bool): Recompute the value counts (default: False) Returns: dict(dict): A dictionary of value counts for each column in the form {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...}, 'col2': ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n
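A sketch of the nested dictionary that value_counts() returns (data source name, column names, and import path are assumed):
from sageworks.api.data_source import DataSource  # import path assumed

ds = DataSource("abalone_data")  # hypothetical data source name
counts = ds.value_counts()
# e.g. {'sex': {'M': 1528, 'I': 1342, 'F': 1307}, ...} -- one inner dict per string column
for column, value_dict in counts.items():
    print(column, value_dict)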
"},{"location":"core_classes/artifacts/endpoint_core/","title":"EndpointCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the Endpoint API Class and voil\u00e0, it works the same.
EndpointCore: SageWorks EndpointCore Class
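A hedged sketch of that pass-through (the Endpoint API class, its import path, and the endpoint name are assumptions; endpoint_metrics() itself is defined on EndpointCore below):
from sageworks.api.endpoint import Endpoint  # import path assumed

end = Endpoint("abalone-regression-end")  # hypothetical endpoint name
metrics_df = end.endpoint_metrics()       # core method reached via pass-through
print(metrics_df)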
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore","title":"EndpointCore
","text":" Bases: Artifact
EndpointCore: SageWorks EndpointCore Class
Common Usage: my_endpoint = EndpointCore(endpoint_uuid)\nprediction_df = my_endpoint.predict(test_df)\nmetrics = my_endpoint.regression_metrics(target_column, prediction_df)\nfor metric, value in metrics.items():\n    print(f\"{metric}: {value:0.3f}\")\n
Source code in src/sageworks/core/artifacts/endpoint_core.py
class EndpointCore(Artifact):\n \"\"\"EndpointCore: SageWorks EndpointCore Class\n\n Common Usage:\n ```\n my_endpoint = EndpointCore(endpoint_uuid)\n prediction_df = my_endpoint.predict(test_df)\n metrics = my_endpoint.regression_metrics(target_column, prediction_df)\n for metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid, force_refresh: bool = False, legacy: bool = False):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n if not legacy:\n self.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=force_refresh).get(\n self.endpoint_name\n )\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n self.endpoint_retry = 0\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=True).get(\n self.endpoint_name\n )\n\n def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n 
self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n\n def is_serverless(self):\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n\n def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n\n def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n\n def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n\n def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n # Check if we have cached version of the FeatureSet Details\n details_key = f\"endpoint:{self.uuid}:details\"\n\n cached_details = self.data_storage.get(details_key)\n if cached_details and not recompute:\n # Update the endpoint metrics before returning cached details\n endpoint_metrics = self.endpoint_metrics()\n cached_details[\"endpoint_metrics\"] = endpoint_metrics\n return cached_details\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if 
\"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add the underlying model details\n details[\"model_name\"] = self.model_name\n model_details = self.model_details()\n details[\"model_type\"] = model_details.get(\"model_type\", \"unknown\")\n details[\"model_metrics\"] = model_details.get(\"model_metrics\")\n details[\"confusion_matrix\"] = model_details.get(\"confusion_matrix\")\n details[\"predictions\"] = model_details.get(\"predictions\")\n details[\"inference_meta\"] = model_details.get(\"inference_meta\")\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Cache the details\n self.data_storage.set(details_key, details)\n\n # Return the details\n return details\n\n def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.error(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def model_details(self) -> dict:\n \"\"\"Return the details about the model used in this Endpoint\"\"\"\n if self.model_name == \"unknown\":\n return {}\n else:\n model = ModelCore(self.model_name)\n if model.exists():\n return model.details()\n else:\n return {}\n\n def model_type(self) -> str:\n \"\"\"Return the type of model used in this Endpoint\"\"\"\n return self.details().get(\"model_type\", \"unknown\")\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # This import needs to happen here (instead of top of file) to avoid circular imports\n from sageworks.utils.endpoint_utils import fs_evaluation_data\n\n eval_data_df = fs_evaluation_data(self)\n capture_uuid = \"training_holdout\" if capture else None\n return self.inference(eval_data_df, capture_uuid)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = 
None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n\n # Get the target column\n target_column = ModelCore(self.model_name).target()\n\n # Sanity Check that the target column is present\n if target_column not in prediction_df.columns:\n self.log.warning(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.warning(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = self.model_type()\n if model_type in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER.value:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # Unknown Model Type: Give log message and set metrics to empty dataframe\n self.log.warning(f\"Unknown Model Type: {model_type}\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(capture_uuid, prediction_df, target_column, metrics, description, id_column)\n\n # Return the prediction DataFrame\n return prediction_df\n\n def _predict(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Internal: Run prediction on the given observations in the given DataFrame\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n Returns:\n pd.DataFrame: Return the DataFrame with additional columns, prediction and any _proba columns\n \"\"\"\n\n # Make sure the eval_df has the features used to train the model\n features = ModelCore(self.model_name).features()\n if features and not set(features).issubset(eval_df.columns):\n raise ValueError(f\"DataFrame does not contain required features: {features}\")\n\n # Create our Endpoint Predictor Class\n predictor = Predictor(\n self.endpoint_name,\n sagemaker_session=self.sm_session,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n )\n\n # Now split up the dataframe into 500 row chunks, send those chunks to our\n # endpoint (with error handling) and stitch all the chunks back together\n df_list = []\n for index in range(0, len(eval_df), 500):\n print(\"Processing...\")\n\n # Compute partial DataFrames, add them to a list, and concatenate at the end\n partial_df = self._endpoint_error_handling(predictor, eval_df[index : index + 500])\n df_list.append(partial_df)\n\n # Concatenate the dataframes\n combined_df = pd.concat(df_list, ignore_index=True)\n\n # Convert data to numeric\n # Note: Since we're using CSV serializers numeric columns often get changed to generic 'object' types\n\n # Hard Conversion\n # Note: We 
explicitly catch exceptions for columns that cannot be converted to numeric\n converted_df = combined_df.copy()\n for column in combined_df.columns:\n try:\n converted_df[column] = pd.to_numeric(combined_df[column])\n except ValueError:\n # If a ValueError is raised, the column cannot be converted to numeric, so we keep it as is\n pass\n\n # Soft Conversion\n # Convert columns to the best possible dtype that supports the pd.NA missing value.\n converted_df = converted_df.convert_dtypes()\n\n # Return the Dataframe\n return converted_df\n\n def _endpoint_error_handling(self, predictor, feature_df):\n \"\"\"Internal: Method that handles Errors, Retries, and Binary Search for Error Row(s)\"\"\"\n\n # Convert the DataFrame into a CSV buffer\n csv_buffer = StringIO()\n feature_df.to_csv(csv_buffer, index=False)\n\n # Error Handling if the Endpoint gives back an error\n try:\n # Send the CSV Buffer to the predictor\n results = predictor.predict(csv_buffer.getvalue())\n\n # Construct a DataFrame from the results\n results_df = pd.DataFrame.from_records(results[1:], columns=results[0])\n\n # Capture the return columns\n self.endpoint_return_columns = results_df.columns.tolist()\n\n # Return the results dataframe\n return results_df\n\n except botocore.exceptions.ClientError as err:\n if err.response[\"Error\"][\"Code\"] == \"ModelError\": # Model Error\n # Report the error and raise an exception\n self.log.critical(f\"Endpoint prediction error: {err.response.get('Message')}\")\n raise err\n\n # Base case: DataFrame with 1 Row\n if len(feature_df) == 1:\n # If we don't have ANY known good results we're kinda screwed\n if not self.endpoint_return_columns:\n raise err\n\n # Construct an Error DataFrame (one row of NaNs in the return columns)\n results_df = self._error_df(feature_df, self.endpoint_return_columns)\n return results_df\n\n # Recurse on binary splits of the dataframe\n num_rows = len(feature_df)\n split = int(num_rows / 2)\n first_half = self._endpoint_error_handling(predictor, feature_df[0:split])\n second_half = self._endpoint_error_handling(predictor, feature_df[split:num_rows])\n return pd.concat([first_half, second_half], ignore_index=True)\n\n # Catch the botocore.errorfactory.ModelNotReadyException\n # Note: This is a SageMaker specific error that sometimes occurs\n # when the endpoint hasn't been used in a long time.\n except botocore.errorfactory.ModelNotReadyException as err:\n if self.endpoint_retry >= 3:\n raise err\n self.endpoint_retry += 1\n self.log.critical(f\"Endpoint model not ready: {err}\")\n self.log.critical(\"Waiting and Retrying...\")\n time.sleep(30)\n return self._endpoint_error_handling(predictor, feature_df)\n\n def _error_df(self, df, all_columns):\n \"\"\"Internal: Method to construct an Error DataFrame (a Pandas DataFrame with one row of NaNs)\"\"\"\n # Create a new dataframe with all NaNs\n error_df = pd.DataFrame(dict(zip(all_columns, [[np.NaN]] * len(self.endpoint_return_columns))))\n # Now set the original values for the incoming dataframe\n for column in df.columns:\n error_df[column] = df[column].values\n return error_df\n\n def _capture_inference_results(\n self,\n capture_uuid: str,\n pred_results_df: pd.DataFrame,\n target_column: str,\n metrics: pd.DataFrame,\n description: str,\n id_column: str = None,\n ):\n \"\"\"Internal: Capture the inference results and metrics to S3\n\n Args:\n capture_uuid (str): UUID of the inference capture\n pred_results_df (pd.DataFrame): DataFrame with the prediction results\n target_column (str): Name of the target 
column\n metrics (pd.DataFrame): DataFrame with the performance metrics\n description (str): Description of the inference results\n id_column (str, optional): Name of the ID column (default=None)\n \"\"\"\n\n # Compute a dataframe hash (just use the last 8)\n data_hash = joblib.hash(pred_results_df)[:8]\n\n # Metadata for the model inference\n inference_meta = {\n \"name\": capture_uuid,\n \"data_hash\": data_hash,\n \"num_rows\": len(pred_results_df),\n \"description\": description,\n }\n\n # Create the S3 Path for the Inference Capture\n inference_capture_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Write the metadata dictionary, and metrics to our S3 Model Inference Folder\n wr.s3.to_json(\n pd.DataFrame([inference_meta]),\n f\"{inference_capture_path}/inference_meta.json\",\n index=False,\n )\n self.log.info(f\"Writing metrics to {inference_capture_path}/inference_metrics.csv\")\n wr.s3.to_csv(metrics, f\"{inference_capture_path}/inference_metrics.csv\", index=False)\n\n # Grab the target column, prediction column, any _proba columns, and the ID column (if present)\n prediction_col = \"prediction\" if \"prediction\" in pred_results_df.columns else \"predictions\"\n output_columns = [target_column, prediction_col]\n\n # Add any _proba columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.endswith(\"_proba\")]\n\n # Add any quantile columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.startswith(\"q_\") or col.startswith(\"qr_\")]\n\n # Add the ID column\n if id_column and id_column in pred_results_df.columns:\n output_columns.append(id_column)\n\n # Write the predictions to our S3 Model Inference Folder\n self.log.info(f\"Writing predictions to {inference_capture_path}/inference_predictions.csv\")\n subset_df = pred_results_df[output_columns]\n wr.s3.to_csv(subset_df, f\"{inference_capture_path}/inference_predictions.csv\", index=False)\n\n # CLASSIFIER: Write the confusion matrix to our S3 Model Inference Folder\n model_type = self.model_type()\n if model_type == ModelType.CLASSIFIER.value:\n conf_mtx = self.confusion_matrix(target_column, pred_results_df)\n self.log.info(f\"Writing confusion matrix to {inference_capture_path}/inference_cm.csv\")\n # Note: Unlike other dataframes here, we want to write the index (labels) to the CSV\n wr.s3.to_csv(conf_mtx, f\"{inference_capture_path}/inference_cm.csv\", index=True)\n\n # Generate SHAP values for our Prediction Dataframe\n generate_shap_values(self.endpoint_name, model_type, pred_results_df, inference_capture_path)\n\n # Now recompute the details for our Model\n self.log.important(f\"Recomputing Details for {self.model_name} to show latest Inference Results...\")\n model = ModelCore(self.model_name)\n model._load_inference_metrics(capture_uuid)\n model.details(recompute=True)\n\n # Recompute the details so that inference model metrics are updated\n self.log.important(f\"Recomputing Details for {self.uuid} to show latest Inference Results...\")\n self.details(recompute=True)\n\n @staticmethod\n def regression_metrics(target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = 
\"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = root_mean_squared_error(y_true, y_pred)\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n\n def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n # Sanity Check that this is a regression model\n if self.model_type() not in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n self.log.warning(\"Residuals are only computed for regression models\")\n return prediction_df\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Add the residuals and the absolute values to the DataFrame\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n return prediction_df\n\n def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Get a list of unique labels\n labels = prediction_df[target_column].unique()\n\n # Calculate scores\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column], prediction_df[prediction_col], average=None, labels=labels\n )\n\n # Calculate ROC AUC\n # ROC-AUC score measures the model's ability to distinguish between classes;\n # - A value of 0.5 indicates no discrimination (equivalent to random guessing)\n # - A score close to 1 indicates high discriminative power\n\n # Sanity check for older versions that have a single column for probability\n if \"pred_proba\" in prediction_df.columns:\n self.log.error(\"Older version of prediction output detected, rerun inference...\")\n roc_auc = [0.0] * len(labels)\n\n # Convert probability columns to a 2D NumPy array\n else:\n proba_columns = [col for col in prediction_df.columns if col.endswith(\"_proba\")]\n y_score = prediction_df[proba_columns].to_numpy()\n\n # One-hot encode the true labels\n lb = LabelBinarizer()\n lb.fit(prediction_df[target_column])\n y_true = lb.transform(prediction_df[target_column])\n\n # Compute ROC AUC\n roc_auc = roc_auc_score(y_true, y_score, multi_class=\"ovr\", average=None)\n\n # Put the scores into a dataframe\n score_df = pd.DataFrame(\n {\n target_column: labels,\n \"precision\": scores[0],\n 
\"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n\n def confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Special case for low, medium, high classes\n if (set(y_true) | set(y_pred)) == {\"low\", \"medium\", \"high\"}:\n labels = [\"low\", \"medium\", \"high\"]\n else:\n labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix\n conf_mtx = confusion_matrix(y_true, y_pred, labels=labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=labels, columns=labels)\n conf_mtx_df.index.name = \"labels\"\n return conf_mtx_df\n\n def endpoint_config_name(self) -> str:\n # Grab the Endpoint Config Name from the AWS\n details = self.sm_client.describe_endpoint(EndpointName=self.endpoint_name)\n return details[\"EndpointConfigName\"]\n\n def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. Defaults to False.\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! 
It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n self.delete_endpoint_models()\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n try:\n self.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n\n # Check for any monitoring schedules\n response = self.sm_client.list_monitoring_schedules(EndpointName=self.uuid)\n monitoring_schedules = response[\"MonitoringScheduleSummaries\"]\n for schedule in monitoring_schedules:\n self.log.info(f\"Deleting Endpoint Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n self.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete any inference, data_capture or monitoring artifacts\n for s3_path in [self.endpoint_inference_path, self.endpoint_data_capture_path, self.endpoint_monitoring_path]:\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=self.boto_session)\n for obj in objects:\n self.log.info(f\"Deleting S3 Object {obj}...\")\n wr.s3.delete_objects(objects, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"endpoint:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Okay now delete the Endpoint\n try:\n time.sleep(2) # Let AWS catch up with any deletions performed above\n self.log.info(f\"Deleting Endpoint {self.uuid}...\")\n self.sm_client.delete_endpoint(EndpointName=self.uuid)\n except botocore.exceptions.ClientError as e:\n self.log.info(\"Endpoint ClientError...\")\n raise e\n\n # One more sleep to let AWS fully register the endpoint deletion\n time.sleep(5)\n\n def delete_endpoint_models(self):\n \"\"\"Delete the underlying Model for an Endpoint\"\"\"\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = self.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {self.uuid} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n self.log.info(f\"Deleting Model {model_name}...\")\n try:\n self.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n self.log.warning(f\"Model {model_name} is still in use...\")\n else:\n self.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.__init__","title":"__init__(endpoint_uuid, force_refresh=False, legacy=False)
","text":"EndpointCore Initialization
Parameters:
Name Type Description Default
endpoint_uuid str Name of Endpoint in SageWorks required
force_refresh bool Force a refresh of the AWS Broker. Defaults to False. False
legacy bool Force load of legacy models. Defaults to False. False
Source code in src/sageworks/core/artifacts/endpoint_core.py
def __init__(self, endpoint_uuid, force_refresh: bool = False, legacy: bool = False):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n if not legacy:\n self.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=force_refresh).get(\n self.endpoint_name\n )\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n self.endpoint_retry = 0\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.add_data_capture","title":"add_data_capture()
","text":"Add data capture to the endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the endpoint using FeatureSet data
Parameters:
Name Type Description Default
capture bool Capture the inference results and metrics (default=False) False
Source code in src/sageworks/core/artifacts/endpoint_core.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # This import needs to happen here (instead of top of file) to avoid circular imports\n from sageworks.utils.endpoint_utils import fs_evaluation_data\n\n eval_data_df = fs_evaluation_data(self)\n capture_uuid = \"training_holdout\" if capture else None\n return self.inference(eval_data_df, capture_uuid)\n
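A minimal usage sketch (the endpoint name is hypothetical; capture=True stores results under the 'training_holdout' capture UUID, as shown in the source above):
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")  # hypothetical endpoint name
pred_df = end.auto_inference(capture=True)    # pulls FeatureSet evaluation data and runs inference
print(pred_df.head())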
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.classification_metrics","title":"classification_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Get a list of unique labels\n labels = prediction_df[target_column].unique()\n\n # Calculate scores\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column], prediction_df[prediction_col], average=None, labels=labels\n )\n\n # Calculate ROC AUC\n # ROC-AUC score measures the model's ability to distinguish between classes;\n # - A value of 0.5 indicates no discrimination (equivalent to random guessing)\n # - A score close to 1 indicates high discriminative power\n\n # Sanity check for older versions that have a single column for probability\n if \"pred_proba\" in prediction_df.columns:\n self.log.error(\"Older version of prediction output detected, rerun inference...\")\n roc_auc = [0.0] * len(labels)\n\n # Convert probability columns to a 2D NumPy array\n else:\n proba_columns = [col for col in prediction_df.columns if col.endswith(\"_proba\")]\n y_score = prediction_df[proba_columns].to_numpy()\n\n # One-hot encode the true labels\n lb = LabelBinarizer()\n lb.fit(prediction_df[target_column])\n y_true = lb.transform(prediction_df[target_column])\n\n # Compute ROC AUC\n roc_auc = roc_auc_score(y_true, y_score, multi_class=\"ovr\", average=None)\n\n # Put the scores into a dataframe\n score_df = pd.DataFrame(\n {\n target_column: labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n
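A sketch of computing classification metrics (endpoint name, target column, and evaluation file are hypothetical; prediction_df must contain the target, prediction, and *_proba columns produced by inference):
import pandas as pd
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("wine-classification-end")  # hypothetical classifier endpoint
eval_df = pd.read_csv("eval_data.csv")         # hypothetical evaluation data (features + target)
prediction_df = end.inference(eval_df)         # adds prediction and *_proba columns
score_df = end.classification_metrics("wine_class", prediction_df)
print(score_df)  # precision / recall / fscore / roc_auc / support per label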
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.confusion_matrix","title":"confusion_matrix(target_column, prediction_df)
","text":"Compute the confusion matrix for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the confusion matrix
Source code in src/sageworks/core/artifacts/endpoint_core.py
def confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Special case for low, medium, high classes\n if (set(y_true) | set(y_pred)) == {\"low\", \"medium\", \"high\"}:\n labels = [\"low\", \"medium\", \"high\"]\n else:\n labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix\n conf_mtx = confusion_matrix(y_true, y_pred, labels=labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=labels, columns=labels)\n conf_mtx_df.index.name = \"labels\"\n return conf_mtx_df\n
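A short sketch for the confusion matrix (endpoint and target names are hypothetical; rows are true labels, columns are predicted labels):
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("wine-classification-end")  # hypothetical classifier endpoint
prediction_df = end.auto_inference()           # inference on the FeatureSet evaluation data
cm_df = end.confusion_matrix("wine_class", prediction_df)
print(cm_df)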
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/endpoint_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete","title":"delete()
","text":"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n self.delete_endpoint_models()\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n try:\n self.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n\n # Check for any monitoring schedules\n response = self.sm_client.list_monitoring_schedules(EndpointName=self.uuid)\n monitoring_schedules = response[\"MonitoringScheduleSummaries\"]\n for schedule in monitoring_schedules:\n self.log.info(f\"Deleting Endpoint Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n self.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete any inference, data_capture or monitoring artifacts\n for s3_path in [self.endpoint_inference_path, self.endpoint_data_capture_path, self.endpoint_monitoring_path]:\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=self.boto_session)\n for obj in objects:\n self.log.info(f\"Deleting S3 Object {obj}...\")\n wr.s3.delete_objects(objects, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"endpoint:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Okay now delete the Endpoint\n try:\n time.sleep(2) # Let AWS catch up with any deletions performed above\n self.log.info(f\"Deleting Endpoint {self.uuid}...\")\n self.sm_client.delete_endpoint(EndpointName=self.uuid)\n except botocore.exceptions.ClientError as e:\n self.log.info(\"Endpoint ClientError...\")\n raise e\n\n # One more sleep to let AWS fully register the endpoint deletion\n time.sleep(5)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete_endpoint_models","title":"delete_endpoint_models()
","text":"Delete the underlying Model for an Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def delete_endpoint_models(self):\n \"\"\"Delete the underlying Model for an Endpoint\"\"\"\n\n # Grab the Endpoint Config Name from the AWS\n endpoint_config_name = self.endpoint_config_name()\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = self.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n self.log.info(f\"Endpoint Config {self.uuid} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n self.log.info(f\"Deleting Model {model_name}...\")\n try:\n self.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n self.log.warning(f\"Model {model_name} is still in use...\")\n else:\n self.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.details","title":"details(recompute=False)
","text":"Additional Details about this Endpoint Args: recompute (bool): Recompute the details (default: False) Returns: dict(dict): A dictionary of details about this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n # Check if we have cached version of the FeatureSet Details\n details_key = f\"endpoint:{self.uuid}:details\"\n\n cached_details = self.data_storage.get(details_key)\n if cached_details and not recompute:\n # Update the endpoint metrics before returning cached details\n endpoint_metrics = self.endpoint_metrics()\n cached_details[\"endpoint_metrics\"] = endpoint_metrics\n return cached_details\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add the underlying model details\n details[\"model_name\"] = self.model_name\n model_details = self.model_details()\n details[\"model_type\"] = model_details.get(\"model_type\", \"unknown\")\n details[\"model_metrics\"] = model_details.get(\"model_metrics\")\n details[\"confusion_matrix\"] = model_details.get(\"confusion_matrix\")\n details[\"predictions\"] = model_details.get(\"predictions\")\n details[\"inference_meta\"] = model_details.get(\"inference_meta\")\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Cache the details\n self.data_storage.set(details_key, details)\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.endpoint_metrics","title":"endpoint_metrics()
","text":"Return the metrics for this endpoint
Returns:
Type Description
Union[DataFrame, None] pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)
Source code in src/sageworks/core/artifacts/endpoint_core.py
def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n
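A sketch of pulling the CloudWatch-backed metrics (endpoint name hypothetical; the method returns None when no ProductionVariants metadata is available):
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")  # hypothetical endpoint name
metrics = end.endpoint_metrics()
if metrics is not None:
    print("Invocations:", metrics["Invocations"].sum())
    print("5XX Errors:", metrics["Invocation5XXErrors"].sum())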
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/endpoint_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.get_monitor","title":"get_monitor()
","text":"Get the MonitorCore class for this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.health_check","title":"health_check()
","text":"Perform a health check on this model
Returns:
Type Description
list[str] List of health issues
Source code in src/sageworks/core/artifacts/endpoint_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n
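Usage is a one-liner; the returned list mirrors the tags used above ('5xx_errors', 'no_activity', etc.). The endpoint name here is hypothetical:
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")  # hypothetical endpoint name
issues = end.health_check()
print(issues or "No health issues")           # e.g. ['no_activity'] for an idle endpoint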
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference and compute performance metrics with optional capture
Parameters:
Name Type Description Default
eval_df DataFrame DataFrame to run predictions on (must have superset of features) required
capture_uuid str UUID of the inference capture (default=None) None
id_column str Name of the ID column (default=None) None
Returns:
Type Description
DataFrame pd.DataFrame: DataFrame with the inference results
Note: If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder
Source code in src/sageworks/core/artifacts/endpoint_core.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n\n # Get the target column\n target_column = ModelCore(self.model_name).target()\n\n # Sanity Check that the target column is present\n if target_column not in prediction_df.columns:\n self.log.warning(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.warning(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = self.model_type()\n if model_type in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER.value:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # Unknown Model Type: Give log message and set metrics to empty dataframe\n self.log.warning(f\"Unknown Model Type: {model_type}\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(capture_uuid, prediction_df, target_column, metrics, description, id_column)\n\n # Return the prediction DataFrame\n return prediction_df\n
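A sketch with capture enabled (all names are hypothetical; passing a capture_uuid writes the metrics and predictions to the endpoint's S3 inference folder, and id_column keeps an identifier column in the captured predictions):
import pandas as pd
from sageworks.core.artifacts.endpoint_core import EndpointCore

end = EndpointCore("abalone-regression-end")  # hypothetical endpoint name
eval_df = pd.read_csv("holdout.csv")          # hypothetical holdout data (features + target)
pred_df = end.inference(eval_df, capture_uuid="holdout_2024", id_column="sample_id")
print(pred_df[["sample_id", "prediction"]].head())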
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.is_serverless","title":"is_serverless()
","text":"Check if the current endpoint is serverless.
Returns:
Name Type Descriptionbool
True if the endpoint is serverless, False otherwise.
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def is_serverless(self):\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.model_details","title":"model_details()
","text":"Return the details about the model used in this Endpoint
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def model_details(self) -> dict:\n \"\"\"Return the details about the model used in this Endpoint\"\"\"\n if self.model_name == \"unknown\":\n return {}\n else:\n model = ModelCore(self.model_name)\n if model.exists():\n return model.details()\n else:\n return {}\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.model_type","title":"model_type()
","text":"Return the type of model used in this Endpoint
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def model_type(self) -> str:\n \"\"\"Return the type of model used in this Endpoint\"\"\"\n return self.details().get(\"model_type\", \"unknown\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard","title":"onboard(interactive=False)
","text":"This is a BLOCKING method that will onboard the Endpoint (make it ready) Args: interactive (bool, optional): If True, will prompt the user for information. (default: False) Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.error(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n # Note: onboard_with_args() runs the health check, refreshes the meta, and sets the status to \"ready\"\n return self.onboard_with_args(input_model)\n
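Illustrative onboarding call (blocking); this assumes the input model is already recorded in the AWS metadata, otherwise pass interactive=True to be prompted:
from sageworks.core.artifacts.endpoint_core import EndpointCore\n\nend = EndpointCore(\"my-regression-end\")  # placeholder endpoint name\nif not end.ready():\n end.onboard(interactive=False)\nprint(end.health_check())\n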
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard_with_args","title":"onboard_with_args(input_model)
","text":"Onboard the Endpoint with the given arguments
Parameters:
Name Type Description Defaultinput_model
str
The input model for this endpoint
requiredReturns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.aws_broker.get_metadata(ServiceCategory.ENDPOINTS, force_refresh=True).get(\n self.endpoint_name\n )\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.regression_metrics","title":"regression_metrics(target_column, prediction_df)
staticmethod
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code insrc/sageworks/core/artifacts/endpoint_core.py
@staticmethod\ndef regression_metrics(target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = root_mean_squared_error(y_true, y_pred)\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n
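Because regression_metrics() is a staticmethod, it can be exercised on any DataFrame that has the target column and a \"prediction\" column; a small self-contained sketch (the column names are illustrative):
import pandas as pd\nfrom sageworks.core.artifacts.endpoint_core import EndpointCore\n\n# Toy target/prediction pairs for illustration\ndf = pd.DataFrame({\"solubility\": [1.0, 2.0, 3.0, 4.0], \"prediction\": [1.1, 1.9, 3.2, 3.8]})\nmetrics = EndpointCore.regression_metrics(\"solubility\", df)\nprint(metrics) # One row: MAE, RMSE, R2, MAPE, MedAE, NumRows\n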
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.residuals","title":"residuals(target_column, prediction_df)
","text":"Add the residuals to the prediction DataFrame Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n # Sanity Check that this is a regression model\n if self.model_type() not in [ModelType.REGRESSOR.value, ModelType.QUANTILE_REGRESSOR.value]:\n self.log.warning(\"Residuals are only computed for regression models\")\n return prediction_df\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Add the residuals and the absolute values to the DataFrame\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n return prediction_df\n
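A short sketch of adding the residual columns to an existing prediction DataFrame (regression endpoints only); the endpoint and column names are placeholders:
import pandas as pd\nfrom sageworks.core.artifacts.endpoint_core import EndpointCore\n\n# pred_df as returned by inference(): target column plus a \"prediction\" column\npred_df = pd.DataFrame({\"solubility\": [1.0, 2.0, 3.0], \"prediction\": [1.2, 1.8, 3.1]})\nend = EndpointCore(\"my-regression-end\")\npred_df = end.residuals(\"solubility\", pred_df)\nprint(pred_df[[\"residuals\", \"residuals_abs\"]])\n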
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Defaultinput
str
Name of input for this artifact
requiredforce
bool
Force the input to be set. Defaults to False.
False
Note: Manual override of the input is not allowed for Endpoints unless force=True
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. Defaults to False.\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/feature_set_core/","title":"FeatureSetCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the FeatureSet API Class and voil\u00e0 it works the same.
FeatureSet: SageWorks Feature Set accessible through Athena
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore","title":"FeatureSetCore
","text":" Bases: Artifact
FeatureSetCore: SageWorks FeatureSetCore Class
Common Usagemy_features = FeatureSetCore(feature_uuid)\nmy_features.summary()\nmy_features.details()\n
Source code in src/sageworks/core/artifacts/feature_set_core.py
class FeatureSetCore(Artifact):\n \"\"\"FeatureSetCore: SageWorks FeatureSetCore Class\n\n Common Usage:\n ```\n my_features = FeatureSetCore(feature_uuid)\n my_features.summary()\n my_features.details()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, force_refresh: bool = False):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n force_refresh (bool): Force a refresh of the Feature Set metadata (default: False)\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.ensure_valid_name(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid)\n\n # Setup our AWS Broker catalog metadata\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=force_refresh)\n self.feature_meta = _catalog_meta.get(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.important(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.record_id = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_database = self.feature_meta[\"sageworks_meta\"].get(\"athena_database\")\n self.athena_table = self.feature_meta[\"sageworks_meta\"].get(\"athena_table\")\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}\")\n\n def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n\n def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n\n def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n\n def column_names(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of 
the Feature Set\"\"\"\n return list(self.column_details().values())\n\n def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Args:\n view (str): The view to get column details for (default: \"all\")\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't call just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details(view)\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n # Not going to use these for now\n \"\"\"\n internal = {\n \"write_time\": \"Timestamp\",\n \"api_invocation_time\": \"Timestamp\",\n \"is_deleted\": \"Boolean\",\n }\n details.update(internal)\n return details\n \"\"\"\n\n def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this FeatureSet\n\n Returns:\n list[str]: The display columns for this FeatureSet\n\n Notes:\n This just pulls the display columns from the underlying DataSource\n \"\"\"\n return self.data_source.get_display_columns()\n\n def set_display_columns(self, display_columns: list[str]):\n \"\"\"Set the display columns for this FeatureSet\n\n Args:\n display_columns (list[str]): The display columns for this FeatureSet\n\n Notes:\n This just sets the display columns for the underlying DataSource\n \"\"\"\n self.data_source.set_display_columns(display_columns)\n self.onboard()\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.column_names())\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n\n def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n return self.data_source.details().get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n\n def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n\n def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. 
This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n\n def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Get the training data query\n query = self.get_training_data_query()\n\n # Make the query\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n\n def get_training_data_query(self) -> str:\n \"\"\"Get the training data query for this FeatureSet\n\n Returns:\n str: The training data query for this FeatureSet\n \"\"\"\n\n # Do we have a training view?\n training_view = self.get_training_view_table()\n if training_view:\n self.log.important(f\"Pulling Data from Training View {training_view}...\")\n table_name = training_view\n else:\n self.log.warning(f\"No Training View found for {self.uuid}, using FeatureSet directly...\")\n table_name = self.athena_table\n\n # Make a query that gets all the data from the FeatureSet\n return f\"SELECT * FROM {table_name}\"\n\n def get_training_data(self, limit=50000) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Args:\n limit (int): The number of rows to limit the query to (default: 1000)\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n\n # Get the training data query (put a limit on it for now)\n query = self.get_training_data_query() + f\" LIMIT {limit}\"\n\n # Make the query\n return self.query(query)\n\n def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.record_id} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of 
details about this FeatureSet\n \"\"\"\n\n # Check if we have cached version of the FeatureSet Details\n storage_key = f\"feature_set:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n\n # Delete the Feature Group and ensure that it gets deleted\n self.log.important(f\"Deleting FeatureSet {self.uuid}...\")\n remove_fg = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session)\n remove_fg.delete()\n self.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n self.data_source.delete()\n\n # Delete the training view\n self.delete_training_view()\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = self.feature_sets_s3_path + f\"/{self.uuid}/\"\n self.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"feature_set:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Force a refresh of the AWS Metadata (to make sure references to deleted artifacts are gone)\n self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=True)\n\n def ensure_feature_group_deleted(self, feature_group):\n status = \"Deleting\"\n while status == \"Deleting\":\n self.log.debug(\"FeatureSet being Deleted...\")\n try:\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n break\n else:\n raise error\n time.sleep(1)\n self.log.info(f\"FeatureSet {feature_group.name} successfully deleted\")\n\n def create_default_training_view(self):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating default Training View {view_name}...\")\n\n # Do we already have a training 
column?\n if \"training\" in self.column_names():\n create_view_query = f\"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {self.athena_table}\"\n else:\n # No training column, so create one:\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n # using self.record_id as the stable identifier for row numbering\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {self.record_id}), 10) < 8 THEN 1 -- Assign 80% to training\n ELSE 0 -- Assign roughly 20% to validation\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n\n def create_training_view(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Create a view in Athena that marks hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating Training View {view_name}...\")\n\n # Format the list of hold out ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN 0\n ELSE 1\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n\n def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n self.create_training_view(id_column, holdout_ids)\n\n def get_holdout_ids(self, id_column: str) -> list[str]:\n \"\"\"Get the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n\n Returns:\n list[str]: The list of hold out ids.\n \"\"\"\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n query = f\"SELECT {id_column} FROM {training_view_table} WHERE training = 0\"\n holdout_ids = self.query(query)[id_column].tolist()\n return holdout_ids\n else:\n return []\n\n def get_training_view_table(self, create: bool = True) -> Union[str, None]:\n \"\"\"Get the name of the training view for this FeatureSet\n Args:\n create (bool): Create the training view if it doesn't exist (default=True)\n Returns:\n str: The name of the training view for this FeatureSet\n \"\"\"\n training_view_name = f\"{self.athena_table}_training\"\n glue_client = self.boto_session.client(\"glue\")\n try:\n glue_client.get_table(DatabaseName=self.athena_database, Name=training_view_name)\n return training_view_name\n except glue_client.exceptions.EntityNotFoundException:\n if not create:\n return None\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, creating one...\")\n self.create_default_training_view()\n time.sleep(1) # Give AWS a second to catch up\n return training_view_name\n\n def delete_training_view(self):\n \"\"\"Delete the training view for this FeatureSet\"\"\"\n try:\n training_view_table = 
self.get_training_view_table(create=False)\n if training_view_table is not None:\n self.log.info(f\"Deleting Training View {training_view_table} for {self.uuid}\")\n glue_client = self.boto_session.client(\"glue\")\n glue_client.delete_table(DatabaseName=self.athena_database, Name=training_view_table)\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, nothing to delete...\")\n pass\n else:\n raise error\n\n def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample()\n\n def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n\n def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n\n def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): 
Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n\n def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.__init__","title":"__init__(feature_set_uuid, force_refresh=False)
","text":"FeatureSetCore Initialization
Parameters:
Name Type Description Defaultfeature_set_uuid
str
Name of Feature Set
requiredforce_refresh
bool
Force a refresh of the Feature Set metadata (default: False)
False
Source code in src/sageworks/core/artifacts/feature_set_core.py
def __init__(self, feature_set_uuid: str, force_refresh: bool = False):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n force_refresh (bool): Force a refresh of the Feature Set metadata (default: False)\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.ensure_valid_name(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid)\n\n # Setup our AWS Broker catalog metadata\n _catalog_meta = self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=force_refresh)\n self.feature_meta = _catalog_meta.get(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.important(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.record_id = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_database = self.feature_meta[\"sageworks_meta\"].get(\"athena_database\")\n self.athena_table = self.feature_meta[\"sageworks_meta\"].get(\"athena_table\")\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.anomalies","title":"anomalies()
","text":"Get a set of anomalous data from the underlying DataSource Returns: pd.DataFrame: A dataframe of anomalies from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying the underlying data source
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n return self.data_source.details().get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_details","title":"column_details(view='all')
","text":"Return the column details of the Feature Set
Parameters:
Name Type Description Defaultview
str
The view to get column details for (default: \"all\")
'all'
Returns:
Name Type Descriptiondict
dict
The column details of the Feature Set
NotesWe can't just call self.data_source.column_details() because FeatureSets have different types, so we need to overlay that type information on top of the DataSource type information
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_details(self, view: str = \"all\") -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Args:\n view (str): The view to get column details for (default: \"all\")\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't call just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details(view)\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n # Not going to use these for now\n \"\"\"\n internal = {\n \"write_time\": \"Timestamp\",\n \"api_invocation_time\": \"Timestamp\",\n \"is_deleted\": \"Boolean\",\n }\n details.update(internal)\n return details\n \"\"\"\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_names","title":"column_names()
","text":"Return the column names of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_names(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in the FeatureSets underlying DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive_stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n
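A quick sketch of inspecting per-column stats on a FeatureSet (the FeatureSet name is a placeholder):
from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"my_feature_set\")\nfor col, info in fs.column_stats().items():\n print(col, info[\"fs_dtype\"], info.get(\"nulls\"))\n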
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_types","title":"column_types()
","text":"Return the column types of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def column_types(self) -> list[str]:\n \"\"\"Return the column types of the Feature Set\"\"\"\n return list(self.column_details().values())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.correlations","title":"correlations(recompute=False)
","text":"Get the correlations for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the value counts (default=False) Returns: dict: A dictionary of correlations for the numeric columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_default_training_view","title":"create_default_training_view()
","text":"Create a default view in Athena that assigns roughly 80% of the data to training
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_default_training_view(self):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating default Training View {view_name}...\")\n\n # Do we already have a training column?\n if \"training\" in self.column_names():\n create_view_query = f\"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {self.athena_table}\"\n else:\n # No training column, so create one:\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n # using self.record_id as the stable identifier for row numbering\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {self.record_id}), 10) < 8 THEN 1 -- Assign 80% to training\n ELSE 0 -- Assign roughly 20% to validation\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_s3_training_data","title":"create_s3_training_data()
","text":"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want additional options/features use the get_feature_store() method and see AWS docs for all the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html Returns: str: The full path/file for the CSV file created by Feature Store create_dataset()
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Get the training data query\n query = self.get_training_data_query()\n\n # Make the query\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n
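Illustrative call that materializes the training data as a CSV in S3 and returns the file path (FeatureSet name is a placeholder):
from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"my_feature_set\")\ns3_csv_path = fs.create_s3_training_data()\nprint(f\"Training CSV written to {s3_csv_path}\")\n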
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_training_view","title":"create_training_view(id_column, holdout_ids)
","text":"Create a view in Athena that marks hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredholdout_ids
list[str]
The list of hold out ids.
required Source code insrc/sageworks/core/artifacts/feature_set_core.py
def create_training_view(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Create a view in Athena that marks hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n\n # Create the view name\n view_name = f\"{self.athena_table}_training\"\n self.log.important(f\"Creating Training View {view_name}...\")\n\n # Format the list of hold out ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {view_name} AS\n SELECT *, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN 0\n ELSE 1\n END AS training\n FROM {self.athena_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n self.data_source.execute_statement(create_view_query)\n
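A hedged example of marking specific ids as holdout rows (training = 0); the id column and id values are assumptions for illustration:
from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"my_feature_set\")\nholdout = [\"id_001\", \"id_002\", \"id_003\"] # rows to exclude from training\nfs.create_training_view(\"compound_id\", holdout)\n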
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete","title":"delete()
","text":"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def delete(self):\n \"\"\"Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n\n # Delete the Feature Group and ensure that it gets deleted\n self.log.important(f\"Deleting FeatureSet {self.uuid}...\")\n remove_fg = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session)\n remove_fg.delete()\n self.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n self.data_source.delete()\n\n # Delete the training view\n self.delete_training_view()\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = self.feature_sets_s3_path + f\"/{self.uuid}/\"\n self.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Now delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"feature_set:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key: {key}\")\n self.data_storage.delete(key)\n\n # Force a refresh of the AWS Metadata (to make sure references to deleted artifacts are gone)\n self.aws_broker.get_metadata(ServiceCategory.FEATURE_STORE, force_refresh=True)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete_training_view","title":"delete_training_view()
","text":"Delete the training view for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def delete_training_view(self):\n \"\"\"Delete the training view for this FeatureSet\"\"\"\n try:\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n self.log.info(f\"Deleting Training View {training_view_table} for {self.uuid}\")\n glue_client = self.boto_session.client(\"glue\")\n glue_client.delete_table(DatabaseName=self.athena_database, Name=training_view_table)\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, nothing to delete...\")\n pass\n else:\n raise error\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Get the descriptive stats for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the descriptive stats (default=False) Returns: dict: A dictionary of descriptive stats for the numeric columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.details","title":"details(recompute=False)
","text":"Additional Details about this FeatureSet Artifact
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the details (default: False)
False
Returns:
Name Type Descriptiondict
dict
A dictionary of details about this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n # Check if we have cached version of the FeatureSet Details\n storage_key = f\"feature_set:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(f\"Recomputing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details data\n return details\n
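Sketch of pulling the (cached) details dictionary and forcing a recompute; the keys shown are the ones populated above:
from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"my_feature_set\")\ndetails = fs.details() # cached version if available\nprint(details[\"num_rows\"], details[\"num_columns\"])\ndetails = fs.details(recompute=True) # force a recompute\n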
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_data_source","title":"get_data_source()
","text":"Return the underlying DataSource object
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_display_columns","title":"get_display_columns()
","text":"Get the display columns for this FeatureSet
Returns:
Type Descriptionlist[str]
list[str]: The display columns for this FeatureSet
NotesThis just pulls the display columns from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_display_columns(self) -> list[str]:\n \"\"\"Get the display columns for this FeatureSet\n\n Returns:\n list[str]: The display columns for this FeatureSet\n\n Notes:\n This just pulls the display columns from the underlying DataSource\n \"\"\"\n return self.data_source.get_display_columns()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_feature_store","title":"get_feature_store()
","text":"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage with create_dataset() such as Joins and time ranges and a host of other options See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_holdout_ids","title":"get_holdout_ids(id_column)
","text":"Get the hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredReturns:
Type Descriptionlist[str]
list[str]: The list of hold out ids.
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_holdout_ids(self, id_column: str) -> list[str]:\n \"\"\"Get the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n\n Returns:\n list[str]: The list of hold out ids.\n \"\"\"\n training_view_table = self.get_training_view_table(create=False)\n if training_view_table is not None:\n query = f\"SELECT {id_column} FROM {training_view_table} WHERE training = 0\"\n holdout_ids = self.query(query)[id_column].tolist()\n return holdout_ids\n else:\n return []\n
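Example of retrieving the current holdout ids (returns an empty list when no training view exists); the id column name is an assumption:
from sageworks.core.artifacts.feature_set_core import FeatureSetCore\n\nfs = FeatureSetCore(\"my_feature_set\")\nholdout_ids = fs.get_holdout_ids(\"compound_id\")\nprint(f\"{len(holdout_ids)} rows held out of training\")\n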
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data","title":"get_training_data(limit=50000)
","text":"Get the training data for this FeatureSet
Parameters:
Name Type Description Defaultlimit
int
The number of rows to limit the query to (default: 50000)
50000
Returns: pd.DataFrame: The training data for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_data(self, limit=50000) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Args:\n limit (int): The number of rows to limit the query to (default: 50000)\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n\n # Get the training data query (put a limit on it for now)\n query = self.get_training_data_query() + f\" LIMIT {limit}\"\n\n # Make the query\n return self.query(query)\n
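A quick, hedged usage sketch (FeatureSet name hypothetical) showing how the limit argument caps the rows pulled back into pandas:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")       # hypothetical FeatureSet name
train_df = fs.get_training_data(limit=5000)   # appends LIMIT 5000 to the training query
print(train_df.shape)
```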
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data_query","title":"get_training_data_query()
","text":"Get the training data query for this FeatureSet
Returns:
Name Type Descriptionstr
str
The training data query for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_data_query(self) -> str:\n \"\"\"Get the training data query for this FeatureSet\n\n Returns:\n str: The training data query for this FeatureSet\n \"\"\"\n\n # Do we have a training view?\n training_view = self.get_training_view_table()\n if training_view:\n self.log.important(f\"Pulling Data from Training View {training_view}...\")\n table_name = training_view\n else:\n self.log.warning(f\"No Training View found for {self.uuid}, using FeatureSet directly...\")\n table_name = self.athena_table\n\n # Make a query that gets all the data from the FeatureSet\n return f\"SELECT * FROM {table_name}\"\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_view_table","title":"get_training_view_table(create=True)
","text":"Get the name of the training view for this FeatureSet Args: create (bool): Create the training view if it doesn't exist (default=True) Returns: str: The name of the training view for this FeatureSet
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def get_training_view_table(self, create: bool = True) -> Union[str, None]:\n \"\"\"Get the name of the training view for this FeatureSet\n Args:\n create (bool): Create the training view if it doesn't exist (default=True)\n Returns:\n str: The name of the training view for this FeatureSet\n \"\"\"\n training_view_name = f\"{self.athena_table}_training\"\n glue_client = self.boto_session.client(\"glue\")\n try:\n glue_client.get_table(DatabaseName=self.athena_database, Name=training_view_name)\n return training_view_name\n except glue_client.exceptions.EntityNotFoundException:\n if not create:\n return None\n self.log.warning(f\"Training View for {self.uuid} doesn't exist, creating one...\")\n self.create_default_training_view()\n time.sleep(1) # Give AWS a second to catch up\n return training_view_name\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.health_check","title":"health_check()
","text":"Perform a health check on this model
Returns:
Type Descriptionlist[str]
list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to do this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_columns","title":"num_columns()
","text":"Return the number of columns of the Feature Set
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.column_names())\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_rows","title":"num_rows()
","text":"Return the number of rows of the internal DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the FeatureSet (make it ready)
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
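Since onboard() blocks and returns a success flag, a call site typically just checks the result; a sketch with a hypothetical FeatureSet name:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")   # hypothetical FeatureSet name
if not fs.onboard():                      # blocking: also onboards the underlying DataSource
    raise RuntimeError(f"Onboarding failed for {fs.uuid}")
```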
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n
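A small sketch (FeatureSet name hypothetical) that widens the IQR fence beyond the 1.5 default:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")   # hypothetical FeatureSet name
outlier_df = fs.outliers(scale=2.0)       # larger scale = fewer, more extreme outliers
print(outlier_df.head())
```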
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.query","title":"query(query, overwrite=True)
","text":"Query the internal DataSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the DataSource
requiredoverwrite
bool
Overwrite the table name in the query (default: True)
True
Returns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n
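Note that the table-name rewrite only fires when the FeatureSet name appears surrounded by spaces in the SQL; a hedged sketch with hypothetical names:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")   # hypothetical FeatureSet name

# With overwrite=True (the default), ' abalone_features ' is swapped for the
# underlying Athena table name before the SQL reaches the DataSource
df = fs.query("SELECT * FROM abalone_features LIMIT 10")
print(df.head())
```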
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.ready","title":"ready()
","text":"Is the FeatureSet ready? Is initial setup complete and expected metadata populated? Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to check both to see if the FeatureSet is ready.
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.refresh_meta","title":"refresh_meta()
","text":"Internal: Refresh our internal AWS Feature Store metadata
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.sample","title":"sample(recompute=False)
","text":"Get a sample of the data from the underlying DataSource Args: recompute (bool): Recompute the sample (default=False) Returns: pd.DataFrame: A sample of the data from the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_display_columns","title":"set_display_columns(display_columns)
","text":"Set the display columns for this FeatureSet
Parameters:
Name Type Description Defaultdisplay_columns
list[str]
The display columns for this FeatureSet
required Notes: This just sets the display columns for the underlying DataSource
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def set_display_columns(self, display_columns: list[str]):\n \"\"\"Set the display columns for this FeatureSet\n\n Args:\n display_columns (list[str]): The display columns for this FeatureSet\n\n Notes:\n This just sets the display columns for the underlying DataSource\n \"\"\"\n self.data_source.set_display_columns(display_columns)\n self.onboard()\n
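Because set_display_columns() calls onboard() afterwards, it is a blocking operation; a sketch with hypothetical column names:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")                # hypothetical FeatureSet name
fs.set_display_columns(["id", "length", "diameter"])   # hypothetical columns; triggers onboard()
print(fs.get_display_columns())                        # read back from the underlying DataSource
```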
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_holdout_ids","title":"set_holdout_ids(id_column, holdout_ids)
","text":"Set the hold out ids for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column in the output DataFrame.
requiredholdout_ids
list[str]
The list of hold out ids.
required Source code insrc/sageworks/core/artifacts/feature_set_core.py
def set_holdout_ids(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for this FeatureSet\n\n Args:\n id_column (str): The name of the id column in the output DataFrame.\n holdout_ids (list[str]): The list of hold out ids.\n \"\"\"\n self.create_training_view(id_column, holdout_ids)\n
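Setting the hold-outs (re)creates the training view so those ids get training = 0; a sketch with hypothetical id values:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")        # hypothetical FeatureSet name
fs.set_holdout_ids("id", ["a-123", "a-456"])   # hypothetical id column and values
```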
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.size","title":"size()
","text":"Return the size of the internal DataSource in MegaBytes
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.smart_sample","title":"smart_sample()
","text":"Get a SMART sample dataframe from this FeatureSet Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.snapshot_query","title":"snapshot_query(table_name=None)
","text":"An Athena query to get the latest snapshot of features
Parameters:
Name Type Description Defaulttable_name
str
The name of the table to query (default: None)
None
Returns:
Name Type Descriptionstr
str
The Athena query to get the latest snapshot of features
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.record_id} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n
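A hedged sketch of generating the de-duplication SQL and running it through query(); using the FeatureSet's own athena_table attribute as the table name is an assumption about typical usage:

```python
from sageworks.core.artifacts.feature_set_core import FeatureSetCore

fs = FeatureSetCore("abalone_features")               # hypothetical FeatureSet name
sql = fs.snapshot_query(table_name=fs.athena_table)   # latest row per record_id, deleted rows dropped
latest_df = fs.query(sql)
```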
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.value_counts","title":"value_counts(recompute=False)
","text":"Get the value counts for the string columns of the underlying DataSource Args: recompute (bool): Recompute the value counts (default=False) Returns: dict: A dictionary of value counts for the string columns
Source code insrc/sageworks/core/artifacts/feature_set_core.py
def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n
"},{"location":"core_classes/artifacts/model_core/","title":"ModelCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the Model API Class and voil\u00e0, it works the same.
ModelCore: SageWorks ModelCore Class
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore","title":"ModelCore
","text":" Bases: Artifact
ModelCore: SageWorks ModelCore Class
Common Usagemy_model = ModelCore(model_uuid)\nmy_model.summary()\nmy_model.details()\n
Source code in src/sageworks/core/artifacts/model_core.py
class ModelCore(Artifact):\n \"\"\"ModelCore: SageWorks ModelCore Class\n\n Common Usage:\n ```\n my_model = ModelCore(model_uuid)\n my_model.summary()\n my_model.details()\n ```\n \"\"\"\n\n def __init__(\n self, model_uuid: str, force_refresh: bool = False, model_type: ModelType = None, legacy: bool = False\n ):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the model name is valid\n if not legacy:\n self.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(model_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Models\n self.model_name = model_uuid\n aws_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=force_refresh)\n self.model_meta = aws_meta.get(self.model_name)\n if self.model_meta is None:\n self.log.important(f\"Could not find model {self.model_name} within current visibility scope\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n else:\n try:\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. Delete and recreate it!\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=True).get(self.model_name)\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n\n def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.debug(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n if self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n return health_issues\n\n def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker 
Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.model_image()\n )\n\n def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference run UUIDs\n \"\"\"\n if self.endpoint_inference_path is None:\n return [\"model_training\"] # Just the training run\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the training to the front of the list\n inference_runs.insert(0, \"model_training\")\n return inference_runs\n\n def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"training_holdout\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.sageworks_meta().get(\"sageworks_inference_cm\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n 
return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.latest_model\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n\n def group_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.latest_model[\"ModelPackageGroupArn\"]\n\n def model_package_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)\"\"\"\n return self.latest_model[\"ModelPackageArn\"]\n\n def model_container_info(self) -> dict:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"ModelPackageDetails\"][\"InferenceSpecification\"][\"Containers\"][0]\n\n def model_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.model_container_info()[\"Image\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.latest_model[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.latest_model[\"CreationTime\"]\n\n def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n\n def get_endpoint_inference_path(self) -> str:\n \"\"\"Get the S3 Path for the Inference Data\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n return newest_files(endpoint_inference_paths, self.sm_session)\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n\n def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n 
self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n\n def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n\n def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n\n def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n\n def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n\n def set_class_labels(self, labels: list[str]):\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n\n def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n\n # Check if we have cached version of the Model Details\n storage_key = f\"model:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(\"Recomputing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n aws_meta = self.aws_meta()\n details[\"description\"] = aws_meta.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = aws_meta[\"ModelPackageVersion\"]\n details[\"status\"] = aws_meta[\"ModelPackageStatus\"]\n details[\"approval_status\"] = aws_meta[\"ModelApprovalStatus\"]\n details[\"image\"] = self.model_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n package_details = aws_meta[\"ModelPackageDetails\"]\n inference_spec = package_details[\"InferenceSpecification\"]\n container_info = self.model_container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = 
self.confusion_matrix()\n details[\"predictions\"] = None\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details\n return details\n\n # Pipeline for this model\n def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n\n def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n\n def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n\n def _determine_model_type(self):\n \"\"\"Internal: Determine the Model Type\"\"\"\n model_type = input(\"Model Type? (classifier, regressor, quantile_regressor, unsupervised, transformer): \")\n if model_type == \"classifier\":\n self._set_model_type(ModelType.CLASSIFIER)\n elif model_type == \"regressor\":\n self._set_model_type(ModelType.REGRESSOR)\n elif model_type == \"quantile_regressor\":\n self._set_model_type(ModelType.QUANTILE_REGRESSOR)\n elif model_type == \"unsupervised\":\n self._set_model_type(ModelType.UNSUPERVISED)\n elif model_type == \"transformer\":\n self._set_model_type(ModelType.TRANSFORMER)\n else:\n self.log.warning(f\"Unknown Model Type {model_type}!\")\n self._set_model_type(ModelType.UNKNOWN)\n\n def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? 
(use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n\n def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n ) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n self._load_inference_cm()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n\n # If we don't have meta then the model probably doesn't exist\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} doesn't appear to exist...\")\n return\n\n # First delete the Model Packages within the Model Group\n for model in self.model_meta:\n self.log.info(f\"Deleting Model Package {model['ModelPackageArn']}...\")\n self.sm_client.delete_model_package(ModelPackageName=model[\"ModelPackageArn\"])\n\n # Delete the Model Package Group\n self.log.info(f\"Deleting Model Group {self.model_name}...\")\n self.sm_client.delete_model_package_group(ModelPackageGroupName=self.model_name)\n\n # Delete any training artifacts\n s3_delete_path = f\"{self.model_training_path}/\"\n self.log.info(f\"Deleting Training S3 Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"model:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n\n def _set_model_type(self, model_type: ModelType):\n \"\"\"Internal: Set the Model Type for this Model\"\"\"\n self.model_type = model_type\n self.upsert_sageworks_meta({\"sageworks_model_type\": self.model_type.value})\n self.remove_health_tag(\"model_type_unknown\")\n\n def _get_model_type(self) -> ModelType:\n \"\"\"Internal: Query the SageWorks Metadata to get the model type\n Returns:\n ModelType: The ModelType of this Model\n Notes:\n This is an internal method that should not be called directly\n Use the model_type 
attribute instead\n \"\"\"\n model_type = self.sageworks_meta().get(\"sageworks_model_type\")\n try:\n return ModelType(model_type)\n except ValueError:\n self.log.warning(f\"Could not determine model type for {self.model_name}!\")\n return ModelType.UNKNOWN\n\n def _load_training_metrics(self):\n \"\"\"Internal: Retrieve the training metrics and Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Notes:\n This may or may not exist based on whether we have access to TrainingJobAnalytics\n \"\"\"\n try:\n df = TrainingJobAnalytics(training_job_name=self.training_job_name).dataframe()\n if df.empty:\n self.log.warning(f\"No training job metrics found for {self.training_job_name}\")\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n if \"timestamp\" in df.columns:\n df = df.drop(columns=[\"timestamp\"])\n\n # We're going to pivot the DataFrame to get the desired structure\n reg_metrics_df = df.set_index(\"metric_name\").T\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": reg_metrics_df.to_dict(), \"sageworks_training_cm\": None}\n )\n return\n\n except (KeyError, botocore.exceptions.ClientError):\n self.log.warning(f\"No training job metrics found for {self.training_job_name}\")\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n\n # We need additional processing for classification metrics\n if self.model_type == ModelType.CLASSIFIER:\n metrics_df, cm_df = self._process_classification_metrics(df)\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": metrics_df.to_dict(), \"sageworks_training_cm\": cm_df.to_dict()}\n )\n\n def _load_inference_metrics(self, capture_uuid: str = \"training_holdout\"):\n \"\"\"Internal: Retrieve the inference model metrics for this model\n and load the data into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n inference_metrics = pull_s3_data(s3_path)\n\n # Store data into the SageWorks Metadata\n metrics_storage = None if inference_metrics is None else inference_metrics.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_metrics\": metrics_storage})\n\n def _load_inference_cm(self, capture_uuid: str = \"training_holdout\"):\n \"\"\"Internal: Pull the inference Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n pd.DataFrame: DataFrame of the inference Confusion Matrix (might be None)\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n inference_cm = pull_s3_data(s3_path, embedded_index=True)\n\n # Store data into the SageWorks Metadata\n cm_storage = None if inference_cm is None else inference_cm.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_cm\": cm_storage})\n\n def 
get_inference_metadata(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n\n def get_inference_predictions(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n\n def _get_validation_predictions(self) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Retrieve the captured prediction results for this model\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Validation Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing Validation Predictions for {self.model_name}...\")\n s3_path = f\"{self.model_training_path}/validation_predictions.csv\"\n df = pull_s3_data(s3_path)\n return df\n\n def _extract_training_job_name(self) -> Union[str, None]:\n \"\"\"Internal: Extract the training job name from the ModelDataUrl\"\"\"\n try:\n model_data_url = self.model_container_info()[\"ModelDataUrl\"]\n parsed_url = urllib.parse.urlparse(model_data_url)\n training_job_name = parsed_url.path.lstrip(\"/\").split(\"/\")[0]\n return training_job_name\n except KeyError:\n self.log.warning(f\"Could not extract training job name from {model_data_url}\")\n return None\n\n @staticmethod\n def _process_classification_metrics(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"Internal: Process classification metrics into a more reasonable format\n Args:\n df (pd.DataFrame): DataFrame of training metrics\n Returns:\n (pd.DataFrame, pd.DataFrame): Tuple of DataFrames. 
Metrics and confusion matrix\n \"\"\"\n # Split into two DataFrames based on 'metric_name'\n metrics_df = df[df[\"metric_name\"].str.startswith(\"Metrics:\")].copy()\n cm_df = df[df[\"metric_name\"].str.startswith(\"ConfusionMatrix:\")].copy()\n\n # Split the 'metric_name' into different parts\n metrics_df[\"class\"] = metrics_df[\"metric_name\"].str.split(\":\").str[1]\n metrics_df[\"metric_type\"] = metrics_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to get the desired structure\n metrics_df = metrics_df.pivot(index=\"class\", columns=\"metric_type\", values=\"value\").reset_index()\n metrics_df = metrics_df.rename_axis(None, axis=1)\n\n # Now process the confusion matrix\n cm_df[\"row_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[1]\n cm_df[\"col_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to create a form suitable for the heatmap\n cm_df = cm_df.pivot(index=\"row_class\", columns=\"col_class\", values=\"value\")\n\n # Convert the values in cm_df to integers\n cm_df = cm_df.astype(int)\n\n return metrics_df, cm_df\n\n def shapley_values(self, capture_uuid: str = \"training_holdout\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapely values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: Dataframe of the shapley values for the prediction dataframe\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.__init__","title":"__init__(model_uuid, force_refresh=False, model_type=None, legacy=False)
","text":"ModelCore Initialization Args: model_uuid (str): Name of Model in SageWorks. force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False. model_type (ModelType, optional): Set this for newly created Models. Defaults to None. legacy (bool, optional): Force load of legacy models. Defaults to False.
Source code insrc/sageworks/core/artifacts/model_core.py
def __init__(\n self, model_uuid: str, force_refresh: bool = False, model_type: ModelType = None, legacy: bool = False\n):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n force_refresh (bool, optional): Force a refresh of the AWS Broker. Defaults to False.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n legacy (bool, optional): Force load of legacy models. Defaults to False.\n \"\"\"\n\n # Make sure the model name is valid\n if not legacy:\n self.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call SuperClass Initialization\n super().__init__(model_uuid)\n\n # Grab an AWS Metadata Broker object and pull information for Models\n self.model_name = model_uuid\n aws_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=force_refresh)\n self.model_meta = aws_meta.get(self.model_name)\n if self.model_meta is None:\n self.log.important(f\"Could not find model {self.model_name} within current visibility scope\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n else:\n try:\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. Delete and recreate it!\")\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n
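For a freshly registered model you can pass model_type up front so SageWorks doesn't have to infer (or ask for) it later; the model name is hypothetical and the import location of ModelType is an assumption:

```python
from sageworks.core.artifacts.model_core import ModelCore, ModelType  # ModelType import path assumed

model = ModelCore("abalone-regression", model_type=ModelType.REGRESSOR)  # hypothetical model name
print(model.exists(), model.model_type)
```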
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code insrc/sageworks/core/artifacts/model_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/model_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.latest_model\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code insrc/sageworks/core/artifacts/model_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.class_labels","title":"class_labels()
","text":"Return the class labels for this Model (if it's a classifier)
Returns:
Type DescriptionUnion[list[str], None]
list[str]: List of class labels
Source code insrc/sageworks/core/artifacts/model_core.py
def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion_matrix for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid or \"training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Confusion Matrix (might be None)
Source code insrc/sageworks/core/artifacts/model_core.py
def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.sageworks_meta().get(\"sageworks_inference_cm\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n
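A minimal sketch (classifier name hypothetical); with the default "latest" the inference confusion matrix is tried first, then the one captured at training time:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("wine-classifier")   # hypothetical classifier model name
cm = model.confusion_matrix()          # inference CM if present, else training CM, else None
if cm is not None:
    print(cm)
```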
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/model_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete","title":"delete()
","text":"Delete the Model Packages and the Model Group
Source code insrc/sageworks/core/artifacts/model_core.py
def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n\n # If we don't have meta then the model probably doesn't exist\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} doesn't appear to exist...\")\n return\n\n # First delete the Model Packages within the Model Group\n for model in self.model_meta:\n self.log.info(f\"Deleting Model Package {model['ModelPackageArn']}...\")\n self.sm_client.delete_model_package(ModelPackageName=model[\"ModelPackageArn\"])\n\n # Delete the Model Package Group\n self.log.info(f\"Deleting Model Group {self.model_name}...\")\n self.sm_client.delete_model_package_group(ModelPackageGroupName=self.model_name)\n\n # Delete any training artifacts\n s3_delete_path = f\"{self.model_training_path}/\"\n self.log.info(f\"Deleting Training S3 Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=self.boto_session)\n\n # Delete any data in the Cache\n for key in self.data_storage.list_subkeys(f\"model:{self.uuid}:\"):\n self.log.info(f\"Deleting Cache Key {key}...\")\n self.data_storage.delete(key)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.details","title":"details(recompute=False)
","text":"Additional Details about this Model Args: recompute (bool, optional): Recompute the details (default: False) Returns: dict: Dictionary of details about this Model
Source code insrc/sageworks/core/artifacts/model_core.py
def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n\n # Check if we have cached version of the Model Details\n storage_key = f\"model:{self.uuid}:details\"\n cached_details = self.data_storage.get(storage_key)\n if cached_details and not recompute:\n return cached_details\n\n self.log.info(\"Recomputing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n aws_meta = self.aws_meta()\n details[\"description\"] = aws_meta.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = aws_meta[\"ModelPackageVersion\"]\n details[\"status\"] = aws_meta[\"ModelPackageStatus\"]\n details[\"approval_status\"] = aws_meta[\"ModelApprovalStatus\"]\n details[\"image\"] = self.model_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n package_details = aws_meta[\"ModelPackageDetails\"]\n inference_spec = package_details[\"InferenceSpecification\"]\n container_info = self.model_container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Cache the details\n self.data_storage.set(storage_key, details)\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.endpoints","title":"endpoints()
","text":"Get the list of registered endpoints for this Model
Returns:
Type Descriptionlist[str]
list[str]: List of registered endpoints
Source code insrc/sageworks/core/artifacts/model_core.py
def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.exists","title":"exists()
","text":"Does the model metadata exist in the AWS Metadata?
Source code insrc/sageworks/core/artifacts/model_core.py
def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.debug(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Model when it's ready Returns: list[str]: List of expected metadata keys
Source code insrc/sageworks/core/artifacts/model_core.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.features","title":"features()
","text":"Return a list of features used for this Model
Returns:
Type DescriptionUnion[list[str], None]
list[str]: List of features used for this Model
Source code insrc/sageworks/core/artifacts/model_core.py
def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Get the S3 Path for the Inference Data
Source code insrc/sageworks/core/artifacts/model_core.py
def get_endpoint_inference_path(self) -> str:\n \"\"\"Get the S3 Path for the Inference Data\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n return newest_files(endpoint_inference_paths, self.sm_session)\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metadata","title":"get_inference_metadata(capture_uuid='training_holdout')
","text":"Retrieve the inference metadata for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
A specific capture_uuid (default: \"training_holdout\")
'training_holdout'
Returns:
Name Type DescriptionDataFrame
Union[DataFrame, None]
DataFrame of the inference metadata (might be None)
Notes: Metadata captured when Endpoint inference was run: the name of the dataset, its MD5 hash, etc.
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_metadata(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"training_holdout\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the inference performance metrics for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid or \"training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Model Metrics
NoteIf a capture_uuid isn't specified, this returns the "training_holdout" metrics when available and otherwise falls back to the metrics captured during model training
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"training_holdout\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n
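A sketch of the fallback behavior described above; the model group name is hypothetical and constructing ModelCore from a name is assumed:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("abalone-regression")           # hypothetical model group name
metrics = model.get_inference_metrics()           # "latest" tries "training_holdout", then "model_training"
training_metrics = model.get_inference_metrics("model_training")
if metrics is not None:
    print(metrics)
```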
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_predictions","title":"get_inference_predictions(capture_uuid='training_holdout')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: training_holdout)
'training_holdout'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Captured Predictions (might be None)
Source code insrc/sageworks/core/artifacts/model_core.py
def get_inference_predictions(self, capture_uuid: str = \"training_holdout\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n
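A sketch of grabbing captured predictions; the model group name is hypothetical and constructing ModelCore from a name is assumed:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("abalone-regression")                        # hypothetical model group name
preds = model.get_inference_predictions()                      # default "training_holdout" capture
training_preds = model.get_inference_predictions("model_training")
if preds is not None:
    print(preds.head())
```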
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_pipeline","title":"get_pipeline()
","text":"Get the pipeline for this model
Source code insrc/sageworks/core/artifacts/model_core.py
def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.group_arn","title":"group_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code insrc/sageworks/core/artifacts/model_core.py
def group_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.latest_model[\"ModelPackageGroupArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.health_check","title":"health_check()
","text":"Perform a health check on this model Returns: list[str]: List of health issues
Source code insrc/sageworks/core/artifacts/model_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n if self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n return health_issues\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.is_model_unknown","title":"is_model_unknown()
","text":"Is the Model Type unknown?
Source code insrc/sageworks/core/artifacts/model_core.py
def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.latest_model_object","title":"latest_model_object()
","text":"Return the latest AWS Sagemaker Model object for this SageWorks Model
Returns:
Type DescriptionModel
sagemaker.model.Model: AWS Sagemaker Model object
Source code insrc/sageworks/core/artifacts/model_core.py
def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.model_image()\n )\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.list_inference_runs","title":"list_inference_runs()
","text":"List the inference runs for this model
Returns:
Type Descriptionlist[str]
list[str]: List of inference run UUIDs
Source code insrc/sageworks/core/artifacts/model_core.py
def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference run UUIDs\n \"\"\"\n if self.endpoint_inference_path is None:\n return [\"model_training\"] # Just the training run\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the training to the front of the list\n inference_runs.insert(0, \"model_training\")\n return inference_runs\n
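Listing the runs pairs naturally with the metrics getter above; a sketch with a hypothetical model group name:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("abalone-regression")    # hypothetical model group name
for run in model.list_inference_runs():    # "model_training" is always first in the list
    metrics = model.get_inference_metrics(run)
    print(run, "metrics found" if metrics is not None else "no metrics")
```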
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_container_info","title":"model_container_info()
","text":"Container Info for the Latest Model Package
Source code insrc/sageworks/core/artifacts/model_core.py
def model_container_info(self) -> dict:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"ModelPackageDetails\"][\"InferenceSpecification\"][\"Containers\"][0]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_image","title":"model_image()
","text":"Container Image for the Latest Model Package
Source code insrc/sageworks/core/artifacts/model_core.py
def model_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.model_container_info()[\"Image\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_package_arn","title":"model_package_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)
Source code insrc/sageworks/core/artifacts/model_core.py
def model_package_arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package (within the Group)\"\"\"\n return self.latest_model[\"ModelPackageArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code insrc/sageworks/core/artifacts/model_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard","title":"onboard(ask_everything=False)
","text":"This is an interactive method that will onboard the Model (make it ready)
Parameters:
Name Type Description Defaultask_everything
bool
Ask for all the details. Defaults to False.
False
Returns:
Name Type Descriptionbool
bool
True if the Model is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/model_core.py
def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? (use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n
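Onboarding is interactive, so it's typically run from a REPL or a small script; a sketch with a hypothetical model group name:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("abalone-regression")  # hypothetical model group name
model.onboard()                          # prompts only for details that are missing
model.onboard(ask_everything=True)       # prompts for every detail, even ones already set
```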
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard_with_args","title":"onboard_with_args(model_type, target_column=None, feature_list=None, endpoints=None, owner=None)
","text":"Onboard the Model with the given arguments
Parameters:
Name Type Description Defaultmodel_type
ModelType
Model Type
requiredtarget_column
str
Target Column
None
feature_list
list
List of Feature Columns
None
endpoints
list
List of Endpoints. Defaults to None.
None
owner
str
Model Owner. Defaults to None.
None
Returns: bool: True if the Model is successfully onboarded, False otherwise
Source code insrc/sageworks/core/artifacts/model_core.py
def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n self._load_inference_cm()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
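A non-interactive sketch using onboard_with_args(); the column names, endpoint name, and owner below are hypothetical:

```python
from sageworks.core.artifacts.model_core import ModelCore, ModelType

model = ModelCore("abalone-regression")  # hypothetical model group name
model.onboard_with_args(
    model_type=ModelType.REGRESSOR,
    target_column="class_number_of_rings",          # hypothetical target column
    feature_list=["length", "diameter", "height"],  # hypothetical feature columns
    endpoints=["abalone-regression-end"],           # hypothetical endpoint name
    owner="data-science-team",                      # hypothetical owner
)
```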
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code insrc/sageworks/core/artifacts/model_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.aws_broker.get_metadata(ServiceCategory.MODELS, force_refresh=True).get(self.model_name)\n self.latest_model = self.model_meta[0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.register_endpoint","title":"register_endpoint(endpoint_name)
","text":"Add this endpoint to the set of registered endpoints for the model
Parameters:
Name Type Description Defaultendpoint_name
str
Name of the endpoint
required Source code insrc/sageworks/core/artifacts/model_core.py
def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n
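Registering an endpoint and then reading back the registered list; the names are hypothetical:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("abalone-regression")             # hypothetical model group name
model.register_endpoint("abalone-regression-end")   # hypothetical endpoint name
print(model.endpoints())                            # list of registered endpoints
```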
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_class_labels","title":"set_class_labels(labels)
","text":"Return the class labels for this Model (if it's a classifier)
Parameters:
Name Type Description Defaultlabels
list[str]
List of class labels
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_class_labels(self, labels: list[str]):\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_features","title":"set_features(feature_columns)
","text":"Set the features for this Model
Parameters:
Name Type Description Defaultfeature_columns
list[str]
List of feature columns
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Defaultinput
str
Name of input for this artifact
requiredforce
bool
Force the input to be set (default: False)
False
Note: Manual override of the input is not allowed for Models unless force=True
Source code insrc/sageworks/core/artifacts/model_core.py
def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_pipeline","title":"set_pipeline(pipeline)
","text":"Set the pipeline for this model
Parameters:
Name Type Description Defaultpipeline
str
Pipeline that was used to create this model
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_target","title":"set_target(target_column)
","text":"Set the target for this Model
Parameters:
Name Type Description Defaulttarget_column
str
Target column for this Model
required Source code insrc/sageworks/core/artifacts/model_core.py
def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.shapley_values","title":"shapley_values(capture_uuid='training_holdout')
","text":"Retrieve the Shapely values for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: training_holdout)
'training_holdout'
Returns:
Type DescriptionUnion[list[DataFrame], DataFrame, None]
pd.DataFrame: DataFrame of the Shapley values for the prediction dataframe
NotesThis may or may not exist based on whether an Endpoint ran Shapley
Source code insrc/sageworks/core/artifacts/model_core.py
def shapley_values(self, capture_uuid: str = \"training_holdout\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapely values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: Dataframe of the shapley values for the prediction dataframe\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
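Because the return type depends on the model type (a list of DataFrames for classifiers, a single DataFrame for regressors), callers usually branch on it; a sketch with a hypothetical model group name:

```python
from sageworks.core.artifacts.model_core import ModelCore

model = ModelCore("wine-classification")  # hypothetical model group name
shap = model.shapley_values()
if shap is None:
    print("No Shapley values found (no registered endpoint ran Shapley)")
elif isinstance(shap, list):
    print(f"Classifier: {len(shap)} per-class Shapley DataFrames")
else:
    print(f"Regressor: one Shapley DataFrame with {len(shap)} rows")
```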
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/model_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.target","title":"target()
","text":"Return the target for this Model (if supervised, else None)
Returns:
Name Type Descriptionstr
Union[str, None]
Target column for this Model (if supervised, else None)
Source code insrc/sageworks/core/artifacts/model_core.py
def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelType","title":"ModelType
","text":" Bases: Enum
Enumerated Types for SageWorks Model Types
Source code insrc/sageworks/core/artifacts/model_core.py
class ModelType(Enum):\n \"\"\"Enumerated Types for SageWorks Model Types\"\"\"\n\n CLASSIFIER = \"classifier\"\n REGRESSOR = \"regressor\"\n CLUSTERER = \"clusterer\"\n TRANSFORMER = \"transformer\"\n QUANTILE_REGRESSOR = \"quantile_regressor\"\n UNKNOWN = \"unknown\"\n
"},{"location":"core_classes/artifacts/monitor_core/","title":"MonitorCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Monitor API Class and voil\u00e0 it works the same.
MonitorCore class for monitoring SageMaker endpoints
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore","title":"MonitorCore
","text":"Source code in src/sageworks/core/artifacts/monitor_core.py
class MonitorCore:\n def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"ExtractModelArtifact Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role = AWSAccountClamp().sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role, instance_type=self.instance_type)\n\n def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n\n def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n\n def 
details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n\n def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. 
Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n\n @staticmethod\n def process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n\n def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the 
baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n\n def _get_monitor_json_data(self, s3_path: str) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Convert the JSON monitoring data into a DataFrame\n Args:\n s3_path(str): The S3 path to the monitoring data\n Returns:\n pd.DataFrame: Monitoring data in DataFrame form (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=s3_path):\n self.log.warning(\"Monitoring data does not exist in S3.\")\n return None\n else:\n raw_json = read_s3_file(s3_path=s3_path)\n monitoring_data = json.loads(raw_json)\n monitoring_df = pd.json_normalize(monitoring_data[\"features\"])\n return monitoring_df\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. 
Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n\n def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n\n def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__init__","title":"__init__(endpoint_name, instance_type='ml.t3.large')
","text":"ExtractModelArtifact Class Args: endpoint_name (str): Name of the endpoint to set up monitoring for instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\". Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...
Source code insrc/sageworks/core/artifacts/monitor_core.py
def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"ExtractModelArtifact Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role = AWSAccountClamp().sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role, instance_type=self.instance_type)\n
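Constructing a MonitorCore and checking its summary; the endpoint name below is hypothetical:

```python
from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")  # hypothetical endpoint name
print(mon.summary())                         # data capture / baseline / schedule status
```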
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__repr__","title":"__repr__()
","text":"String representation of this MonitorCore object
Returns:
Name Type Descriptionstr
str
String representation of this MonitorCore object
Source code insrc/sageworks/core/artifacts/monitor_core.py
def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for the SageMaker endpoint.
Parameters:
Name Type Description Defaultcapture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/core/artifacts/monitor_core.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n
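A sketch of enabling data capture (note the warning above that this normally redeploys the endpoint); the endpoint name is hypothetical:

```python
from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")   # hypothetical endpoint name
mon.add_data_capture(capture_percentage=100)  # no-op if already configured or if the endpoint is serverless
```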
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.baseline_exists","title":"baseline_exists()
","text":"Check if baseline files exist in S3.
Returns:
Name Type Descriptionbool
bool
True if all files exist, False otherwise.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring Args: recreate (bool): If True, recreate the baseline even if it already exists Notes: This will create/write three files to the baseline_dir: - baseline.csv - constraints.json - statistics.json
Source code insrc/sageworks/core/artifacts/monitor_core.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint. Args: schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly). recreate (bool): If True, recreate the monitoring schedule even if it already exists.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n
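The schedule requires a baseline, so the two calls are usually paired; a sketch with a hypothetical endpoint name:

```python
from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")       # hypothetical endpoint name
mon.create_baseline()                             # writes baseline.csv, constraints.json, statistics.json
mon.create_monitoring_schedule(schedule="hourly")
```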
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.details","title":"details()
","text":"Return the details of the monitoring for the endpoint
Returns:
Name Type Descriptiondict
dict
The details of the monitoring for the endpoint
Source code insrc/sageworks/core/artifacts/monitor_core.py
def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture from S3.
Returns:
Name Type DescriptionDataFrame (input), DataFrame (output)
(pd.DataFrame, pd.DataFrame)
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n
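A sketch of pulling the latest capture and inspecting the input and output DataFrames; the endpoint name is hypothetical:

```python
from sageworks.core.artifacts.monitor_core import MonitorCore

mon = MonitorCore("abalone-regression-end")  # hypothetical endpoint name
input_df, output_df = mon.get_latest_data_capture()
if input_df is not None:
    print(input_df.head())
    print(output_df.head())
```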
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.is_data_capture_configured","title":"is_data_capture_configured(capture_percentage)
","text":"Check if data capture is already configured on the endpoint. Args: capture_percentage (int): Expected data capture percentage. Returns: bool: True if data capture is already configured, False otherwise.
Source code insrc/sageworks/core/artifacts/monitor_core.py
def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n
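Example check, with mon as an assumed MonitorCore instance: the capture_percentage argument must match the endpoint config's InitialSamplingPercentage for this to return True.
# mon is an assumed MonitorCore instance\nif not mon.is_data_capture_configured(capture_percentage=100):\n    print(\"Data capture is not configured at 100 percent\")\n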
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.last_run_details","title":"last_run_details()
","text":"Return the details of the last monitoring run for the endpoint
Returns:
Name Type Descriptiondict
Union[dict, None]
The details of the last monitoring run for the endpoint (None if no monitoring schedule)
Source code insrc/sageworks/core/artifacts/monitor_core.py
def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n
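Small sketch, assuming a MonitorCore instance mon with a monitoring schedule already in place: last_run_details() returns None when no schedule exists, so guard the access.
# mon is an assumed MonitorCore instance\nrun_info = mon.last_run_details()\nif run_info:\n    print(run_info[\"last_run_status\"], run_info[\"last_run_time\"])\n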
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.monitoring_schedule_exists","title":"monitoring_schedule_exists()
","text":"Code to figure out if a monitoring schedule already exists for this endpoint
Source code insrc/sageworks/core/artifacts/monitor_core.py
def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.process_captured_data","title":"process_captured_data(df)
staticmethod
","text":"Process the captured data DataFrame to extract and flatten the nested data.
Parameters:
Name Type Description Defaultdf
DataFrame
DataFrame with captured data.
requiredReturns:
Name Type DescriptionDataFrame (input), DataFrame (output)
Flattened and processed DataFrames for input and output data.
Source code insrc/sageworks/core/artifacts/monitor_core.py
@staticmethod\ndef process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n
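Hedged example: because process_captured_data() is a staticmethod, it can also be called directly on a raw capture DataFrame you have read yourself; the S3 path below is a placeholder, and the awswrangler read mirrors what get_latest_data_capture() does.
import awswrangler as wr\nfrom sageworks.core.artifacts.monitor_core import MonitorCore\n\n# Placeholder S3 path: point this at one of your own data capture files\nraw_df = wr.s3.read_json(path=\"s3://my-bucket/endpoint-data-capture/capture.jsonl\", lines=True)\ninput_df, output_df = MonitorCore.process_captured_data(raw_df)\n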
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.setup_alerts","title":"setup_alerts()
","text":"Code to set up alerts based on monitoring results
Source code insrc/sageworks/core/artifacts/monitor_core.py
def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.summary","title":"summary()
","text":"Return the summary of information about the endpoint monitor
Returns:
Name Type Descriptiondict
dict
Summary of information about the endpoint monitor
Source code insrc/sageworks/core/artifacts/monitor_core.py
def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n
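Quick sketch, with mon as an assumed MonitorCore instance: summary() is a lighter-weight alternative to details() and folds in last_run_details() when a monitoring schedule exists.
# mon is an assumed MonitorCore instance\nsummary = mon.summary()\nprint(summary[\"data_capture\"], summary[\"baseline\"], summary[\"monitoring_schedule\"])\n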
"},{"location":"core_classes/artifacts/overview/","title":"SageWorks Artifacts","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
"},{"location":"core_classes/artifacts/overview/#welcome-to-the-sageworks-core-artifact-classes","title":"Welcome to the SageWorks Core Artifact Classes","text":"These classes provide low-level APIs for the SageWorks package, they interact more directly with AWS Services and are therefore more complex with a fairly large number of methods.
These DataLoader Classes are intended to load larger datasets into AWS. For large data we need to use AWS Glue Jobs/Batch Jobs, and in general the process is a bit more complicated and has fewer features.
If you have smaller data, please see DataLoaders Light
Welcome to the SageWorks DataLoaders Heavy Classes
These classes provide low-level APIs for loading larger data into AWS services
S3HeavyToDataSource
","text":"Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
class S3HeavyToDataSource:\n def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n\n @staticmethod\n def resolve_choice_fields(dyf):\n # Get schema fields\n schema_fields = dyf.schema().fields\n\n # Collect choice fields\n choice_fields = [(field.name, \"cast:long\") for field in schema_fields if field.dataType.typeName() == \"choice\"]\n print(f\"Choice Fields: {choice_fields}\")\n\n # If there are choice fields, resolve them\n if choice_fields:\n dyf = dyf.resolveChoice(specs=choice_fields)\n\n return dyf\n\n def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n\n @staticmethod\n def remove_periods_from_column_names(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n\n def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n ):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = 
self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_column_names(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", 
Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
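Unlike the light loaders, S3HeavyToDataSource has no Common Usage block, so here is a hedged sketch of how it might be driven from inside an AWS Glue job; the SparkContext/GlueContext setup is standard Glue boilerplate, and the input S3 path, output DataSource name, and date timestamp column are placeholders.
from pyspark.context import SparkContext\nfrom awsglue.context import GlueContext\nfrom sageworks.core.transforms.data_loaders.heavy import S3HeavyToDataSource\n\n# Standard AWS Glue job setup\nsc = SparkContext()\nglue_context = GlueContext(sc)\n\n# Placeholder input S3 path and output DataSource name\nheavy_loader = S3HeavyToDataSource(glue_context, \"s3://my-bucket/incoming-data/\", \"heavy_data\")\nheavy_loader.transform(input_type=\"json\", timestamp_columns=[\"date\"], output_format=\"parquet\")\n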
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.__init__","title":"__init__(glue_context, input_uuid, output_uuid)
","text":"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultglue_context
GlueContext
GlueContext, AWS Glue Specific wrapper around SparkContext
requiredinput_uuid
str
The S3 Path to the files to be loaded
requiredoutput_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.remove_periods_from_column_names","title":"remove_periods_from_column_names(dyf)
staticmethod
","text":"Remove periods from column names in the DynamicFrame Args: dyf (DynamicFrame): The DynamicFrame to convert Returns: DynamicFrame: The converted DynamicFrame
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
@staticmethod\ndef remove_periods_from_column_names(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.timestamp_conversions","title":"timestamp_conversions(dyf, time_columns=[])
","text":"Convert columns in the DynamicFrame to the correct data types Args: dyf (DynamicFrame): The DynamicFrame to convert time_columns (list): A list of column names to convert to timestamp Returns: DynamicFrame: The converted DynamicFrame
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.transform","title":"transform(input_type='json', timestamp_columns=None, output_format='parquet')
","text":"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database Args: input_type (str): The type of input files, either 'csv' or 'json' timestamp_columns (list): A list of column names to convert to timestamp output_format (str): The format of the output files, either 'parquet' or 'orc'
Source code insrc/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. 
This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_column_names(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/","title":"DataLoaders Light","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
These DataLoader Classes are intended to load smaller datasets into AWS. If you have large data, please see DataLoaders Heavy
Welcome to the SageWorks DataLoaders Light Classes
These classes provide low-level APIs for loading smaller data into AWS services
CSVToDataSource
","text":" Bases: Transform
CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Common Usagecsv_to_data = CSVToDataSource(csv_file_path, data_uuid)\ncsv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\ncsv_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
class CSVToDataSource(Transform):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Common Usage:\n ```\n csv_to_data = CSVToDataSource(csv_file_path, data_uuid)\n csv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\n csv_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.__init__","title":"__init__(csv_file_path, data_uuid)
","text":"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultcsv_file_path
str
The path to the CSV file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource","title":"JSONToDataSource
","text":" Bases: Transform
JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Common Usagejson_to_data = JSONToDataSource(json_file_path, data_uuid)\njson_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\njson_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
class JSONToDataSource(Transform):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Common Usage:\n ```\n json_to_data = JSONToDataSource(json_file_path, data_uuid)\n json_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\n json_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.__init__","title":"__init__(json_file_path, data_uuid)
","text":"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Parameters:
Name Type Description Defaultjson_file_path
str
The path to the JSON file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
required Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight","title":"S3ToDataSourceLight
","text":" Bases: Transform
S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource
Common Usages3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\ns3_to_data.set_output_tags([\"abalone\", \"whatever\"])\ns3_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
class S3ToDataSourceLight(Transform):\n \"\"\"S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource\n\n Common Usage:\n ```\n s3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\n s3_to_data.set_output_tags([\"abalone\", \"whatever\"])\n s3_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n\n def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.__init__","title":"__init__(s3_path, data_uuid, datatype='csv')
","text":"S3ToDataSourceLight Initialization
Parameters:
Name Type Description Defaults3_path
str
The S3 Path to the file to be transformed
requireddata_uuid
str
The UUID of the SageWorks DataSource to be created
requireddatatype
str
The datatype of the file to be transformed (defaults to \"csv\")
'csv'
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.input_size_mb","title":"input_size_mb()
","text":"Get the size of the input S3 object in MBytes
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code insrc/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_to_features/","title":"Data To Features","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
MolecularDescriptors: Compute a Feature Set based on RDKit Descriptors
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight","title":"DataToFeaturesLight
","text":" Bases: Transform
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
Common Usageto_features = DataToFeaturesLight(data_uuid, feature_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_features.transform(target_column=\"target\"/None, id_column=\"id\"/None,\n event_time_column=\"date\"/None, query=str/None)\n
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
class DataToFeaturesLight(Transform):\n \"\"\"DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas\n\n Common Usage:\n ```\n to_features = DataToFeaturesLight(data_uuid, feature_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.transform(target_column=\"target\"/None, id_column=\"id\"/None,\n event_time_column=\"date\"/None, query=str/None)\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n\n def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n\n def post_transform(self, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n Args:\n target_column(str): The name of the target column in the output DataFrame (default: None)\n id_column(str): The name of the id column in the output DataFrame (default: None)\n event_time_column(str): The name of the event time column in the output DataFrame (default: None)\n auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid, auto_one_hot=auto_one_hot)\n output_features.set_input(\n self.output_df, target_column=target_column, id_column=id_column, event_time_column=event_time_column\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n\n # Create a default training_view for this FeatureSet\n fs = FeatureSetCore(self.output_uuid, force_refresh=True)\n fs.create_default_training_view()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"DataToFeaturesLight Initialization
Parameters:
Name Type Description Defaultdata_uuid
str
The UUID of the SageWorks DataSource to be transformed
requiredfeature_uuid
str
The UUID of the SageWorks FeatureSet to be created
required Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.post_transform","title":"post_transform(target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs)
","text":"At this point the output DataFrame should be populated, so publish it as a Feature Set Args: target_column(str): The name of the target column in the output DataFrame (default: None) id_column(str): The name of the id column in the output DataFrame (default: None) event_time_column(str): The name of the event time column in the output DataFrame (default: None) auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def post_transform(self, target_column=None, id_column=None, event_time_column=None, auto_one_hot=False, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n Args:\n target_column(str): The name of the target column in the output DataFrame (default: None)\n id_column(str): The name of the id column in the output DataFrame (default: None)\n event_time_column(str): The name of the event time column in the output DataFrame (default: None)\n auto_one_hot(bool): Automatically one-hot encode categorical columns (default: False)\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid, auto_one_hot=auto_one_hot)\n output_features.set_input(\n self.output_df, target_column=target_column, id_column=id_column, event_time_column=event_time_column\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n\n # Create a default training_view for this FeatureSet\n fs = FeatureSetCore(self.output_uuid, force_refresh=True)\n fs.create_default_training_view()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.pre_transform","title":"pre_transform(query=None, **kwargs)
","text":"Pull the input DataSource into our Input Pandas DataFrame Args: query(str): Optional query to filter the input DataFrame
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.transform_impl","title":"transform_impl(**kwargs)
","text":"Transform the input DataFrame into a Feature Set
Source code insrc/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors","title":"MolecularDescriptors
","text":" Bases: DataToFeaturesLight
MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource
Common Usageto_features = MolecularDescriptors(data_uuid, feature_uuid)\nto_features.set_output_tags([\"aqsol\", \"whatever\"])\nto_features.transform()\n
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
class MolecularDescriptors(DataToFeaturesLight):\n \"\"\"MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource\n\n Common Usage:\n ```\n to_features = MolecularDescriptors(data_uuid, feature_uuid)\n to_features.set_output_tags([\"aqsol\", \"whatever\"])\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Turn off warnings for RDKIT (revisit this)\n RDLogger.DisableLog(\"rdApp.*\")\n\n def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Check the input DataFrame has the required columns\n if \"smiles\" not in self.input_df.columns:\n raise ValueError(\"Input DataFrame must have a 'smiles' column\")\n\n # There are certain smiles that cause Mordred to crash\n # We'll replace them with 'equivalent' smiles (these need to be verified)\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[O-]C([O-])=O.[NH4+]CCO.[NH4+]CCO\", \"[O]C([O])=O.[N]CCO.[N]CCO\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[NH4+]CCO.[NH4+]CCO.[O-]C([O-])=O\", \"[N]CCO.[N]CCO.[O]C([O])=O\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"O=S(=O)(Nn1c-nnc1)C1=CC=CC=C1\", \"O=S(=O)(NN(C=N1)C=N1)C(C=CC1)=CC=1\"\n )\n\n # Compute/add all the Molecular Descriptors\n self.output_df = self.compute_molecular_descriptors(self.input_df)\n\n # Get the columns that are descriptors\n desc_columns = set(self.output_df.columns) - set(self.input_df.columns)\n\n # Drop any NaNs (and INFs)\n current_rows = self.output_df.shape[0]\n self.output_df = pandas_utils.drop_nans(self.output_df, how=\"any\", subset=desc_columns)\n self.log.warning(f\"Dropped {current_rows - self.output_df.shape[0]} NaN rows\")\n\n def compute_molecular_descriptors(self, process_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute and add all the Molecular Descriptors\n Args:\n process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors\n Returns:\n pd.DataFrame: The input DataFrame with all the RDKit Descriptors added\n \"\"\"\n self.log.important(\"Computing Molecular Descriptors...\")\n\n # Conversion to Molecules\n molecules = [Chem.MolFromSmiles(smile) for smile in process_df[\"smiles\"]]\n\n # Now get all the RDKIT Descriptors\n all_descriptors = [x[0] for x in Descriptors._descList]\n\n # There's an overflow issue that happens with the IPC descriptor, so we'll remove it\n # See: https://github.com/rdkit/rdkit/issues/1527\n if \"Ipc\" in all_descriptors:\n all_descriptors.remove(\"Ipc\")\n\n # Make sure we don't have duplicates\n all_descriptors = list(set(all_descriptors))\n\n # Super useful Molecular Descriptor Calculator Class\n calc = MoleculeDescriptors.MolecularDescriptorCalculator(all_descriptors)\n column_names = calc.GetDescriptorNames()\n descriptor_values = [calc.CalcDescriptors(m) for m in molecules]\n rdkit_features_df = pd.DataFrame(descriptor_values, columns=column_names)\n\n # Now compute Mordred Features\n descriptor_choice = [AcidBase, Aromatic, Polarizability, RotatableBond]\n calc = Calculator()\n for des in descriptor_choice:\n calc.register(des)\n mordred_df = calc.pandas(molecules, nproc=1)\n\n # Return the DataFrame with the RDKit and Mordred 
Descriptors added\n return pd.concat([process_df, rdkit_features_df, mordred_df], axis=1)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"MolecularDescriptors Initialization
Parameters:
Name Type Description Defaultdata_uuid
str
The UUID of the SageWorks DataSource to be transformed
requiredfeature_uuid
str
The UUID of the SageWorks FeatureSet to be created
required Source code insrc/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Turn off warnings for RDKIT (revisit this)\n RDLogger.DisableLog(\"rdApp.*\")\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.compute_molecular_descriptors","title":"compute_molecular_descriptors(process_df)
","text":"Compute and add all the Molecular Descriptors Args: process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors Returns: pd.DataFrame: The input DataFrame with all the RDKit Descriptors added
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def compute_molecular_descriptors(self, process_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute and add all the Molecular Descriptors\n Args:\n process_df(pd.DataFrame): The DataFrame to process and generate RDKit Descriptors\n Returns:\n pd.DataFrame: The input DataFrame with all the RDKit Descriptors added\n \"\"\"\n self.log.important(\"Computing Molecular Descriptors...\")\n\n # Conversion to Molecules\n molecules = [Chem.MolFromSmiles(smile) for smile in process_df[\"smiles\"]]\n\n # Now get all the RDKIT Descriptors\n all_descriptors = [x[0] for x in Descriptors._descList]\n\n # There's an overflow issue that happens with the IPC descriptor, so we'll remove it\n # See: https://github.com/rdkit/rdkit/issues/1527\n if \"Ipc\" in all_descriptors:\n all_descriptors.remove(\"Ipc\")\n\n # Make sure we don't have duplicates\n all_descriptors = list(set(all_descriptors))\n\n # Super useful Molecular Descriptor Calculator Class\n calc = MoleculeDescriptors.MolecularDescriptorCalculator(all_descriptors)\n column_names = calc.GetDescriptorNames()\n descriptor_values = [calc.CalcDescriptors(m) for m in molecules]\n rdkit_features_df = pd.DataFrame(descriptor_values, columns=column_names)\n\n # Now compute Mordred Features\n descriptor_choice = [AcidBase, Aromatic, Polarizability, RotatableBond]\n calc = Calculator()\n for des in descriptor_choice:\n calc.register(des)\n mordred_df = calc.pandas(molecules, nproc=1)\n\n # Return the DataFrame with the RDKit and Mordred Descriptors added\n return pd.concat([process_df, rdkit_features_df, mordred_df], axis=1)\n
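To see what compute_molecular_descriptors() is doing under the hood, here is a minimal, standalone sketch of the RDKit portion on a tiny DataFrame. It assumes RDKit and pandas are installed; the sample SMILES and column values are illustrative only and don't involve any SageWorks artifacts.
```
# Minimal sketch (assumes rdkit + pandas installed): RDKit descriptors for a tiny DataFrame
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

# Two illustrative molecules: ethanol and benzene
df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"]})

# SMILES -> RDKit molecule objects
molecules = [Chem.MolFromSmiles(s) for s in df["smiles"]]

# All RDKit descriptor names, dropping 'Ipc' to avoid the known overflow issue
descriptor_names = [name for name, _ in Descriptors._descList if name != "Ipc"]

# Compute the descriptors and concat them onto the original DataFrame
calc = MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_names)
values = [calc.CalcDescriptors(m) for m in molecules]
rdkit_df = pd.DataFrame(values, columns=calc.GetDescriptorNames())
print(pd.concat([df, rdkit_df], axis=1).head())
```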
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.transform_impl","title":"transform_impl(**kwargs)
","text":"Compute a Feature Set based on RDKit Descriptors
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Check the input DataFrame has the required columns\n if \"smiles\" not in self.input_df.columns:\n raise ValueError(\"Input DataFrame must have a 'smiles' column\")\n\n # There are certain smiles that cause Mordred to crash\n # We'll replace them with 'equivalent' smiles (these need to be verified)\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[O-]C([O-])=O.[NH4+]CCO.[NH4+]CCO\", \"[O]C([O])=O.[N]CCO.[N]CCO\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"[NH4+]CCO.[NH4+]CCO.[O-]C([O-])=O\", \"[N]CCO.[N]CCO.[O]C([O])=O\"\n )\n self.input_df[\"smiles\"] = self.input_df[\"smiles\"].replace(\n \"O=S(=O)(Nn1c-nnc1)C1=CC=CC=C1\", \"O=S(=O)(NN(C=N1)C=N1)C(C=CC1)=CC=1\"\n )\n\n # Compute/add all the Molecular Descriptors\n self.output_df = self.compute_molecular_descriptors(self.input_df)\n\n # Get the columns that are descriptors\n desc_columns = set(self.output_df.columns) - set(self.input_df.columns)\n\n # Drop any NaNs (and INFs)\n current_rows = self.output_df.shape[0]\n self.output_df = pandas_utils.drop_nans(self.output_df, how=\"any\", subset=desc_columns)\n self.log.warning(f\"Dropped {current_rows - self.output_df.shape[0]} NaN rows\")\n
"},{"location":"core_classes/transforms/features_to_model/","title":"Features To Model","text":"API Classes
For most users, the API Classes will provide all the general functionality to create a full AWS ML Pipeline
FeaturesToModel: Train/Create a Model from a Feature Set
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel","title":"FeaturesToModel
","text":" Bases: Transform
FeaturesToModel: Train/Create a Model from a FeatureSet
Common Usageto_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\nto_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_model.transform(target_column=\"class_number_of_rings\",\n input_feature_list=[feature_list])\n
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
class FeaturesToModel(Transform):\n \"\"\"FeaturesToModel: Train/Create a Model from a FeatureSet\n\n Common Usage:\n ```\n to_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\n to_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_model.transform(target_column=\"class_number_of_rings\",\n input_feature_list=[feature_list])\n ```\n \"\"\"\n\n def __init__(self, feature_uuid: str, model_uuid: str, model_type: ModelType = ModelType.UNKNOWN, model_class=None):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str): The class of the model (optional)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # If the model_type is UNKNOWN the model_class must be specified\n if model_type == ModelType.UNKNOWN:\n if model_class is None:\n msg = \"ModelType is UNKNOWN, must specify a model_class!\"\n self.log.critical(msg)\n raise ValueError(msg)\n else:\n self.log.info(\"ModelType is UNKNOWN, using model_class to determine the type...\")\n model_type = self._determine_model_type(model_class)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.estimator = None\n self.model_script_dir = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n\n def _determine_model_type(self, model_class: str) -> ModelType:\n \"\"\"Determine the ModelType from the model_class\n Args:\n model_class (str): The class of the model\n Returns:\n ModelType: The determined ModelType\n \"\"\"\n model_class_lower = model_class.lower()\n\n # Direct mapping for specific models\n specific_model_mapping = {\n \"logisticregression\": ModelType.CLASSIFIER,\n \"linearregression\": ModelType.REGRESSOR,\n \"ridge\": ModelType.REGRESSOR,\n \"lasso\": ModelType.REGRESSOR,\n \"elasticnet\": ModelType.REGRESSOR,\n \"bayesianridge\": ModelType.REGRESSOR,\n \"svc\": ModelType.CLASSIFIER,\n \"svr\": ModelType.REGRESSOR,\n \"gaussiannb\": ModelType.CLASSIFIER,\n \"kmeans\": ModelType.CLUSTERER,\n \"dbscan\": ModelType.CLUSTERER,\n \"meanshift\": ModelType.CLUSTERER,\n }\n\n if model_class_lower in specific_model_mapping:\n return specific_model_mapping[model_class_lower]\n\n # General pattern matching\n if \"regressor\" in model_class_lower:\n return ModelType.REGRESSOR\n elif \"classifier\" in model_class_lower:\n return ModelType.CLASSIFIER\n elif \"quantile\" in model_class_lower:\n return ModelType.QUANTILE_REGRESSOR\n elif \"cluster\" in model_class_lower:\n return ModelType.CLUSTERER\n elif \"transform\" in model_class_lower:\n return ModelType.TRANSFORMER\n else:\n self.log.critical(f\"Unknown ModelType for model_class: {model_class}\")\n return ModelType.UNKNOWN\n\n def generate_model_script(self, target_column: str, feature_list: list[str], train_all_data: bool) -> str:\n \"\"\"Fill in the model template with specific target and feature_list\n Args:\n target_column (str): Column name of the target variable\n feature_list (list[str]): A list of columns for the features\n 
train_all_data (bool): Train on ALL (100%) of the data\n Returns:\n str: The name of the generated model script\n \"\"\"\n\n # FIXME: Revisit all of this since it's a bit wonky\n # Did they specify a Scikit-Learn model class?\n if self.model_class:\n self.log.info(f\"Using Scikit-Learn model class: {self.model_class}\")\n script_name = \"generated_scikit_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_scikit_learn\")\n template_path = os.path.join(self.model_script_dir, \"scikit_learn.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n scikit_template = fp.read()\n\n # Template replacements\n aws_script = scikit_template.replace(\"{{model_class}}\", self.model_class)\n aws_script = aws_script.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.CLASSIFIER:\n script_name = \"generated_xgb_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_xgb_model\")\n template_path = os.path.join(self.model_script_dir, \"xgb_model.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n xgb_template = fp.read()\n\n # Template replacements\n aws_script = xgb_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.QUANTILE_REGRESSOR:\n script_name = \"generated_quantile_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_quant_regression\")\n template_path = os.path.join(self.model_script_dir, \"quant_regression.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n quant_template = fp.read()\n\n # Template replacements\n aws_script = quant_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n\n # Now write out the generated model script and return the name\n with open(output_path, \"w\") as fp:\n fp.write(aws_script)\n return script_name\n\n def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n ):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature 
Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n delete_model = ModelCore(self.output_uuid, force_refresh=True)\n delete_model.delete()\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.column_names()\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). 
A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n self.model_feature_list = [c for c in feature_list if c not in remove_columns]\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Generate our model script\n script_path = self.generate_model_script(self.target_column, self.model_feature_list, train_all_data)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.get_table_name()\n self.class_labels = feature_set.query(f\"select DISTINCT {self.target_column} FROM {table}\")[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.warning(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Create a Sagemaker Model with our script\n self.estimator = SKLearn(\n entry_point=script_path,\n source_dir=self.model_script_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.utcnow().strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n\n def post_transform(self, **kwargs):\n 
\"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type, force_refresh=True)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n\n def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.__init__","title":"__init__(feature_uuid, model_uuid, model_type=ModelType.UNKNOWN, model_class=None)
","text":"FeaturesToModel Initialization Args: feature_uuid (str): UUID of the FeatureSet to use as input model_uuid (str): UUID of the Model to create as output model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc. model_class (str): The class of the model (optional)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def __init__(self, feature_uuid: str, model_uuid: str, model_type: ModelType = ModelType.UNKNOWN, model_class=None):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str): The class of the model (optional)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.ensure_valid_name(model_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # If the model_type is UNKNOWN the model_class must be specified\n if model_type == ModelType.UNKNOWN:\n if model_class is None:\n msg = \"ModelType is UNKNOWN, must specify a model_class!\"\n self.log.critical(msg)\n raise ValueError(msg)\n else:\n self.log.info(\"ModelType is UNKNOWN, using model_class to determine the type...\")\n model_type = self._determine_model_type(model_class)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.estimator = None\n self.model_script_dir = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n
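When the model isn't one of the built-in regressor/classifier templates, the model_class argument lets _determine_model_type() infer the type for you. A hedged sketch: the FeatureSet/Model names are placeholders, and the ModelType import path is an assumption that may differ in your SageWorks version.
```
from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel
from sageworks.api.model import ModelType  # assumption: import path may differ in your version

# Placeholder UUIDs; "KNeighborsRegressor" matches the "regressor" pattern -> ModelType.REGRESSOR
to_model = FeaturesToModel(
    "abalone_features",        # input FeatureSet UUID (placeholder)
    "abalone-knn-reg",         # output Model UUID (placeholder)
    model_type=ModelType.UNKNOWN,
    model_class="KNeighborsRegressor",
)
to_model.set_output_tags(["abalone", "knn"])
to_model.transform(target_column="class_number_of_rings")
```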
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.create_and_register_model","title":"create_and_register_model()
","text":"Create and Register the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.generate_model_script","title":"generate_model_script(target_column, feature_list, train_all_data)
","text":"Fill in the model template with specific target and feature_list Args: target_column (str): Column name of the target variable feature_list (list[str]): A list of columns for the features train_all_data (bool): Train on ALL (100%) of the data Returns: str: The name of the generated model script
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def generate_model_script(self, target_column: str, feature_list: list[str], train_all_data: bool) -> str:\n \"\"\"Fill in the model template with specific target and feature_list\n Args:\n target_column (str): Column name of the target variable\n feature_list (list[str]): A list of columns for the features\n train_all_data (bool): Train on ALL (100%) of the data\n Returns:\n str: The name of the generated model script\n \"\"\"\n\n # FIXME: Revisit all of this since it's a bit wonky\n # Did they specify a Scikit-Learn model class?\n if self.model_class:\n self.log.info(f\"Using Scikit-Learn model class: {self.model_class}\")\n script_name = \"generated_scikit_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_scikit_learn\")\n template_path = os.path.join(self.model_script_dir, \"scikit_learn.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n scikit_template = fp.read()\n\n # Template replacements\n aws_script = scikit_template.replace(\"{{model_class}}\", self.model_class)\n aws_script = aws_script.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.CLASSIFIER:\n script_name = \"generated_xgb_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_xgb_model\")\n template_path = os.path.join(self.model_script_dir, \"xgb_model.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n xgb_template = fp.read()\n\n # Template replacements\n aws_script = xgb_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n aws_script = aws_script.replace(\"{{model_type}}\", self.model_type.value)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n aws_script = aws_script.replace(\"{{train_all_data}}\", str(train_all_data))\n\n elif self.model_type == ModelType.QUANTILE_REGRESSOR:\n script_name = \"generated_quantile_model.py\"\n dir_path = Path(__file__).parent.absolute()\n self.model_script_dir = os.path.join(dir_path, \"light_quant_regression\")\n template_path = os.path.join(self.model_script_dir, \"quant_regression.template\")\n output_path = os.path.join(self.model_script_dir, script_name)\n with open(template_path, \"r\") as fp:\n quant_template = fp.read()\n\n # Template replacements\n aws_script = quant_template.replace(\"{{target_column}}\", target_column)\n feature_list_str = json.dumps(feature_list)\n aws_script = aws_script.replace(\"{{feature_list}}\", feature_list_str)\n metrics_s3_path = f\"{self.model_training_root}/{self.output_uuid}\"\n aws_script = aws_script.replace(\"{{model_metrics_s3_path}}\", metrics_s3_path)\n\n # Now write out the generated model script and return the name\n with open(output_path, \"w\") as fp:\n 
fp.write(aws_script)\n return script_name\n
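The template filling above is plain string substitution of {{placeholder}} tokens. Here is a tiny self-contained sketch of that pattern; the template text and values are illustrative, not the real SageWorks templates.
```
import json

# Illustrative template; the real templates live next to features_to_model.py
template = 'TARGET = "{{target_column}}"\nFEATURES = {{feature_list}}\nTRAIN_ALL_DATA = {{train_all_data}}\n'

# Same substitution pattern as generate_model_script()
script = template.replace("{{target_column}}", "class_number_of_rings")
script = script.replace("{{feature_list}}", json.dumps(["length", "diameter", "height"]))
script = script.replace("{{train_all_data}}", str(False))
print(script)
```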
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() on the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type, force_refresh=True)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.transform_impl","title":"transform_impl(target_column, description=None, feature_list=None, train_all_data=False)
","text":"Generic Features to Model: Note you should create a new class and inherit from this one to include specific logic for your Feature Set/Model Args: target_column (str): Column name of the target variable description (str): Description of the model (optional) feature_list (list[str]): A list of columns for the features (default None, will try to guess) train_all_data (bool): Train on ALL (100%) of the data (default False)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n delete_model = ModelCore(self.output_uuid, force_refresh=True)\n delete_model.delete()\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.column_names()\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). 
A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n self.model_feature_list = [c for c in feature_list if c not in remove_columns]\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Generate our model script\n script_path = self.generate_model_script(self.target_column, self.model_feature_list, train_all_data)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.get_table_name()\n self.class_labels = feature_set.query(f\"select DISTINCT {self.target_column} FROM {table}\")[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.warning(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Create a Sagemaker Model with our script\n self.estimator = SKLearn(\n entry_point=script_path,\n source_dir=self.model_script_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.utcnow().strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n
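End to end, a FeaturesToModel run with an explicit feature list (strongly recommended over letting the class guess) might look like the sketch below. All names are placeholders and assume a configured SageWorks/AWS account; the ModelType import path is an assumption that may differ in your version.
```
from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel
from sageworks.api.model import ModelType  # assumption: import path may differ

to_model = FeaturesToModel("abalone_features", "abalone-regression", model_type=ModelType.REGRESSOR)
to_model.set_output_tags(["abalone", "public"])
to_model.transform(
    target_column="class_number_of_rings",
    description="Abalone ring-count regression",                    # optional
    feature_list=["length", "diameter", "height", "whole_weight"],  # explicit beats guessing
    train_all_data=False,
)
```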
"},{"location":"core_classes/transforms/model_to_endpoint/","title":"Model to Endpoint","text":"API Classes
For most users, the API Classes will provide all the general functionality to create a full AWS ML Pipeline
ModelToEndpoint: Deploy an Endpoint for a Model
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint","title":"ModelToEndpoint
","text":" Bases: Transform
ModelToEndpoint: Deploy an Endpoint for a Model
Common Usageto_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\nto_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\nto_endpoint.transform()\n
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
class ModelToEndpoint(Transform):\n \"\"\"ModelToEndpoint: Deploy an Endpoint for a Model\n\n Common Usage:\n ```\n to_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\n to_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\n to_endpoint.transform()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n Artifact.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n\n def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n existing_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n if existing_endpoint.exists():\n existing_endpoint.delete()\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Will this be a Serverless Endpoint?\n if self.instance_type == \"serverless\":\n self._serverless_deploy(model_package_arn)\n else:\n self._realtime_deploy(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid, force_refresh=True)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n\n def _realtime_deploy(self, model_package_arn: str):\n \"\"\"Internal Method: Deploy the Realtime Endpoint\n\n Args:\n model_package_arn(str): The Model Package ARN used to deploy the Endpoint\n \"\"\"\n # Create a Model Package\n model_package = ModelPackage(role=self.sageworks_role_arn, model_package_arn=model_package_arn)\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Deploy a Realtime Endpoint\n model_package.deploy(\n initial_instance_count=1,\n instance_type=self.instance_type,\n endpoint_name=self.output_uuid,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n tags=aws_tags,\n )\n\n def _serverless_deploy(self, model_package_arn, mem_size=2048, max_concurrency=5, wait=True):\n \"\"\"Internal Method: Deploy a Serverless Endpoint\n\n Args:\n mem_size(int): Memory size in MB (default: 2048)\n max_concurrency(int): Max concurrency (default: 5)\n wait(bool): Wait for the Endpoint to be ready (default: True)\n \"\"\"\n model_name = self.input_uuid\n endpoint_name = self.output_uuid\n aws_tags = self.get_aws_tags()\n\n # Create Low Level Model Resource (Endpoint Config below references this Model Resource)\n # Note: Since model is internal to the endpoint we'll add a timestamp (just like SageMaker does)\n datetime_str = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S-%f\")[:-3]\n model_name = f\"{model_name}-{datetime_str}\"\n self.log.info(f\"Creating Low Level Model: {model_name}...\")\n self.sm_client.create_model(\n ModelName=model_name,\n PrimaryContainer={\n \"ModelPackageName\": model_package_arn,\n },\n ExecutionRoleArn=self.sageworks_role_arn,\n 
Tags=aws_tags,\n )\n\n # Create Endpoint Config\n self.log.info(f\"Creating Endpoint Config {endpoint_name}...\")\n try:\n self.sm_client.create_endpoint_config(\n EndpointConfigName=endpoint_name,\n ProductionVariants=[\n {\n \"ServerlessConfig\": {\"MemorySizeInMB\": mem_size, \"MaxConcurrency\": max_concurrency},\n \"ModelName\": model_name,\n \"VariantName\": \"AllTraffic\",\n }\n ],\n )\n except ClientError as e:\n # Already Exists: Check if ValidationException and existing endpoint configuration\n if (\n e.response[\"Error\"][\"Code\"] == \"ValidationException\"\n and \"already existing endpoint configuration\" in e.response[\"Error\"][\"Message\"]\n ):\n self.log.warning(\"Endpoint configuration already exists: Deleting and retrying...\")\n self.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)\n self.sm_client.create_endpoint_config(\n EndpointConfigName=endpoint_name,\n ProductionVariants=[\n {\n \"ServerlessConfig\": {\"MemorySizeInMB\": mem_size, \"MaxConcurrency\": max_concurrency},\n \"ModelName\": model_name,\n \"VariantName\": \"AllTraffic\",\n }\n ],\n )\n\n # Create Endpoint\n self.log.info(f\"Creating Serverless Endpoint {endpoint_name}...\")\n self.sm_client.create_endpoint(\n EndpointName=endpoint_name, EndpointConfigName=endpoint_name, Tags=self.get_aws_tags()\n )\n\n # Wait for Endpoint to be ready\n if not wait:\n self.log.important(f\"Endpoint {endpoint_name} is being created...\")\n else:\n self.log.important(f\"Waiting for Endpoint {endpoint_name} to be ready...\")\n describe_endpoint_response = self.sm_client.describe_endpoint(EndpointName=endpoint_name)\n while describe_endpoint_response[\"EndpointStatus\"] == \"Creating\":\n time.sleep(30)\n describe_endpoint_response = self.sm_client.describe_endpoint(EndpointName=endpoint_name)\n self.log.info(f\"Endpoint Status: {describe_endpoint_response['EndpointStatus']}\")\n status = describe_endpoint_response[\"EndpointStatus\"]\n if status != \"InService\":\n msg = f\"Endpoint {endpoint_name} failed to be created. Status: {status}\"\n details = describe_endpoint_response[\"FailureReason\"]\n self.log.critical(msg)\n self.log.critical(details)\n raise Exception(msg)\n self.log.important(f\"Endpoint {endpoint_name} is now {status}...\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n output_endpoint.onboard()\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.__init__","title":"__init__(model_uuid, endpoint_uuid, serverless=True)
","text":"ModelToEndpoint Initialization Args: model_uuid(str): The UUID of the input Model endpoint_uuid(str): The UUID of the output Endpoint serverless(bool): Deploy the Endpoint in serverless mode (default: True)
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n Artifact.ensure_valid_name(endpoint_uuid, delimiter=\"-\")\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n
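Since serverless defaults to True, switching between a serverless and a realtime (instance-backed) endpoint is just the one flag. A hedged sketch with placeholder Model/Endpoint names:
```
from sageworks.core.transforms.model_to_endpoint.model_to_endpoint import ModelToEndpoint

# Serverless endpoint (the default)
to_endpoint = ModelToEndpoint("abalone-regression", "abalone-regression-end")
to_endpoint.set_output_tags(["abalone", "serverless"])
to_endpoint.transform()

# Realtime endpoint: serverless=False deploys onto an ml.t2.medium instance
to_endpoint = ModelToEndpoint("abalone-regression", "abalone-regression-rt", serverless=False)
to_endpoint.set_output_tags(["abalone", "realtime"])
to_endpoint.transform()
```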
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the Endpoint
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n output_endpoint.onboard()\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.transform_impl","title":"transform_impl()
","text":"Deploy an Endpoint for a Model
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n existing_endpoint = EndpointCore(self.output_uuid, force_refresh=True)\n if existing_endpoint.exists():\n existing_endpoint.delete()\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Will this be a Serverless Endpoint?\n if self.instance_type == \"serverless\":\n self._serverless_deploy(model_package_arn)\n else:\n self._realtime_deploy(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid, force_refresh=True)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n
"},{"location":"core_classes/transforms/overview/","title":"Transforms","text":"API Classes
For most users, the API Classes will provide all the general functionality to create a full AWS ML Pipeline
SageWorks currently has a large set of Transforms that go from one Artifact type to another (e.g. DataSource to FeatureSet). The Transforms will often have light and heavy versions depending on the scale of data that needs to be transformed.
"},{"location":"core_classes/transforms/overview/#transform-details","title":"Transform Details","text":"API Classes
The API Classes will often provide helpful methods that give you a DataFrame (data_source.query() for instance), so always check out the API Classes first.
These Transforms give you the ultimate in customization and flexibility when creating AWS Machine Learning Pipelines. Grab a Pandas DataFrame from a DataSource or FeatureSet, process it in whatever way your use case requires, and simply create another SageWorks DataSource or FeatureSet from the resulting DataFrame (see the sketch after the large-data note below).
Lots of Options:
Not for Large Data
Pandas Transforms can't handle large datasets (> 4 gigabytes). For transforms on large data, see our Heavy Transforms
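A typical round trip pulls a DataFrame down with one Pandas Transform, processes it locally, and publishes the result back with another. A minimal sketch, assuming modest data sizes; the FeatureSet/DataSource names and column names are placeholders:
```
from sageworks.core.transforms.pandas_transforms import FeaturesToPandas, PandasToData

# Pull a FeatureSet down into a local DataFrame (placeholder FeatureSet name)
feature_to_df = FeaturesToPandas("abalone_features")
feature_to_df.transform(max_rows=10000)
df = feature_to_df.get_output()

# Process it however your use case requires (illustrative derived column, placeholder column names)
df["weight_ratio"] = df["shell_weight"] / df["whole_weight"]

# Publish the processed DataFrame back as a new DataSource (placeholder name)
df_to_data = PandasToData("abalone_processed")
df_to_data.set_output_tags(["abalone", "processed"])
df_to_data.set_input(df)
df_to_data.transform()
```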
Welcome to the SageWorks Pandas Transform Classes
These classes provide low-level APIs for using Pandas DataFrames
DataToPandas
","text":" Bases: Transform
DataToPandas: Class to transform a Data Source into a Pandas DataFrame
Common Usagedata_to_df = DataToPandas(data_source_uuid)\ndata_to_df.transform(query=<optional SQL query to filter/process data>)\ndata_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = data_to_df.get_output()\n\nNote: query is the best way to use this class, so use it :)\n
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
class DataToPandas(Transform):\n \"\"\"DataToPandas: Class to transform a Data Source into a Pandas DataFrame\n\n Common Usage:\n ```\n data_to_df = DataToPandas(data_source_uuid)\n data_to_df.transform(query=<optional SQL query to filter/process data>)\n data_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = data_to_df.get_output()\n\n Note: query is the best way to use this class, so use it :)\n ```\n \"\"\"\n\n def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n\n def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.__init__","title":"__init__(input_uuid)
","text":"DataToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.transform_impl","title":"transform_impl(query=None, max_rows=100000)
","text":"Convert the DataSource into a Pandas DataFrame Args: query(str): The query to run against the DataSource (default: None) max_rows(int): The maximum number of rows to return (default: 100000)
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n
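The query argument above bypasses the automatic sampling entirely, so filtering and column selection can be pushed down into Athena before anything reaches Pandas. A hedged sketch with a placeholder DataSource name and columns:
```
from sageworks.core.transforms.pandas_transforms import DataToPandas

data_to_df = DataToPandas("abalone_data")  # placeholder DataSource name

# With a query: only the filtered rows/columns come back (no sampling)
data_to_df.transform(query="SELECT length, diameter, class_number_of_rings FROM abalone_data")
my_df = data_to_df.get_output()

# Without a query: full table, sampled down if it exceeds max_rows
data_to_df.transform(max_rows=50000)
sampled_df = data_to_df.get_output()
```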
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas","title":"FeaturesToPandas
","text":" Bases: Transform
FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame
Common Usagefeature_to_df = FeaturesToPandas(feature_set_uuid)\nfeature_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = feature_to_df.get_output()\n
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
class FeaturesToPandas(Transform):\n \"\"\"FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame\n\n Common Usage:\n ```\n feature_to_df = FeaturesToPandas(feature_set_uuid)\n feature_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = feature_to_df.get_output()\n ```\n \"\"\"\n\n def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n\n def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.__init__","title":"__init__(feature_set_name)
","text":"FeaturesToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.transform_impl","title":"transform_impl(max_rows=100000)
","text":"Convert the FeatureSet into a Pandas DataFrame
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.column_names()\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n
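Note that get_output() lazily calls transform() if it hasn't been run yet, so the minimal usage (with a placeholder FeatureSet name) is just:
```
from sageworks.core.transforms.pandas_transforms import FeaturesToPandas

feature_to_df = FeaturesToPandas("abalone_features")  # placeholder FeatureSet name
my_df = feature_to_df.get_output()  # runs transform() with the default max_rows=100000
print(my_df.shape)
```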
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData","title":"PandasToData
","text":" Bases: Transform
PandasToData: Class to publish a Pandas DataFrame as a DataSource
Common Usagedf_to_data = PandasToData(output_uuid)\ndf_to_data.set_output_tags([\"test\", \"small\"])\ndf_to_data.set_input(test_df)\ndf_to_data.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
class PandasToData(Transform):\n \"\"\"PandasToData: Class to publish a Pandas DataFrame as a DataSource\n\n Common Usage:\n ```\n df_to_data = PandasToData(output_uuid)\n df_to_data.set_output_tags([\"test\", \"small\"])\n df_to_data.set_input(test_df)\n df_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n\n def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n\n def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n\n def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n\n @staticmethod\n def convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n\n def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # 
Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works will for most uses cases. We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() fnr the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid, force_refresh=True)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.__init__","title":"__init__(output_uuid, output_format='parquet')
","text":"PandasToData Initialization Args: output_uuid (str): The UUID of the DataSource to create output_format (str): The file format to store the S3 object data in (default: \"parquet\")
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_datetime_columns","title":"convert_datetime_columns(df)
staticmethod
","text":"Convert datetime columns to ISO-8601 string
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
@staticmethod\ndef convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n
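As a rough standalone illustration of the datetime-to-string conversion (this sketch uses pandas' own isoformat() instead of the SageWorks datetime_to_iso8601 helper, which is not shown here):

```python
import pandas as pd

df = pd.DataFrame(
    {"event_time": pd.to_datetime(["2024-01-01 12:00:00", "2024-01-02 08:30:00"], utc=True)}
)

# Map each timestamp to an ISO-8601 string, then force a pandas StringDtype column
df["event_time"] = df["event_time"].map(lambda ts: ts.isoformat()).astype(pd.StringDtype())
print(df.dtypes)   # event_time    string
```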
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_datetime","title":"convert_object_to_datetime(df)
","text":"Try to automatically convert object columns to datetime or string columns
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_string","title":"convert_object_to_string(df)
","text":"Try to automatically convert object columns to string columns
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() fnr the DataSource
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def post_transform(self, **kwargs):\n    \"\"\"Post-Transform: Calling onboard() for the DataSource\"\"\"\n    self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n    # Onboard the DataSource\n    output_data_source = DataSourceFactory(self.output_uuid, force_refresh=True)\n    output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.set_input","title":"set_input(input_df)
","text":"Set the DataFrame Input for this Transform
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.transform_impl","title":"transform_impl(overwrite=True, **kwargs)
","text":"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Parameters:
Name Type Description Defaultoverwrite
bool
Overwrite the existing data in the SageWorks S3 Bucket
True
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works will for most uses cases. We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n
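Since the format branch above only accepts "parquet" or "jsonl", here is a hedged sketch of how that choice plays out; the DataSource UUIDs and DataFrame contents are illustrative only:

```python
import pandas as pd
from sageworks.core.transforms.pandas_transforms import PandasToData

df = pd.DataFrame({"ID": [1, 2], "Value": [0.1, 0.2]})   # toy data; column names get lowercased on transform

# Default (recommended): Parquet-backed DataSource
df_to_data = PandasToData("test_data")
df_to_data.set_output_tags(["test", "small"])
df_to_data.set_input(df)
df_to_data.transform()

# "jsonl" is accepted but logs warnings recommending Parquet;
# any other output_format raises ValueError("Unsupported file format: ...")
df_to_jsonl = PandasToData("test_data_jsonl", output_format="jsonl")
```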
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures","title":"PandasToFeatures
","text":" Bases: Transform
PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet
Common Usageto_features = PandasToFeatures(output_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_features.set_input(df, id_column=\"id\"/None, event_time_column=\"date\"/None)\nto_features.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
class PandasToFeatures(Transform):\n \"\"\"PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet\n\n Common Usage:\n ```\n to_features = PandasToFeatures(output_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.set_input(df, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, auto_one_hot=False):\n \"\"\"PandasToFeatures Initialization\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n auto_one_hot (bool): Should we automatically one-hot encode categorical columns?\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.target_column = None\n self.id_column = None\n self.event_time_column = None\n self.auto_one_hot = auto_one_hot\n self.categorical_dtypes = {}\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n\n # Delete the existing FeatureSet if it exists\n self.delete_existing()\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n\n def set_input(self, input_df: pd.DataFrame, target_column=None, id_column=None, event_time_column=None):\n \"\"\"Set the Input DataFrame for this Transform\n Args:\n input_df (pd.DataFrame): The input DataFrame\n target_column (str): The name of the target column (default: None)\n id_column (str): The name of the id column (default: None)\n event_time_column (str): The name of the event_time column (default: None)\n \"\"\"\n self.target_column = target_column\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n\n def delete_existing(self):\n # Delete the existing FeatureSet if it exists\n try:\n delete_fs = FeatureSetCore(self.output_uuid)\n if delete_fs.exists():\n self.log.info(f\"Deleting the {self.output_uuid} FeatureSet...\")\n delete_fs.delete()\n time.sleep(1)\n except ClientError as exc:\n self.log.info(f\"FeatureSet {self.output_uuid} doesn't exist...\")\n self.log.info(exc)\n\n def _ensure_id_column(self):\n \"\"\"Internal: AWS Feature Store requires an Id field for all data store\"\"\"\n if self.id_column is None or self.id_column not in self.output_df.columns:\n if \"id\" not in self.output_df.columns:\n self.log.info(\"Generating an id column before FeatureSet Creation...\")\n self.output_df[\"id\"] = self.output_df.index\n self.id_column = \"id\"\n\n def _ensure_event_time(self):\n \"\"\"Internal: AWS Feature Store requires an event_time field for all data stored\"\"\"\n if self.event_time_column is None or self.event_time_column not in self.output_df.columns:\n self.log.info(\"Generating an event_time column before FeatureSet Creation...\")\n self.event_time_column = \"event_time\"\n self.output_df[self.event_time_column] = pd.Timestamp(\"now\", tz=\"UTC\")\n\n # The event_time_column is defined, so we need to make sure it's in ISO-8601 string format\n # Note: AWS Feature Store only a particular ISO-8601 format not ALL ISO-8601 formats\n time_column = self.output_df[self.event_time_column]\n\n # Check if the event_time_column is of type object or string convert it to 
DateTime\n if time_column.dtypes == \"object\" or time_column.dtypes.name == \"string\":\n self.log.info(f\"Converting {self.event_time_column} to DateTime...\")\n time_column = pd.to_datetime(time_column)\n\n # Let's make sure it the right type for Feature Store\n if pd.api.types.is_datetime64_any_dtype(time_column):\n self.log.info(f\"Converting {self.event_time_column} to ISOFormat Date String before FeatureSet Creation...\")\n\n # Convert the datetime DType to ISO-8601 string\n # TableFormat=ICEBERG does not support alternate formats for event_time field, it only supports String type.\n time_column = time_column.map(datetime_to_iso8601)\n self.output_df[self.event_time_column] = time_column.astype(\"string\")\n\n def _convert_objs_to_string(self):\n \"\"\"Internal: AWS Feature Store doesn't know how to store object dtypes, so convert to String\"\"\"\n for col in self.output_df:\n if pd.api.types.is_object_dtype(self.output_df[col].dtype):\n self.output_df[col] = self.output_df[col].astype(pd.StringDtype())\n\n def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n\n def shorten_column_name(self, name, max_length=20):\n if len(name) <= max_length:\n return name\n\n # Start building the new name from the end\n parts = name.split(\"_\")[::-1]\n new_name = \"\"\n for part in parts:\n if len(new_name) + len(part) + 1 <= max_length: # +1 for the underscore\n new_name = f\"{part}_{new_name}\" if new_name else part\n else:\n break\n\n # If new_name is empty, just use the last part of the original name\n if not new_name:\n new_name = parts[0]\n\n self.log.info(f\"Shortening {name} to {new_name}\")\n return new_name\n\n def sanitize_column_name(self, name):\n # Remove all invalid characters\n sanitized = re.sub(\"[^a-zA-Z0-9-_]\", \"_\", name)\n sanitized = re.sub(\"_+\", \"_\", sanitized)\n sanitized = sanitized.strip(\"_\")\n\n # Log the change if the name was altered\n if sanitized != name:\n self.log.info(f\"Sanitizing {name} to {sanitized}\")\n\n return sanitized\n\n def one_hot_encoding(self, df, categorical_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\"\"\"\n\n # Now convert Categorical Types to One Hot Encoding\n current_columns = list(df.columns)\n df = pd.get_dummies(df, columns=categorical_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n\n # Helper Methods\n def auto_convert_columns_to_categorical(self):\n \"\"\"Convert object and string types to Categorical\"\"\"\n categorical_columns = []\n for feature, dtype in self.output_df.dtypes.items():\n if dtype in [\"object\", \"string\", \"category\"] and feature not in [\n 
self.event_time_column,\n self.id_column,\n self.target_column,\n ]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 6:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n categorical_columns.append(feature)\n\n # Now run one hot encoding on categorical columns\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n\n def manual_categorical_converter(self):\n \"\"\"Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n # Now convert Categorical Types to One Hot Encoding\n categorical_columns = list(self.categorical_dtypes.keys())\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n\n @staticmethod\n def convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Special case for datetime types\n for column in df.select_dtypes(include=[\"datetime\"]).columns:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n\n def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Convert object and string types to Categorical\n if self.auto_one_hot:\n self.auto_convert_columns_to_categorical()\n else:\n self.manual_categorical_converter()\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n 
my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group\"\"\"\n self.output_feature_group = self.create_feature_group()\n\n def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(\"Ingesting rows into Feature Group...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_processes=8, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows to be ingested: {self.expected_rows}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid, force_refresh=True)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n def ensure_feature_group_created(self, feature_group):\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n while status == \"Creating\":\n self.log.debug(\"FeatureSet being Created...\")\n time.sleep(5)\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n self.log.info(f\"FeatureSet {feature_group.name} successfully created\")\n\n def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = 
self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n not_all_rows_retry = 5\n while rows < expected_rows and not_all_rows_retry > 0:\n sleep_time = 5 if rows else 60\n not_all_rows_retry -= 1 if rows else 0\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n self.log.warning(\n f\"Did not reach expected rows ({rows}/{expected_rows}) but we're not sweating the small stuff...\"\n )\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.__init__","title":"__init__(output_uuid, auto_one_hot=False)
","text":"PandasToFeatures Initialization Args: output_uuid (str): The UUID of the FeatureSet to create auto_one_hot (bool): Should we automatically one-hot encode categorical columns?
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def __init__(self, output_uuid: str, auto_one_hot=False):\n \"\"\"PandasToFeatures Initialization\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n auto_one_hot (bool): Should we automatically one-hot encode categorical columns?\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.target_column = None\n self.id_column = None\n self.event_time_column = None\n self.auto_one_hot = auto_one_hot\n self.categorical_dtypes = {}\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n\n # Delete the existing FeatureSet if it exists\n self.delete_existing()\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.auto_convert_columns_to_categorical","title":"auto_convert_columns_to_categorical()
","text":"Convert object and string types to Categorical
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def auto_convert_columns_to_categorical(self):\n \"\"\"Convert object and string types to Categorical\"\"\"\n categorical_columns = []\n for feature, dtype in self.output_df.dtypes.items():\n if dtype in [\"object\", \"string\", \"category\"] and feature not in [\n self.event_time_column,\n self.id_column,\n self.target_column,\n ]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 6:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n categorical_columns.append(feature)\n\n # Now run one hot encoding on categorical columns\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n
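The 1 &lt; nunique &lt; 6 heuristic above can be exercised on a plain DataFrame; a small sketch of the same idea, with invented column names and values:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sex": ["M", "F", "I", "M", "F", "I"],                    # 3 unique values -> categorical
        "smiles": ["C", "CC", "CCC", "CCCC", "CCCCC", "CCCCCC"],  # 6 unique values -> left alone
    }
)

categorical_columns = []
for feature, dtype in df.dtypes.items():
    if dtype in ["object", "string", "category"]:
        unique_values = df[feature].nunique()
        if 1 < unique_values < 6:          # the low-cardinality heuristic used above
            df[feature] = df[feature].astype("category")
            categorical_columns.append(feature)

print(categorical_columns)                 # ['sex']
```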
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_column_types","title":"convert_column_types(df)
staticmethod
","text":"Convert the types of the DataFrame to the correct types for the Feature Store
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
@staticmethod\ndef convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Special case for datetime types\n for column in df.select_dtypes(include=[\"datetime\"]).columns:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n
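A quick illustration of why these conversions matter: Feature Store only accepts Integral, Fractional, and String features, so bools, categoricals, and datetimes all need to land in one of those. A toy version of the same conversions (using pandas' isoformat() in place of the datetime_to_iso8601 helper):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "flag": [True, False],
        "grade": pd.Categorical(["A", "B"]),
        "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)

df["flag"] = df["flag"].astype("int32")       # bool -> Integral
df["grade"] = df["grade"].astype("str")       # category -> String
df["when"] = df["when"].map(lambda ts: ts.isoformat()).astype("string")  # datetime -> ISO-8601 String

print(df.dtypes)
```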
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.create_feature_group","title":"create_feature_group()
","text":"Create a Feature Group, load our Feature Definitions, and wait for it to be ready
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.manual_categorical_converter","title":"manual_categorical_converter()
","text":"Convert object and string types to Categorical
NoteThis method is used for streaming/chunking. You can set the categorical_dtypes attribute to a dictionary of column names and their respective categorical types.
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def manual_categorical_converter(self):\n \"\"\"Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n # Now convert Categorical Types to One Hot Encoding\n categorical_columns = list(self.categorical_dtypes.keys())\n self.output_df = self.one_hot_encoding(self.output_df, categorical_columns)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.one_hot_encoding","title":"one_hot_encoding(df, categorical_columns)
","text":"One Hot Encoding for Categorical Columns with additional column name management
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def one_hot_encoding(self, df, categorical_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\"\"\"\n\n # Now convert Categorical Types to One Hot Encoding\n current_columns = list(df.columns)\n df = pd.get_dummies(df, columns=categorical_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n
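The column-name bookkeeping around pd.get_dummies can be seen in a standalone sketch (column names and values here are invented):

```python
import pandas as pd

df = pd.DataFrame({"sex": pd.Categorical(["M", "F", "I"]), "length": [0.5, 0.6, 0.7]})

before = list(df.columns)
df = pd.get_dummies(df, columns=["sex"])           # -> sex_F, sex_I, sex_M dummy columns
new_columns = list(set(df.columns) - set(before))  # only the columns get_dummies created

df[new_columns] = df[new_columns].astype("int32")  # Feature Store wants integral types, not bool
print(sorted(new_columns))                         # ['sex_F', 'sex_I', 'sex_M']
```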
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Populating Offline Storage and onboard()
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid, force_refresh=True)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group\"\"\"\n self.output_feature_group = self.create_feature_group()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.prep_dataframe","title":"prep_dataframe()
","text":"Prep the DataFrame for Feature Store Creation
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Convert object and string types to Categorical\n if self.auto_one_hot:\n self.auto_convert_columns_to_categorical()\n else:\n self.manual_categorical_converter()\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.process_column_name","title":"process_column_name(column, shorten=False)
","text":"Call various methods to make sure the column is ready for Feature Store Args: column (str): The column name to process shorten (bool): Should we shorten the column name? (default: False)
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n
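For intuition, here is roughly what the sanitize step does to a messy column name; the regexes below mirror the ones in sanitize_column_name, and the example name is invented:

```python
import re

name = "Molecular Weight (g/mol)"
sanitized = re.sub("[^a-zA-Z0-9-_]", "_", name)   # anything outside [a-zA-Z0-9-_] becomes '_'
sanitized = re.sub("_+", "_", sanitized)          # collapse runs of underscores
sanitized = sanitized.strip("_")                  # trim leading/trailing underscores
print(sanitized)                                  # Molecular_Weight_g_mol
```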
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.set_input","title":"set_input(input_df, target_column=None, id_column=None, event_time_column=None)
","text":"Set the Input DataFrame for this Transform Args: input_df (pd.DataFrame): The input DataFrame target_column (str): The name of the target column (default: None) id_column (str): The name of the id column (default: None) event_time_column (str): The name of the event_time column (default: None)
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def set_input(self, input_df: pd.DataFrame, target_column=None, id_column=None, event_time_column=None):\n \"\"\"Set the Input DataFrame for this Transform\n Args:\n input_df (pd.DataFrame): The input DataFrame\n target_column (str): The name of the target column (default: None)\n id_column (str): The name of the id column (default: None)\n event_time_column (str): The name of the event_time column (default: None)\n \"\"\"\n self.target_column = target_column\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.transform_impl","title":"transform_impl()
","text":"Transform Implementation: Ingest the data into the Feature Group
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(\"Ingesting rows into Feature Group...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_processes=8, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows to be ingested: {self.expected_rows}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.wait_for_rows","title":"wait_for_rows(expected_rows)
","text":"Wait for AWS Feature Group to fully populate the Offline Storage
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n not_all_rows_retry = 5\n while rows < expected_rows and not_all_rows_retry > 0:\n sleep_time = 5 if rows else 60\n not_all_rows_retry -= 1 if rows else 0\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n self.log.warning(\n f\"Did not reach expected rows ({rows}/{expected_rows}) but we're not sweating the small stuff...\"\n )\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked","title":"PandasToFeaturesChunked
","text":" Bases: Transform
PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet
Common Usageto_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\ncat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\nto_features.set_categorical_info(cat_column_info)\nto_features.add_chunk(df)\nto_features.add_chunk(df)\n...\nto_features.finalize()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
class PandasToFeaturesChunked(Transform):\n \"\"\"PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet\n\n Common Usage:\n ```\n to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n cat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\n to_features.set_categorical_info(cat_column_info)\n to_features.add_chunk(df)\n to_features.add_chunk(df)\n ...\n to_features.finalize()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid, auto_one_hot=False)\n\n def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n\n def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n\n def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.__init__","title":"__init__(output_uuid, id_column=None, event_time_column=None)
","text":"PandasToFeaturesChunked Initialization
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.ensure_valid_name(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid, auto_one_hot=False)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.add_chunk","title":"add_chunk(chunk_df)
","text":"Add a Chunk of Data to the FeatureSet
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n
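A hedged sketch of how add_chunk() is typically driven, mirroring the Common Usage shown above but feeding it from pandas' chunked CSV reader; the file name, column names, and FeatureSet UUID are invented:

```python
import pandas as pd
from sageworks.core.transforms.pandas_transforms import PandasToFeaturesChunked

to_features = PandasToFeaturesChunked("big_features", id_column="id", event_time_column="date")
to_features.set_output_tags(["large", "chunked"])
to_features.set_categorical_info({"sex": ["M", "F", "I"]})   # categories must be known up front

# Stream the file in manageable pieces instead of loading it all at once
for chunk_df in pd.read_csv("big_file.csv", chunksize=100_000):
    to_features.add_chunk(chunk_df)

to_features.finalize()   # as in the Common Usage above
```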
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any Post Transform Steps
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group with Chunked Data
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.set_categorical_info","title":"set_categorical_info(cat_column_info)
","text":"Set the Categorical Columns Args: cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n
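The categorical info is just a plain dict of column name to allowed values; under the hood it becomes pandas CategoricalDtype objects, so every chunk is cast consistently even when a chunk is missing some categories. A small sketch with illustrative values:

```python
import pandas as pd
from pandas import CategoricalDtype

cat_column_info = {"sex": ["M", "F", "I"]}   # column -> allowed categories (illustrative)
cat_d_types = {col: CategoricalDtype(categories=vals) for col, vals in cat_column_info.items()}

chunk = pd.DataFrame({"sex": ["M", "M", "F"]})           # this chunk never sees "I"
chunk["sex"] = chunk["sex"].astype(cat_d_types["sex"])
print(chunk["sex"].cat.categories.tolist())              # ['M', 'F', 'I']
```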
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.transform_impl","title":"transform_impl()
","text":"Required implementation of the Transform interface
Source code insrc/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n
"},{"location":"core_classes/transforms/transform/","title":"Transform","text":"API Classes
The API Classes use Transforms internally; for example, model.to_endpoint() uses the ModelToEndpoint() transform. If you need more control over a Transform, you can use the Core Classes directly.
The SageWorks Transform class is a base/abstract class that defines the API implemented by all the child classes (DataLoaders, DataSourceToFeatureSet, ModelToEndpoint, etc.).
Transform: Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
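To make that contract concrete, a minimal hypothetical subclass only needs transform_impl() and post_transform(); tags, metadata, and the AWS sessions come from the base class. Note that instantiating it still requires a valid SageWorks configuration, per the base __init__ shown below:

```python
from sageworks.core.transforms.transform import Transform

class MyTransform(Transform):
    """Hypothetical Transform subclass, for illustration only"""

    def transform_impl(self, **kwargs):
        # Do the actual input -> output work here
        self.log.info(f"Transforming {self.input_uuid} into {self.output_uuid}...")

    def post_transform(self, **kwargs):
        # Verify the output artifact is ready for use
        self.log.info("Post-Transform checks...")
```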
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform","title":"Transform
","text":" Bases: ABC
Transform: Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
Source code insrc/sageworks/core/transforms/transform.py
class Transform(ABC):\n \"\"\"Transform: Base Class for all transforms within SageWorks. Inherited Classes\n must implement the abstract transform_impl() method\"\"\"\n\n def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.sageworks_execution_role_arn()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n @abstractmethod\n def transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n\n def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n\n @abstractmethod\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n\n def set_output_tags(self, tags: list | str):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (list | str): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n\n def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n\n @staticmethod\n def convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n\n def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n\n @final\n def transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n 
self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n\n def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n\n def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n\n def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n\n def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.__init__","title":"__init__(input_uuid, output_uuid)
","text":"Transform Initialization
Source code in src/sageworks/core/transforms/transform.py
def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.sageworks_execution_role_arn()\n self.boto_session = self.aws_account_clamp.boto_session()\n self.sm_session = self.aws_account_clamp.sagemaker_session(self.boto_session)\n self.sm_client = self.aws_account_clamp.sagemaker_client(self.boto_session)\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.add_output_meta","title":"add_output_meta(meta)
","text":"Add additional metadata that will be associated with the output artifact Args: meta (dict): A dictionary of metadata
Source code in src/sageworks/core/transforms/transform.py
def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.convert_to_aws_tags","title":"convert_to_aws_tags(metadata)
staticmethod
","text":"Convert a dictionary to the AWS tag format (list of dicts) [ {Key: key_name, Value: value}, {..}, ...]
Source code in src/sageworks/core/transforms/transform.py
@staticmethod\ndef convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.get_aws_tags","title":"get_aws_tags()
","text":"Get the metadata/tags and convert them into AWS Tag Format
Source code in src/sageworks/core/transforms/transform.py
def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.input_type","title":"input_type()
","text":"What Input Type does this Transform Consume
Source code in src/sageworks/core/transforms/transform.py
def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.output_type","title":"output_type()
","text":"What Output Type does this Transform Produce
Source code in src/sageworks/core/transforms/transform.py
def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.post_transform","title":"post_transform(**kwargs)
abstractmethod
","text":"Post-Transform ensures that the output Artifact is ready for use
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.pre_transform","title":"pre_transform(**kwargs)
","text":"Perform any Pre-Transform operations
Source code in src/sageworks/core/transforms/transform.py
def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_input_uuid","title":"set_input_uuid(input_uuid)
","text":"Set the Input UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_tags","title":"set_output_tags(tags)
","text":"Set the tags that will be associated with the output object Args: tags (list | str): The list of tags or a '::' separated string of tags
Source code in src/sageworks/core/transforms/transform.py
def set_output_tags(self, tags: list | str):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (list | str): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_uuid","title":"set_output_uuid(output_uuid)
","text":"Set the Output UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform","title":"transform(**kwargs)
","text":"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations
Source code in src/sageworks/core/transforms/transform.py
@final\ndef transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform_impl","title":"transform_impl(**kwargs)
abstractmethod
","text":"Abstract Method: Implement the Transformation from Input to Output
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformInput","title":"TransformInput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Inputs
Source code in src/sageworks/core/transforms/transform.py
class TransformInput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Inputs\"\"\"\n\n LOCAL_FILE = auto()\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformOutput","title":"TransformOutput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Outputs
Source code in src/sageworks/core/transforms/transform.py
class TransformOutput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Outputs\"\"\"\n\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n ENDPOINT = auto()\n
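For illustration, here's a minimal sketch of how a concrete subclass ties these pieces together; the PandasToPandas class, its DataFrame logic, and the artifact names are hypothetical (not part of the SageWorks API), and a valid SageWorks configuration is assumed:
import pandas as pd\n\n# Hypothetical subclass: consumes and produces a Pandas DataFrame\nclass PandasToPandas(Transform):\n    \"\"\"Example Transform sketch that drops rows with missing values\"\"\"\n\n    def __init__(self, input_uuid: str, output_uuid: str):\n        super().__init__(input_uuid, output_uuid)\n        self.input_type = TransformInput.PANDAS_DF\n        self.output_type = TransformOutput.PANDAS_DF\n        self.output_df = None\n\n    def transform_impl(self, input_df: pd.DataFrame, **kwargs):\n        # The actual Input -> Output work happens here\n        self.output_df = input_df.dropna()\n\n    def post_transform(self, **kwargs):\n        # Ensure the output artifact is ready for use\n        self.log.info(f\"Output rows: {len(self.output_df)}\")\n\n# Usage: set tags on the output, then transform() runs pre/impl/post in order\nexample = PandasToPandas(\"my_input_df\", \"my_output_df\")\nexample.set_output_tags([\"example\", \"dropna\"])\nexample.transform(input_df=pd.DataFrame({\"a\": [1, None, 3]}))\n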
"},{"location":"enterprise/","title":"SageWorks Enterprise","text":"The SageWorks API and User Interfaces cover a broad set of AWS Machine Learning services and provide easy to use abstractions and visualizations of your AWS ML data. We offer a wide range of options to best fit your companies needs.
Accelerate ML Pipeline development with an Enterprise License! Free Enterprise: Lite Enterprise: Standard Enterprise: Pro Python API \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 SageWorks REPL \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 AWS Onboarding \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard Plugins \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Custom Pages \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 Themes \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 ML Pipelines \u2796 \u2796 \u2796 \ud83d\udfe2 Project Branding \u2796 \u2796 \u2796 \ud83d\udfe2 Prioritized Feature Requests \u2796 \u2796 \u2796 \ud83d\udfe2 Pricing \u2796 $1500* $3000* $4000* *USD per month, includes AWS setup, support, and training: Everything needed to accelerate your AWS ML Development team. Interested in Data Science/Engineering consulting? We have top-notch Consultants with a depth and breadth of AWS ML/DS/Engineering expertise.
"},{"location":"enterprise/#try-sageworks","title":"Try SageWorks","text":"We encourage new users to try out the free version, first. We offer support in our Discord channel and our Documentation has instructions for how to get started with SageWorks. So try it out and when you're ready to accelerate your AWS ML Adventure with an Enterprise licence contact us at SageWorks Sales
"},{"location":"enterprise/#data-engineeringscience-consulting","title":"Data Engineering/Science Consulting","text":"Alongside our SageWorks Enterprise offerings, we provide comprehensive consulting services and domain expertise through our Partnerships. We specialize in AWS Machine Learning Systems and our extended team of Data Scientists and Engineers, have Masters and Ph.D. degrees in Computer Science, Chemistry, and Pharmacology. We also have a parntership with Nomic Networks to support our Network Security Clients.
Using AWS and SageWorks, our experts are equipped to deliver tailored solutions that are focused on your project needs and deliverables. For more information, please touch base and we'll set up a free initial consultation: SageWorks Consulting
"},{"location":"enterprise/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales
"},{"location":"enterprise/private_saas/","title":"Benefits of a Private SaaS Architecture","text":""},{"location":"enterprise/private_saas/#self-hosted-vs-private-saas-vs-public-saas","title":"Self Hosted vs Private SaaS vs Public SaaS?","text":"At the top level your team/project is making a decision about how they are going to build, expand, support, and maintain a machine learning pipeline.
Conceptual ML Pipeline
Data \u2b95 Features \u2b95 Models \u2b95 Deployment (end-user application)\n
Concrete/Real World Example
S3 \u2b95 Glue Job \u2b95 Data Catalog \u2b95 FeatureGroups \u2b95 Models \u2b95 Endpoints \u2b95 App\n
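As a rough sketch only, the concrete pipeline above maps onto the SageWorks Python API in a handful of lines; the to_features/to_model/to_endpoint calls, their arguments, and all names below are assumptions, so check the SageWorks API Classes documentation for the exact signatures:
from sageworks.api.data_source import DataSource\n\n# S3 -> DataSource (Glue Job / Data Catalog handled underneath)\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\", name=\"abalone_data\")\n\n# DataSource -> FeatureSet -> Model -> Endpoint (assumed method names and arguments)\nfs = ds.to_features(\"abalone_features\")\nmodel = fs.to_model(target_column=\"class_number_of_rings\", name=\"abalone-regression\")\nendpoint = model.to_endpoint(name=\"abalone-regression-end\")\n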
When building out a framework to support ML Pipelines there are three main options:
The other choice, which we're not going to cover here, is whether you use AWS, Azure, GCP, or something else. SageWorks is architected and powered by a broad and rich set of AWS ML Pipeline services. We believe that AWS provides the best set of functionality and APIs for flexible, real-world ML architectures.
"},{"location":"enterprise/private_saas/#resources","title":"Resources","text":"See our full presentation on the SageWorks Private SaaS Architecture
"},{"location":"enterprise/project_branding/","title":"Project Branding","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Project Branding allows you to change page headers, titles, and logos to match your project. All user interfaces will reflect your project name and company logos.
"},{"location":"enterprise/project_branding/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"enterprise/themes/","title":"SageWorks Themes","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Themes allows you to customize the User Interfaces to suit your preferences, including completely customized color palettes and fonts. We offer a set of default 'dark' and 'light' themes, but we'll also customize the theme to match your company's color palette and logos.
"},{"location":"enterprise/themes/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"getting_started/","title":"Getting Started","text":"For the initial setup of SageWorks we'll be using the SageWorks REPL. When you start sageworks
it will recognize that it needs to complete the initial configuration and will guide you through that process.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"getting_started/#initial-setupconfig","title":"Initial Setup/Config","text":"Notes: Use the SageWorks REPL to setup your AWS connection for both API Usage (Data Scientists/Engineers) and AWS Initial Setup (AWS Folks). Also if you don't already have an AWS Profile or SSO Setup you'll need to do that first Developer SSO Setup
> pip install sageworks\n> sageworks <-- This starts the REPL\n\nWelcome to SageWorks!\nLooks like this is your first time using SageWorks...\nLet's get you set up...\nAWS_PROFILE: my_aws_profile\nSAGEWORKS_BUCKET: my-company-sageworks\n[optional] REDIS_HOST(localhost): my-redis.cache.amazon (or leave blank)\n[optional] REDIS_PORT(6379):\n[optional] REDIS_PASSWORD():\n[optional] SAGEWORKS_API_KEY(open_source): my_api_key (or leave blank)\n
That's It: You're now all set. This configuration only needs to be done ONCE :)"},{"location":"getting_started/#data-scientistsengineers","title":"Data Scientists/Engineers","text":"For companies that are setting up SageWorks on an internal AWS Account: Company AWS Setup
"},{"location":"getting_started/#additional-resources","title":"Additional Resources","text":"AWS Glue Simplified
AWS Glue Jobs are a great way to automate ETL and data processing. SageWorks takes all the hassle out of creating and debugging Glue Jobs. Follow this guide and empower your Glue Jobs with SageWorks!
SageWorks makes creating, testing, and debugging AWS Glue Jobs easy. The exact same SageWorks API Classes are used in your Glue Jobs. Also, since SageWorks manages the roles for both the API and Glue Jobs, you'll be able to test new Glue Jobs locally, which minimizes surprises when deploying your Glue Job.
"},{"location":"glue/#glue-job-setup","title":"Glue Job Setup","text":"Setting up a AWS Glue Job that uses SageWorks is straight forward. SageWorks can be 'installed' on AWS Glue via the --additional-python-modules
parameter, and then you can use the SageWorks API just like normal.
Here are the settings and a screenshot to guide you. There are several ways to set up and run Glue Jobs, using either the SageWorks-ExecutionRole or the SageWorksAPIPolicy. Please feel free to contact SageWorks support if you need any help with setting up Glue Jobs.
Glue IAM Role Details
If your Glue Jobs already use an existing IAM Role then you can add the SageWorksAPIPolicy
to that Role to enable the Glue Job to perform SageWorks API Tasks.
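For example, the policy can be attached to an existing role from the AWS CLI (the role name and account ID below are placeholders):
aws iam attach-role-policy \\\n    --role-name My-Existing-Glue-Role \\\n    --policy-arn arn:aws:iam::123456789012:policy/SageWorksAPIPolicy\n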
Anyone familiar with a typical Glue Job should be pleasantly surprised by how simple the example below is. Also, SageWorks allows you to test Glue Jobs locally using the same code that you use for scripts and notebooks (see Glue Testing)
Glue Job Arguments
AWS Glue Jobs take arguments in the form of Job Parameters (see screenshot above). There's a SageWorks utility function glue_args_to_dict
that turns these Job Parameters into a nice dictionary for ease of use.
import sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import glue_args_to_dict\n\n# Convert Glue Job Args to a Dictionary\nglue_args = glue_args_to_dict(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"--sageworks-bucket\"])\n\n# Create a new Data Source from an S3 Path\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\nmy_data = DataSource(source_path, name=\"abalone_glue_test\")\n
"},{"location":"glue/#glue-example-2","title":"Glue Example 2","text":"This example takes two 'Job Parameters'
The example will convert all CSV files in an S3 bucket/prefix and load them up as DataSources in SageWorks.
examples/glue_load_s3_bucket.py\n\nimport sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import glue_args_to_dict, list_s3_files\n\n# Convert Glue Job Args to a Dictionary\nglue_args = glue_args_to_dict(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"--sageworks-bucket\"])\n\n# List all the CSV files in the given S3 Path\ninput_s3_path = glue_args[\"--input-s3-path\"]\nfor input_file in list_s3_files(input_s3_path):\n\n    # Note: If we don't specify a name, one will be 'auto-generated'\n    my_data = DataSource(input_file, name=None)\n
"},{"location":"glue/#glue-job-local-testing","title":"Glue Job Local Testing","text":"Glue Power without the Pain. SageWorks manages the AWS Execution Role, so local API and Glue Jobs will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Glue Jobs a breeze.
export SAGEWORKS_CONFIG=<your config> # Only if not already set up\npython my_glue_job.py --sageworks-bucket <your bucket>\n
"},{"location":"glue/#additional-resources","title":"Additional Resources","text":"SageWorks Lambda Layers
AWS Lambda Jobs are a great way to spin up data processing jobs. Follow this guide and empower AWS Lambda with SageWorks!
SageWorks makes creating, testing, and debugging AWS Lambda Functions easy. The exact same SageWorks API Classes are used in your AWS Lambda Functions. Also, since SageWorks manages the access policies, you'll be able to test new Lambda Jobs locally, which minimizes surprises when deploying.
Work In Progress
The SageWorks Lambda Layers are a great way to use SageWorks, but they are still in 'beta' mode, so please let us know if you have any issues.
"},{"location":"lambda_layer/#lambda-job-setup","title":"Lambda Job Setup","text":"Setting up a AWS Lambda Job that uses SageWorks is straight forward. SageWorks can be 'installed' using a Lambda Layer and then you can use the Sageworks API just like normal.
Here are the ARNs for the current SageWorks Lambda Layers. Please note that they are specified with the region and Python version in the name, so if your Lambda runs in us-east-1 with Python 3.12, pick the ARN with those values in it.
us-east-1
us-west-2
Note: If you're using lambdas on a different region or with a different Python version, just let us know and we'll publish some additional layers.
At the bottom of the Lambda page there's an 'Add Layer' button. You can click that button and specify the layer using the ARN above. Also in the 'General Configuration' set these parameters:
Set the SAGEWORKS_BUCKET ENV: SageWorks will need to know what bucket to work out of, so go into the Configuration...Environment Variables... and add one for the SageWorks bucket that you are using for your AWS Account (dev, prod, etc.).
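If you prefer the CLI over the console, the same environment variable can be set with a call like this (the function name and bucket are placeholders):
aws lambda update-function-configuration \\\n    --function-name my-sageworks-lambda \\\n    --environment \"Variables={SAGEWORKS_BUCKET=my-company-sageworks}\"\n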
Lambda Role Details
If your Lambda Function already uses an existing IAM Role, then you can add the SageWorks policies to that Role to enable the Lambda Job to perform SageWorks API Tasks. See SageWorks Access Controls
"},{"location":"lambda_layer/#sageworks-lambda-example","title":"SageWorks Lambda Example","text":"Here's a simple example of using SageWorks in your Lambda Function.
examples/lambda_hello_world.py\n\nimport json\nfrom sageworks.utils.lambda_utils import load_lambda_layer\n\n# Load the SageWorks Lambda Layer\nload_lambda_layer()\n\n# Now we can use the normal SageWorks imports\nfrom sageworks.api import Meta, Model\n\ndef lambda_handler(event, context):\n\n    # Create our Meta Class and get a list of our Models\n    meta = Meta()\n    models = meta.models()\n\n    print(f\"Number of Models: {len(models)}\")\n    print(models)\n\n    # Get more detailed data on the Models\n    models_groups = meta.models_deep()\n    for name, model_versions in models_groups.items():\n        print(name)\n\n    # Onboard a model\n    model = Model(\"abalone-regression\")\n    model.onboard()\n\n    # Return success\n    return {\n        'statusCode': 200,\n        'body': { \"incoming_event\": event}\n    }\n
"},{"location":"lambda_layer/#lambda-function-local-testing","title":"Lambda Function Local Testing","text":"Lambda Power without the Pain. SageWorks manages the AWS Execution Role/Policies, so local API and Lambda Functions will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Lambda Functions a breeze.
python my_lambda_function.py --sageworks-bucket <your bucket>\n
"},{"location":"lambda_layer/#additional-resources","title":"Additional Resources","text":"Using SageWorks for ML Pipelines: SageWorks API Classes
Consulting Available: SuperCowPowers LLC
Artifact and Column Naming?
You might have noticed that SageWorks has some unintuitive constraints when naming Artifacts and restrictions on column names. All of these restrictions come from AWS. SageWorks uses Glue, Athena, Feature Store, Models, and Endpoints; each of these services has its own constraints, and SageWorks simply 'reflects' those constraints.
"},{"location":"misc/faq/#naming-underscores-dashes-and-lower-case","title":"Naming: Underscores, Dashes, and Lower Case","text":"Data Sources and Feature Sets must adhere to AWS restrictions on table names and columns names (here is a snippet from the AWS documentation)
Database, table, and column names
When you create schema in AWS Glue to query in Athena, consider the following:
A database name cannot be longer than 255 characters. A table name cannot be longer than 255 characters. A column name cannot be longer than 255 characters.
The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more info see: Glue Best Practices
"},{"location":"misc/faq/#datasourcefeatureset-use-_-and-modelendpoint-use-","title":"DataSource/FeatureSet use '_' and Model/Endpoint use '-'","text":"You may notice that DataSource and FeatureSet uuid/name examples have underscores but the model and endpoints have dashes. Yes, it\u2019s super annoying to have one convention for DataSources and FeatureSets and another for Models and Endpoints but this is an AWS restriction and not something that SageWorks can control.
DataSources and FeatureSets: Underscores. You cannot use a dash because both classes use Athena for storage, and Athena table names cannot contain a dash.
Models and Endpoints: Dashes. You cannot use an underscore because AWS imposes a restriction on the naming of these artifacts.
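For example, following the constructor calls used elsewhere in these docs (the names below are just placeholders):
from sageworks.api.data_source import DataSource\nfrom sageworks.api import Model\n\n# DataSource/FeatureSet names: lowercase letters, numbers, and underscores\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\", name=\"abalone_data\")\n\n# Model/Endpoint names: use dashes instead of underscores\nmodel = Model(\"abalone-regression\")\n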
"},{"location":"misc/faq/#additional-information-on-the-lower-case-issue","title":"Additional information on the lower case issue","text":"We\u2019ve tried to create a glue table with Mixed Case column names and haven\u2019t had any luck. We\u2019ve bypassed wrangler and used the boto3 low level calls directly. In all cases when it shows up in the Glue Table the columns have always been converted to lower case. We've also tried uses the Athena DDL directly, that also doesn't work. Here's the relevant AWS documentation and the two scripts that reproduce the issue.
AWS Docs
Scripts to Reproduce
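For reference, here's a minimal sketch of the kind of low-level boto3 call described above (the database, table, and bucket names are placeholders); when the resulting table is inspected in Glue, the column name comes back lower-cased:
import boto3\n\nglue = boto3.client(\"glue\")\nglue.create_table(\n    DatabaseName=\"sageworks\",\n    TableInput={\n        \"Name\": \"mixed_case_test\",\n        \"TableType\": \"EXTERNAL_TABLE\",\n        \"StorageDescriptor\": {\n            \"Columns\": [{\"Name\": \"MixedCaseColumn\", \"Type\": \"string\"}],\n            \"Location\": \"s3://my-sageworks-bucket/athena/mixed_case_test/\",\n        },\n    },\n)\n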
SageWorks is a medium granularity framework that manages and aggregates AWS\u00ae Services into classes and concepts. When you use SageWorks you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and managing a complex set of AWS Services. All the power and none of the pain so that your team can Do Science Faster!
"},{"location":"misc/general_info/#sageworks-documentation","title":"SageWorks Documentation","text":"See our Python API and AWS documentation here: SageWorks Documentation
"},{"location":"misc/general_info/#full-sageworks-overview","title":"Full SageWorks OverView","text":"SageWorks Architected FrameWork
"},{"location":"misc/general_info/#why-sageworks","title":"Why SageWorks?","text":"Visibility into the AWS Services that underpin the SageWorks Classes. We can see that SageWorks automatically tags and tracks the inputs of all artifacts providing 'data provenance' for all steps in the AWS modeling pipeline.
Image TBD
Clearly illustrated: SageWorks provides intuitive and transparent visibility into the full pipeline of your AWS SageMaker Deployments.
"},{"location":"misc/general_info/#getting-started","title":"Getting Started","text":"The SageWorks Classes are organized to work in concert with AWS Services. For more details on the current classes and class hierarchies see SageWorks Classes and Concepts.
"},{"location":"misc/general_info/#contributions","title":"Contributions","text":"If you'd like to contribute to the SageWorks project, you're more than welcome. All contributions will fall under the existing project license. If you are interested in contributing or have questions please feel free to contact us at sageworks@supercowpowers.com.
"},{"location":"misc/general_info/#sageworks-alpha-testers-wanted","title":"SageWorks Alpha Testers Wanted","text":"Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.
"},{"location":"misc/sageworks_classes_concepts/","title":"SageWorks Classes and Concepts","text":"A flexible, rapid, and customizable AWS\u00ae ML Sandbox. Here's some of the classes and concepts we use in the SageWorks system:
Endpoint
Transforms
Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
"},{"location":"misc/scp_consulting/#typical-engagements","title":"Typical Engagements","text":"SageWorks clients typically want a tailored web_interface that helps to drive business decisions and provides value for their organization.
Rapid Prototyping is typically done via these steps.
Quick Construction of Web Interface (tailored)
Go to Step 1
When the client is happy/excited about the prototype, we then bolt down the system, test the heavy paths, review AWS access and security, and ensure 'least privilege' roles and policies.
Contact us for a free initial consultation on how we can accelerate the use of AWS ML at your company: sageworks@supercowpowers.com.
"},{"location":"plugins/","title":"OverView","text":"SageWorks Plugins
The SageWorks toolkit provides a flexible plugin architecture to expand, enhance, or even replace the Dashboard. Make custom UI components, views, and entire pages with the plugin classes described here.
The SageWorks Plugin system allows clients to customize how their AWS Machine Learning Pipeline is displayed, analyzed, and visualized. Our easy-to-use Python API enables developers to make new Dash/Plotly components, data views, and entirely new web pages focused on business use cases.
"},{"location":"plugins/#concept-docs","title":"Concept Docs","text":"Many classes in SageWorks need additional high-level material that covers class design and illustrates class usage. Here's the Concept Docs for Plugins:
Each plugin class inherits from the SageWorks PluginInterface class and needs to set two attributes and implement two methods. These requirements are set so that each Plugin will conform to the SageWorks infrastructure; if the required attributes and methods aren\u2019t included in the class definition, errors will be raised during tests and at runtime.
from sageworks.web_components.plugin_interface import PluginInterface, PluginPage, PluginInputType\n\nclass MyPlugin(PluginInterface):\n    \"\"\"My Awesome Component\"\"\"\n\n    # Initialize the required attributes\n    plugin_page = PluginPage.MODEL\n    plugin_input_type = PluginInputType.MODEL\n\n    # Implement the two methods\n    def create_component(self, component_id: str) -> ComponentTypes:\n        < Function logic which creates a Dash Component >\n        return dcc.Graph(id=component_id, figure=self.waiting_figure())\n\n    def update_content(self, data_object: SageworksObject) -> ContentTypes:\n        < Function logic which creates a figure (go.Figure) >\n        return figure\n
"},{"location":"plugins/#required-attributes","title":"Required Attributes","text":"The class variable plugin_page determines what type of plugin the MyPlugin class is. This variable is inspected during plugin loading at runtime in order to load the plugin to the correct artifact page in the Sageworks dashboard. The PluginPage class can be DATA_SOURCE, FEATURE_SET, MODEL, or ENDPOINT.
"},{"location":"plugins/#s3-bucket-plugins-work-in-progress","title":"S3 Bucket Plugins (Work in Progress)","text":"Note: This functionality is coming soon
Offers the most flexibility and fast prototyping. Simply set your config/env for blah to an S3 Path and SageWorks will load the plugins from S3 directly.
Helpful Tip
You can copy files from your local system up to S3 with this handy AWS CLI call
aws s3 cp . s3://my-sageworks/sageworks_plugins \\\n --recursive --exclude \"*\" --include \"*.py\"\n
"},{"location":"plugins/#additional-resources","title":"Additional Resources","text":"Need help with plugins? Want to develop a customized application tailored to your business needs?
There were quite a few API changes for Plugins between 0.4.43
and 0.5.0
versions of SageWorks.
General: Classes that inherit from component_interface
or plugin_interface
are now 'auto wrapped' with an exception container. This container not only catches errors/crashes so they don't crash the application, but it also displays the error in the widget.
Specific Changes:
generate_component_figure
method is now update_contents
message_figure
method is now display_text
PluginType
was changed to PluginPage
(use CUSTOM to NOT autoload)PluginInputType.MODEL_DETAILS
changed to PluginInputType.MODEL
(since you're now getting a model object)FigureTypes
is now ContentTypes
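As a quick, illustrative sketch of the renames above (the method bodies and the display_text usage are placeholders; check the Plugin concept docs for the exact signatures):
from dash import dcc\nfrom sageworks.web_components.plugin_interface import PluginInterface, PluginPage, PluginInputType\n\nclass MyPlugin(PluginInterface):\n    plugin_page = PluginPage.MODEL             # was: PluginType.MODEL\n    plugin_input_type = PluginInputType.MODEL  # was: PluginInputType.MODEL_DETAILS\n\n    def create_component(self, component_id: str):\n        return dcc.Graph(id=component_id, figure=self.waiting_figure())\n\n    def update_contents(self, model):          # was: generate_component_figure()\n        text_figure = self.display_text(\"Model details go here\")  # was: message_figure()\n        return text_figure\n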
The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"presentations/#sageworks-presentations_1","title":"SageWorks Presentations","text":"The SageWorks API documentation SageWorks API covers our in-depth Python API and contains code examples. The code examples are provided in the Github repo examples/
directory. For a full code listing of any example please visit our SageWorks Examples
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates
"},{"location":"repl/","title":"SageWorks REPL","text":"Visibility and Control
The SageWorks REPL provides AWS ML Pipeline visibility just like the SageWorks Dashboard but also provides control over the creation, modification, and deletion of artifacts through the Python API.
The SageWorks REPL is a customized iPython shell. It provides tailored functionality for easy interaction with SageWorks objects, and since it's based on iPython, developers will feel right at home using autocomplete, history, help, etc. Both easy and powerful, the SageWorks REPL puts control of AWS ML Pipelines at your fingertips.
"},{"location":"repl/#installation","title":"Installation","text":"pip install sageworks
Just type sageworks
at the command line and the SageWorks shell will spin up and provide a command view of your AWS Machine Learning Pipelines.
At startup, the SageWorks shell will connect to your AWS Account and create a summary of the Machine Learning artifacts currently residing on the account.
Available Commands:
All of the API Classes are auto-loaded, so drilling down on an individual artifact is easy. The same Python API is provided so if you want additional info on a model, for instance, simply create a model object and use any of the documented API methods.
m = Model(\"abalone-regression\")\nm.details()\n<shows info about the model>\n
"},{"location":"repl/#additional-resources","title":"Additional Resources","text":"