From 4c4ffbc993811081a74a4f474104e544eb4b7886 Mon Sep 17 00:00:00 2001
From: brifordwylie TBD There was a small refactor of the cache decorator. We fixed a case where if we blocked on getting a value we also spun up a background thread to get it. This change will not affect existing code or APIs. Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward: The SageWorks framework makes AWS® both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap. It also dramatically improves both the usability and visibility across the entire spectrum of services: Glue Jobs, Athena, Feature Store, Models, and Endpoints. SageWorks makes it easy to build production ready, AWS powered, machine learning pipelines. Secure your Data, Empower your ML Pipelines SageWorks is architected as a Private SaaS. This hybrid architecture is the ultimate solution for businesses that prioritize data control and security. SageWorks deploys as an AWS Stack within your own cloud environment, ensuring compliance with stringent corporate and regulatory standards. It offers the flexibility to tailor solutions to your specific business needs through our comprehensive plugin support, both components and full web interfaces. By using SageWorks, you maintain absolute control over your data while benefiting from the power, security, and scalability of AWS cloud services. SageWorks Private SaaS Architecture The SageWorks package has two main components, a Web Interface that provides visibility into AWS ML Pipelines and a Python API that makes creation and usage of the AWS ML Services easier than using/learning the services directly. The SageWorks Dashboard has a set of web interfaces that give visibility into the AWS Glue and SageMaker Services. 
There are currently 5 web interfaces available: SageWorks API Documentation: SageWorks API Classes The main functionality of the Python API is to encapsulate and manage a set of AWS services underneath a Python Object interface. The Python Classes are used to create and interact with Machine Learning Pipeline Artifacts. SageWorks will need some initial setup when you first start using it. See our Getting Started guide on how to connect SageWorks to your AWS Account. Need Help? The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord. In general SageWorks works well, out of the box, with the standard set of limits for AWS accounts. SageWorks supports throttling, timeouts, and a broad set of AWS error handling routines for general purpose usage. When using SageWorks for large scale deployments there is a set of AWS Service limits that will need to be increased. There are two serverless endpoint quotas that will need to be adjusted. When running a large set of parallel Glue/Batch Jobs that are creating FeatureGroups, some clients have hit this limit. \"ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateFeatureGroup operation: The account-level service limit 'Maximum number of feature group creation workflows executing in parallel' is 4 FeatureGroups, with current utilization of 4 FeatureGroups and a request delta of 1 FeatureGroups. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.\"} Unfortunately this one is not adjustable through the AWS Service Quotas console and you'll have to initiate an AWS Support ticket. The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. 
Please contact us at sageworks@supercowpowers.com or chat us up on Discord. Notes and information on how to do the Docker Builds and Push to AWS ECR. Note: For a client-specific config file you'll need to copy it locally so that it's within Docker's 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the You have a API Changes
table_hash()
method to DataSources and FeatureSets (see above). Internal Changes
-Specific Code Changes
"},{"location":"#drill-down-views","title":"Drill-Down Views","text":"
"},{"location":"#private-saas-architecture","title":"Private SaaS Architecture","text":"
"},{"location":"#python-api","title":"Python API","text":"
"},{"location":"admin/aws_service_limits/","title":"AWS Service Limits","text":"
"},{"location":"admin/aws_service_limits/#parallel-featuregroup-creation","title":"Parallel FeatureGroup Creation","text":"
"},{"location":"admin/base_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"vi Dockerfile\n\n# Install latest Sageworks\nRUN pip install --no-cache-dir 'sageworks[ml-tool,chem]'==0.7.0\n
open_source_config.json
that's in the directory already.
"},{"location":"admin/base_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_base:v0_7_0_amd64 --platform linux/amd64 .\n
docker_local_base
alias in your ~/.zshrc
:)
"},{"location":"admin/base_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
docker tag sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
"},{"location":"admin/base_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:stable\n
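The tags used above follow a `v<major>_<minor>_<patch>_<arch>` pattern. As a small sketch, the tag for a given version could be derived like this. Note that the `image_tag` helper is hypothetical and not part of SageWorks; the repository name is simply the one used in the commands above.

```python
def image_tag(version: str, arch: str = "amd64") -> str:
    """Build a Docker tag like 'v0_7_0_amd64' from a version like '0.7.0'.

    Hypothetical helper for illustration; not part of SageWorks.
    """
    return f"v{version.replace('.', '_')}_{arch}"


# Repository name taken from the docker commands above
repo = "public.ecr.aws/m6i5k1r2/sageworks_base"
print(f"{repo}:{image_tag('0.7.0')}")
```

Deriving the tag from the version string in one place helps keep the `docker build`, `docker tag`, and `docker push` commands in sync.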
"},{"location":"admin/base_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_base
alias in your ~/.zshrc
:)
Notes and information on how to do the Dashboard Docker Builds and Push to AWS ECR.
"},{"location":"admin/dashboard_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"cd applications/aws_dashboard\nvi Dockerfile\n\n# Install Sageworks (changes often)\nRUN pip install --no-cache-dir sageworks==0.4.13 <-- change this\n
"},{"location":"admin/dashboard_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client-specific config file you'll need to copy it locally so that it's within Docker's 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_dashboard:v0_4_13_amd64 --platform linux/amd64 .\n
Docker with Custom Plugins: If you're using custom plugins you should visit our Dashboard with Plugins page.
"},{"location":"admin/dashboard_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"You have a docker_local_dashboard
alias in your ~/.zshrc
:)
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/dashboard_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
"},{"location":"admin/dashboard_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
"},{"location":"admin/dashboard_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_5_4_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
"},{"location":"admin/dashboard_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_dashboard
alias in your ~/.zshrc
:)
Notes and information on how to include plugins with your SageWorks Dashboard.
If you don't already have a Dockerfile, here's one to get you started, just place this into your repo/directory that has the plugins.
# Pull base sageworks dashboard image with specific tag (pick latest or stable)\nFROM public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n\n# Copy the plugin files into the Dashboard plugins dir\nCOPY ./sageworks_plugins /app/sageworks_plugins\nENV SAGEWORKS_PLUGINS=/app/sageworks_plugins\n
Note: Your plugins directory should look like this:
sageworks_plugins/\n pages/\n my_plugin_page.py\n ...\n views/\n my_plugin_view.py\n ...\n web_components/\n my_component.py\n ...\n
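Before building the image it can help to confirm the directory matches the layout above. Here's a minimal sketch of such a check. The `missing_plugin_dirs` helper is hypothetical, and treating all three subdirectories as expected is an assumption based only on the layout shown.

```python
from pathlib import Path

# Subdirectories shown in the layout above (assumed to all be expected)
EXPECTED_SUBDIRS = ("pages", "views", "web_components")


def missing_plugin_dirs(plugin_root: str) -> list:
    """Return the expected subdirectories that are missing under plugin_root."""
    root = Path(plugin_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_plugin_dirs("./sageworks_plugins")
    print("Missing:", missing if missing else "none")
```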
"},{"location":"admin/dashboard_with_plugins/#build-it","title":"Build it","text":"docker build -t my_sageworks_with_plugins:v1_0 --platform linux/amd64 .\n
"},{"location":"admin/dashboard_with_plugins/#test-the-image-locally","title":"Test the Image Locally","text":"You'll need AWS credentials for this, which is a bit complicated. Please contact SageWorks Support at sageworks@supercowpowers.com or chat us up on Discord.
"},{"location":"admin/dashboard_with_plugins/#login-to-your-ecr","title":"Login to your ECR","text":"Okay, so after testing locally you're ready to push the Docker image (with Plugins) to your ECR.
Note: This ECR should be private as your plugins are customized for specific business use cases.
Your ECR location will have this form
<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com\n
aws ecr get-login-password --region us-east-1 --profile <aws_profile> \\\n| docker login --username AWS --password-stdin \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com\n
"},{"location":"admin/dashboard_with_plugins/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag my_sageworks_with_plugins:v1_0 \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/sageworks_with_plugins:v1_0\n
docker push \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/sageworks_with_plugins:v1_0\n
"},{"location":"admin/dashboard_with_plugins/#deploying-plugin-docker-image-to-aws","title":"Deploying Plugin Docker Image to AWS","text":"Okay now that you have your plugin Docker Image you can deploy to your AWS account:
Copy the Dashboard CDK files
This is cheesy but just copy all the CDK files into your repo/directory.
cp -r sageworks/aws_setup/sageworks_dashboard_full /my/sageworks/stuff/\n
Change the Docker Image to Deploy
Now open up the app.py
file and change this line to your Docker Image
# When you want a different docker image change this line\ndashboard_image = \"public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_8_3_amd64\"\n
Make sure your SAGEWORKS_CONFIG
is properly set, and run the following commands:
export SAGEWORKS_CONFIG=/Users/<user_name>/.sageworks/sageworks_config.json\ncdk diff\ncdk deploy\n
CDK Diff
In particular, pay attention to the cdk diff
it should ONLY have the image name as a difference.
cdk diff\n[-] \"Image\": \"<account>.dkr.ecr.us-east-1/my-plugins:latest_123\",\n[+] \"Image\": \"<account>.dkr.ecr.us-east-1/my-plugins:latest_456\",\n
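The "only the image name should differ" check can also be done mechanically. The sketch below parses diff text in the simplified form shown above and collects which keys changed. The `changed_keys` helper is hypothetical, and real `cdk diff` output may contain additional formatting that this does not handle.

```python
def changed_keys(diff_text: str) -> set:
    """Collect the keys that appear on '[-]' or '[+]' lines of a diff."""
    keys = set()
    for line in diff_text.splitlines():
        line = line.strip()
        if line.startswith(("[-]", "[+]")):
            # Lines look like: [-] "Image": "<value>",
            key = line[3:].strip().split(":", 1)[0].strip().strip('"')
            keys.add(key)
    return keys


# Sample text copied from the diff shown above
sample = '''
[-] "Image": "<account>.dkr.ecr.us-east-1/my-plugins:latest_123",
[+] "Image": "<account>.dkr.ecr.us-east-1/my-plugins:latest_456",
'''
only_image_changed = changed_keys(sample) == {"Image"}
print(only_image_changed)
```

If anything other than `Image` shows up in the set, take a closer look before running `cdk deploy`.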
"},{"location":"admin/dashboard_with_plugins/#note-on-sageworks-configuration","title":"Note on SageWorks Configuration","text":"All Configuration is managed by the CDK Python Script and the SAGEWORKS_CONFIG
ENV var. If you want to change things like REDIS_HOST
or SAGEWORKS_BUCKET
you should do that with a sageworks.config
file and then point the SAGEWORKS_CONFIG
ENV var to that file.
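As a rough sketch of that workflow, the snippet below writes a config file and points the `SAGEWORKS_CONFIG` ENV var at it. The key names `REDIS_HOST` and `SAGEWORKS_BUCKET` come from the text above; the values are placeholders, and the JSON file format is an assumption based on the `sageworks_config.json` filename used earlier.

```python
import json
import os
import tempfile

# Placeholder values; only the key names come from the documentation above
config = {
    "REDIS_HOST": "localhost",
    "SAGEWORKS_BUCKET": "my-sageworks-bucket",
}

# Write the config file (a temp dir is used here just for the example)
config_path = os.path.join(tempfile.mkdtemp(), "sageworks_config.json")
with open(config_path, "w") as fp:
    json.dump(config, fp, indent=2)

# Point the SAGEWORKS_CONFIG ENV var at the file
os.environ["SAGEWORKS_CONFIG"] = config_path

# Read it back the way a consumer of the ENV var might
with open(os.environ["SAGEWORKS_CONFIG"]) as fp:
    loaded = json.load(fp)
print(loaded["SAGEWORKS_BUCKET"])
```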
Notes and information on how to do the PyPI release for the SageMaker project. For full details on packaging you can reference this page Packaging
The following instructions should work, but things change :)
"},{"location":"admin/pypi_release/#package-requirements","title":"Package Requirements","text":"The easiest thing to do is set up a \\~/.pypirc file with the following contents:
[distutils]\nindex-servers =\n pypi\n testpypi\n\n[pypi]\nusername = __token__\npassword = pypi-AgEIcH...\n\n[testpypi]\nusername = __token__\npassword = pypi-AgENdG...\n
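A quick sanity check of the file can save a failed upload. The sketch below uses Python's standard `configparser` to confirm that every server listed under `index-servers` has its own section and uses token authentication. The `check_pypirc` helper is hypothetical, and the tokens in the sample text are placeholders.

```python
import configparser

# Same shape as the file above, with placeholder tokens
PYPIRC_TEXT = """
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
username = __token__
password = pypi-PLACEHOLDER

[testpypi]
username = __token__
password = pypi-PLACEHOLDER
"""


def check_pypirc(text: str) -> list:
    """Return a list of problems found in .pypirc style text (empty if none)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    servers = parser.get("distutils", "index-servers", fallback="").split()
    for server in servers:
        if not parser.has_section(server):
            problems.append(f"missing section [{server}]")
        elif parser.get(server, "username", fallback="") != "__token__":
            problems.append(f"[{server}] username is not __token__")
    return problems


print(check_pypirc(PYPIRC_TEXT))
```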
"},{"location":"admin/pypi_release/#tox-background","title":"Tox Background","text":"Tox will install the SageMaker Sandbox package into a blank virtualenv and then execute all the tests against the newly installed package. So if everything goes okay, you know the PyPI package installed fine and the tests (which pull from the installed sageworks
package) also ran okay.
$ cd sageworks\n$ tox \n
If ALL the tests above pass...
"},{"location":"admin/pypi_release/#clean-any-previous-distribution-files","title":"Clean any previous distribution files","text":"make clean\n
"},{"location":"admin/pypi_release/#tag-the-new-version","title":"Tag the New Version","text":"git tag v0.1.8 (or whatever)\ngit push --tags\n
"},{"location":"admin/pypi_release/#create-the-test-pypi-release","title":"Create the TEST PyPI Release","text":"python -m build\ntwine upload dist/* -r testpypi\n
"},{"location":"admin/pypi_release/#install-the-test-pypi-release","title":"Install the TEST PyPI Release","text":"pip install --index-url https://test.pypi.org/simple sageworks\n
"},{"location":"admin/pypi_release/#create-the-real-pypi-release","title":"Create the REAL PyPI Release","text":"twine upload dist/* -r pypi\n
"},{"location":"admin/pypi_release/#push-any-possible-changes-to-github","title":"Push any possible changes to Github","text":"git push\n
"},{"location":"admin/sageworks_docker_for_lambdas/","title":"SageWorks Docker Image for Lambdas","text":"Using the SageWorks Docker Image for AWS Lambda Jobs allows your Lambda Jobs to use and create AWS ML Pipeline Artifacts with SageWorks.
AWS, for some reason, does not allow Public ECRs to be used for Lambda Docker images. So you'll have to copy the Docker image into your private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#creating-a-private-ecr","title":"Creating a Private ECR","text":"You only need to do this if you don't already have a private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console-to-create-private-ecr","title":"AWS Console to create Private ECR","text":"sageworks_base
.Create the ECR repository using the AWS CLI:
aws ecr create-repository --repository-name sageworks_base --region <region>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#pulling-docker-image-into-private-ecr","title":"Pulling Docker Image into Private ECR","text":"Note: You'll only need to do this when you want to update the SageWorks Docker image
Pull the SageWorks Public ECR Image
docker pull public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
Tag the image for your private ECR
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:latest \\\n<your-account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:latest\n
Push the image to your private ECR
aws ecr get-login-password --region <region> --profile <profile> | \\\ndocker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com\n\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n
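The same image URI shows up in the tag, login, and push commands, so it's easy to mistype. Here's a tiny sketch that builds it from its parts. The `private_image_uri` helper is hypothetical, and the account id and region below are placeholders.

```python
def private_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Build '<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>'."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder account id and region, for illustration only
uri = private_image_uri("123456789012", "us-east-1", "sageworks_base")
print(uri)
```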
"},{"location":"admin/sageworks_docker_for_lambdas/#using-the-docker-image-for-your-lambdas","title":"Using the Docker Image for your Lambdas","text":"Okay, now that you have the SageWorks Docker image in your private ECR, here's how you use that image for your Lambda jobs.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console","title":"AWS Console","text":"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>
.Create the Lambda function using the AWS CLI:
aws lambda create-function \\\n --region <region> \\\n --function-name myLambdaFunction \\\n --package-type Image \\\n --code ImageUri=<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag> \\\n --role arn:aws:iam::<account-id>:role/<execution-role>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#python-cdk","title":"Python CDK","text":"Define the Lambda function in your CDK app:
from aws_cdk import (\n    aws_iam as iam,\n    aws_lambda as _lambda,\n    core\n)\n\nclass MyLambdaStack(core.Stack):\n    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:\n        super().__init__(scope, id, **kwargs)\n\n        # Lambda function backed by the Docker image in your private ECR\n        _lambda.Function(self, \"MyLambdaFunction\",\n            code=_lambda.Code.from_ecr_image(\"<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\"),\n            handler=_lambda.Handler.FROM_IMAGE,\n            runtime=_lambda.Runtime.FROM_IMAGE,\n            # Existing execution role, imported by ARN\n            role=iam.Role.from_role_arn(self, \"LambdaRole\", \"arn:aws:iam::<account-id>:role/<execution-role>\"))\n\napp = core.App()\nMyLambdaStack(app, \"MyLambdaStack\")\napp.synth()\n
"},{"location":"admin/sageworks_docker_for_lambdas/#cloudformation","title":"Cloudformation","text":"Define the Lambda function in your CloudFormation template.
Resources:\n MyLambdaFunction:\n Type: AWS::Lambda::Function\n Properties:\n Code:\n ImageUri: <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n Role: arn:aws:iam::<account-id>:role/<execution-role>\n PackageType: Image\n
"},{"location":"api_classes/data_source/","title":"DataSource","text":"DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. DataSource can read data from many different sources: S3 data, local files, and Pandas DataFrames.
DataSource: Manages AWS Data Catalog creation and management. DataSources are set up so that they can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). DataSources can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource","title":"DataSource
","text":" Bases: AthenaSource
DataSource: SageWorks DataSource API Class
Common Usage
my_data = DataSource(name_of_source)\nmy_data.details()\nmy_features = my_data.to_features()\n
Source code in src/sageworks/api/data_source.py
class DataSource(AthenaSource):\n \"\"\"DataSource: SageWorks DataSource API Class\n\n Common Usage:\n ```python\n my_data = DataSource(name_of_source)\n my_data.details()\n my_features = my_data.to_features()\n ```\n \"\"\"\n\n def __init__(self, source: Union[str, pd.DataFrame], name: str = None, tags: list = None, **kwargs):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (Union[str, pd.DataFrame]): Source of data (existing name, filepath, S3 path, or a Pandas DataFrame)\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Ensure the ds_name is valid\n if name:\n Artifact.is_name_valid(name)\n\n # If the data source name wasn't given, generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Sanity check for dataframe sources\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name, **kwargs)\n\n def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n 
Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().table\n query = f'SELECT * FROM \"{table}\"'\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_features(\n self,\n name: str,\n id_column: str,\n tags: list = None,\n event_time_column: str = None,\n one_hot_columns: list = None,\n ) -> Union[FeatureSet, None]:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase).\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n tags (list, optional): Set the tags for the feature set. If not specified tags will be generated\n event_time_column (str, optional): Set the event time for feature set. If not specified will be generated\n one_hot_columns (list, optional): Set the columns to be one-hot encoded. 
(default: None)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if not Artifact.is_name_valid(name):\n self.log.critical(f\"Invalid FeatureSet name: {name}, not creating FeatureSet!\")\n return None\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n id_column=id_column,\n event_time_column=event_time_column,\n one_hot_columns=one_hot_columns,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n\n def _load_source(self, source: str, name: str, tags: list):\n \"\"\"Load the source of the data\"\"\"\n self.log.info(f\"Loading source: {source}...\")\n\n # Pandas DataFrame Source\n if isinstance(source, pd.DataFrame):\n my_loader = PandasToData(name)\n my_loader.set_input(source)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # S3 Source\n source = source if isinstance(source, str) else str(source)\n if source.startswith(\"s3://\"):\n my_loader = S3ToDataSourceLight(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # File Source\n elif os.path.isfile(source):\n my_loader = CSVToDataSource(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.__init__","title":"__init__(source, name=None, tags=None, **kwargs)
","text":"Initializes a new DataSource object.
Parameters:
Name | Type | Description | Default
source | Union[str, DataFrame] | Source of data (existing name, filepath, S3 path, or a Pandas DataFrame) | required
name | str | The name of the data source (must be lowercase). If not specified, a name will be generated | None
tags | list[str] | A list of tags associated with the data source. If not specified tags will be generated. | None
Source code in src/sageworks/api/data_source.py
def __init__(self, source: Union[str, pd.DataFrame], name: str = None, tags: list = None, **kwargs):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (Union[str, pd.DataFrame]): Source of data (existing name, filepath, S3 path, or a Pandas DataFrame)\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Ensure the ds_name is valid\n if name:\n Artifact.is_name_valid(name)\n\n # If the data source name wasn't given, generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Sanity check for dataframe sources\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name, **kwargs)\n
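The constructor above requires lowercase names and generates one when none is given. To make the idea concrete, here's a minimal sketch of what such a check and cleanup could look like. Both helpers are hypothetical illustrations; the actual rules applied by `Artifact.is_name_valid()` and `Artifact.generate_valid_name()` are not shown in this page and may differ.

```python
import re


def is_lowercase_name(name: str) -> bool:
    """True if the name has no uppercase characters (the documented requirement)."""
    return name == name.lower()


def suggest_valid_name(name: str) -> str:
    """Lowercase the name and replace runs of other characters with '_'.

    Assumption: illustration only, SageWorks may apply different rules.
    """
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


print(is_lowercase_name("Abalone_Data"))
print(suggest_valid_name("Abalone Data-2024"))
```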
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.details","title":"details(**kwargs)
","text":"DataSource Details
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the DataSource
Source code in src/sageworks/api/data_source.py
def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this DataSource
Parameters:
Name | Type | Description | Default
include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False
Returns:
Type | Description
DataFrame | pd.DataFrame: A DataFrame of ALL the data from this DataSource
Note: Obviously this is not recommended for large datasets :)
Source code in src/sageworks/api/data_source.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().table\n query = f'SELECT * FROM \"{table}\"'\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name | Type | Description | Default
query | str | The query to run against the DataSource | required
Returns:
Type | Description
DataFrame | pd.DataFrame: The results of the query
Source code in src/sageworks/api/data_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.to_features","title":"to_features(name, id_column, tags=None, event_time_column=None, one_hot_columns=None)
","text":"Convert the DataSource to a FeatureSet
Parameters:
Name | Type | Description | Default
name | str | Set the name for feature set (must be lowercase). | required
id_column | str | The ID column (must be specified, use \"auto\" for auto-generated IDs). | required
tags | list | Set the tags for the feature set. If not specified tags will be generated | None
event_time_column | str | Set the event time for feature set. If not specified will be generated | None
one_hot_columns | list | Set the columns to be one-hot encoded. (default: None) | None
Returns:
Name | Type | Description
FeatureSet | Union[FeatureSet, None] | The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)
Source code in src/sageworks/api/data_source.py
def to_features(\n self,\n name: str,\n id_column: str,\n tags: list = None,\n event_time_column: str = None,\n one_hot_columns: list = None,\n) -> Union[FeatureSet, None]:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase).\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n tags (list, optional): Set the tags for the feature set. If not specified tags will be generated\n event_time_column (str, optional): Set the event time for feature set. If not specified will be generated\n one_hot_columns (list, optional): Set the columns to be one-hot encoded. (default: None)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if not Artifact.is_name_valid(name):\n self.log.critical(f\"Invalid FeatureSet name: {name}, not creating FeatureSet!\")\n return None\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n id_column=id_column,\n event_time_column=event_time_column,\n one_hot_columns=one_hot_columns,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n
"},{"location":"api_classes/data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples.
Create a DataSource from an S3 Path or File Path
datasource_from_s3.py
from sageworks.api.data_source import DataSource\n\n# Create a DataSource from an S3 Path (or a local file)\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\n# source_path = \"/full/path/to/local/file.csv\"\n\nmy_data = DataSource(source_path)\nprint(my_data.details())\n
Create a DataSource from a Pandas Dataframe
datasource_from_df.py
from sageworks.utils.test_data_generator import TestDataGenerator\nfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from a Pandas DataFrame\ngen_data = TestDataGenerator()\ndf = gen_data.person_data()\n\ntest_data = DataSource(df, name=\"test_data\")\nprint(test_data.details())\n
Query a DataSource
All SageWorks DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
datasource_query.py
from sageworks.api.data_source import DataSource\n\n# Grab a DataSource\nmy_data = DataSource(\"abalone_data\")\n\n# Make some queries using the Athena backend\ndf = my_data.query(\"select * from abalone_data where height > .3\")\nprint(df.head())\n\ndf = my_data.query(\"select * from abalone_data where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a FeatureSet from a DataSource
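datasource_to_featureset.py
from sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\n# (to_features() requires a name and an id_column; \"auto\" generates the IDs,\n# and \"test_features\" is just an example name)\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features(name=\"test_features\", id_column=\"auto\")\nprint(my_features.details())\n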
"},{"location":"api_classes/data_source/#sageworks-ui","title":"SageWorks UI","text":"Whenever a DataSource is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: DataSources
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/df_store/","title":"SageWorks DataFrame Storage","text":"Examples
Examples of using the DataFrame Storage class are listed at the bottom of this page Examples.
"},{"location":"api_classes/df_store/#why-dataframe-storage","title":"Why DataFrame Storage?","text":"Great question! There are a couple of reasons. The first is that the Parameter Store in AWS has a 4KB limit, so it won't support any kind of 'real data'. The second is that DataFrames are commonly used throughout data engineering, data science, and ML pipeline construction. Storing named DataFrames in an accessible location that your ML team can inspect and use comes in super handy.
"},{"location":"api_classes/df_store/#efficient-storage","title":"Efficient Storage","text":"All DataFrames are stored in the Parquet format using 'snappy' compression. Parquet is a columnar storage format that efficiently handles large datasets, and Snappy compression reduces file size while maintaining fast read/write speeds.
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
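To make the size argument concrete, here is a minimal sketch that checks whether a serialized payload would fit under the 4KB Parameter Store limit mentioned above. The helper name fits_in_parameter_store and the choice of JSON as the serialization are assumptions made for illustration; neither is part of the SageWorks API.

```python
import json

# 4KB limit quoted above for the AWS Parameter Store (taken here as 4096 bytes)
PARAM_STORE_LIMIT_BYTES = 4 * 1024


def fits_in_parameter_store(payload) -> bool:
    # Hypothetical helper: serialize to JSON and compare the UTF-8 size to the limit
    size = len(json.dumps(payload).encode('utf-8'))
    return size <= PARAM_STORE_LIMIT_BYTES


# A small config-style value fits easily
small = {'model': 'abalone-regression', 'threshold': 0.5}

# Even a modest table (1000 rows, two columns) does not
rows = [{'A': 1, 'B': 3} for _ in range(1000)]

print(fits_in_parameter_store(small))
print(fits_in_parameter_store(rows))
```

Anything that fails this kind of check is a natural candidate for DataFrame storage rather than parameter storage.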
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore","title":"DFStore
","text":" Bases: AWSDFStore
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
Common Usage
df_store = DFStore()\n\n# List Data\ndf_store.list()\n\n# Add DataFrame\ndf = pd.DataFrame({\"A\": [1, 2], \"B\": [3, 4]})\ndf_store.upsert(\"/test/my_data\", df)\n\n# Retrieve DataFrame\ndf = df_store.get(\"/test/my_data\")\nprint(df)\n\n# Delete Data\ndf_store.delete(\"/test/my_data\")\n
Source code in src/sageworks/api/df_store.py
class DFStore(AWSDFStore):\n \"\"\"DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy\n\n Common Usage:\n ```python\n df_store = DFStore()\n\n # List Data\n df_store.list()\n\n # Add DataFrame\n df = pd.DataFrame({\"A\": [1, 2], \"B\": [3, 4]})\n df_store.upsert(\"/test/my_data\", df)\n\n # Retrieve DataFrame\n df = df_store.get(\"/test/my_data\")\n print(df)\n\n # Delete Data\n df_store.delete(\"/test/my_data\")\n ```\n \"\"\"\n\n def __init__(self, path_prefix: Union[str, None] = None):\n \"\"\"DFStore Init Method\n\n Args:\n path_prefix (Union[str, None], optional): Add a path prefix to storage locations (Defaults to None)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize the SuperClass\n super().__init__(path_prefix=path_prefix)\n\n def list(self, include_cache: bool = False) -> list:\n \"\"\"List all the objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the list (Defaults to False).\n\n Returns:\n list: A list of all the objects in the data_store prefix.\n \"\"\"\n return super().list(include_cache=include_cache)\n\n def summary(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.\n\n Args:\n include_cache (bool, optional): Include cache objects in the summary (Defaults to False).\n\n Returns:\n pd.DataFrame: A formatted DataFrame with the summary details.\n \"\"\"\n return super().summary(include_cache=include_cache)\n\n def details(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame with detailed metadata for all objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the details (Defaults to False).\n\n Returns:\n pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix.\n \"\"\"\n return super().details(include_cache=include_cache)\n\n def check(self, 
location: str) -> bool:\n \"\"\"Check if a DataFrame exists at the specified location\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n bool: True if the data exists, False otherwise.\n \"\"\"\n return super().check(location)\n\n def get(self, location: str) -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve a DataFrame from AWS S3.\n\n Args:\n location (str): The location of the data to retrieve.\n\n Returns:\n pd.DataFrame: The retrieved DataFrame or None if not found.\n \"\"\"\n _df = super().get(location)\n if _df is None:\n self.log.error(f\"Dataframe not found at location: {location}\")\n return _df\n\n def upsert(self, location: str, data: Union[pd.DataFrame, pd.Series]):\n \"\"\"Insert or update a DataFrame or Series in the AWS S3.\n\n Args:\n location (str): The location of the data.\n data (Union[pd.DataFrame, pd.Series]): The data to be stored.\n \"\"\"\n super().upsert(location, data)\n\n def last_modified(self, location: str) -> Union[datetime, None]:\n \"\"\"Get the last modified date of the DataFrame at the specified location.\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n Union[datetime, None]: The last modified date of the DataFrame or None if not found.\n \"\"\"\n return super().last_modified(location)\n\n def delete(self, location: str):\n \"\"\"Delete a DataFrame from the AWS S3.\n\n Args:\n location (str): The location of the data to delete.\n \"\"\"\n super().delete(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.__init__","title":"__init__(path_prefix=None)
","text":"DFStore Init Method
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path_prefix | Union[str, None] | Add a path prefix to storage locations (Defaults to None) | None |

Source code in src/sageworks/api/df_store.py
def __init__(self, path_prefix: Union[str, None] = None):\n \"\"\"DFStore Init Method\n\n Args:\n path_prefix (Union[str, None], optional): Add a path prefix to storage locations (Defaults to None)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize the SuperClass\n super().__init__(path_prefix=path_prefix)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.check","title":"check(location)
","text":"Check if a DataFrame exists at the specified location
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data to check. | required |

Returns:
| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the data exists, False otherwise. |

Source code in src/sageworks/api/df_store.py
def check(self, location: str) -> bool:\n \"\"\"Check if a DataFrame exists at the specified location\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n bool: True if the data exists, False otherwise.\n \"\"\"\n return super().check(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.delete","title":"delete(location)
","text":"Delete a DataFrame from the AWS S3.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data to delete. | required |

Source code in src/sageworks/api/df_store.py
def delete(self, location: str):\n \"\"\"Delete a DataFrame from the AWS S3.\n\n Args:\n location (str): The location of the data to delete.\n \"\"\"\n super().delete(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.details","title":"details(include_cache=False)
","text":"Return a DataFrame with detailed metadata for all objects in the data_store prefix.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_cache | bool | Include cache objects in the details (Defaults to False). | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix. |

Source code in src/sageworks/api/df_store.py
def details(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame with detailed metadata for all objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the details (Defaults to False).\n\n Returns:\n pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix.\n \"\"\"\n return super().details(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.get","title":"get(location)
","text":"Retrieve a DataFrame from AWS S3.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data to retrieve. | required |

Returns:
| Type | Description |
| --- | --- |
| Union[DataFrame, None] | pd.DataFrame: The retrieved DataFrame or None if not found. |

Source code in src/sageworks/api/df_store.py
def get(self, location: str) -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve a DataFrame from AWS S3.\n\n Args:\n location (str): The location of the data to retrieve.\n\n Returns:\n pd.DataFrame: The retrieved DataFrame or None if not found.\n \"\"\"\n _df = super().get(location)\n if _df is None:\n self.log.error(f\"Dataframe not found at location: {location}\")\n return _df\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.last_modified","title":"last_modified(location)
","text":"Get the last modified date of the DataFrame at the specified location.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data to check. | required |

Returns:
| Type | Description |
| --- | --- |
| Union[datetime, None] | Union[datetime, None]: The last modified date of the DataFrame or None if not found. |

Source code in src/sageworks/api/df_store.py
def last_modified(self, location: str) -> Union[datetime, None]:\n \"\"\"Get the last modified date of the DataFrame at the specified location.\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n Union[datetime, None]: The last modified date of the DataFrame or None if not found.\n \"\"\"\n return super().last_modified(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.list","title":"list(include_cache=False)
","text":"List all the objects in the data_store prefix.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_cache | bool | Include cache objects in the list (Defaults to False). | False |

Returns:
| Name | Type | Description |
| --- | --- | --- |
| list | list | A list of all the objects in the data_store prefix. |

Source code in src/sageworks/api/df_store.py
def list(self, include_cache: bool = False) -> list:\n \"\"\"List all the objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the list (Defaults to False).\n\n Returns:\n list: A list of all the objects in the data_store prefix.\n \"\"\"\n return super().list(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.summary","title":"summary(include_cache=False)
","text":"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_cache | bool | Include cache objects in the summary (Defaults to False). | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A formatted DataFrame with the summary details. |

Source code in src/sageworks/api/df_store.py
def summary(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.\n\n Args:\n include_cache (bool, optional): Include cache objects in the summary (Defaults to False).\n\n Returns:\n pd.DataFrame: A formatted DataFrame with the summary details.\n \"\"\"\n return super().summary(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.upsert","title":"upsert(location, data)
","text":"Insert or update a DataFrame or Series in the AWS S3.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data. | required |
| data | Union[DataFrame, Series] | The data to be stored. | required |

Source code in src/sageworks/api/df_store.py
def upsert(self, location: str, data: Union[pd.DataFrame, pd.Series]):\n \"\"\"Insert or update a DataFrame or Series in the AWS S3.\n\n Args:\n location (str): The location of the data.\n data (Union[pd.DataFrame, pd.Series]): The data to be stored.\n \"\"\"\n super().upsert(location, data)\n
"},{"location":"api_classes/df_store/#examples","title":"Examples","text":"These examples show how to use the DFStore()
class to list, add, and get dataframes from AWS Storage.
SageWorks REPL
If you'd like to experiment with listing, adding, and getting dataframes with the DFStore()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
import pandas as pd\nfrom sageworks.api.df_store import DFStore\n\ndf_store = DFStore()\n\n# List DataFrames\ndf_store.list()\n\nOut[1]:\nml/confustion_matrix (0.002MB/2024-09-23 16:44:48)\nml/hold_out_ids (0.094MB/2024-09-23 16:57:01)\nml/my_awesome_df (0.002MB/2024-09-23 16:43:30)\nml/shap_values (0.019MB/2024-09-23 16:57:21)\n\n# Add a DataFrame\ndf = pd.DataFrame({\"A\": [1]*1000, \"B\": [3]*1000})\ndf_store.upsert(\"test/test_df\", df)\n\n# List DataFrames (we can just use the REPR)\ndf_store\n\nOut[2]:\nml/confustion_matrix (0.002MB/2024-09-23 16:44:48)\nml/hold_out_ids (0.094MB/2024-09-23 16:57:01)\nml/my_awesome_df (0.002MB/2024-09-23 16:43:30)\nml/shap_values (0.019MB/2024-09-23 16:57:21)\ntest/test_df (0.002MB/2024-09-23 16:59:27)\n\n# Retrieve dataframes\nreturn_df = df_store.get(\"test/test_df\")\nreturn_df.head()\n\nOut[3]:\n A B\n0 1 3\n1 1 3\n2 1 3\n3 1 3\n4 1 3\n\n# Delete dataframes\ndf_store.delete(\"test/test_df\")\n
Compressed Storage is Automatic
All DataFrames are stored in the Parquet format using 'snappy' compression. Parquet is a columnar storage format that efficiently handles large datasets, and Snappy compression reduces file size while maintaining fast read/write speeds.
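Because get() returns None when nothing is stored at a location, a common pattern is to check first and build the data only when it is missing. The sketch below illustrates that pattern against the check/get/upsert methods documented above. The InMemoryStore class and the get_or_build helper are hypothetical stand-ins so the example runs without AWS access; they are not part of SageWorks.

```python
# In-memory stand-in so the sketch runs without AWS access.
# It is hypothetical and only mimics check/get/upsert as documented above.
class InMemoryStore:
    def __init__(self):
        self._data = {}

    def check(self, location):
        return location in self._data

    def get(self, location):
        return self._data.get(location)

    def upsert(self, location, data):
        self._data[location] = data


def get_or_build(store, location, build_fn):
    # Return the stored object if present, otherwise build it, store it, and return it
    if store.check(location):
        return store.get(location)
    data = build_fn()
    store.upsert(location, data)
    return data


store = InMemoryStore()
calls = []


def expensive_build():
    # Record each call so we can see how often the build actually runs
    calls.append(1)
    return {'A': [1, 2], 'B': [3, 4]}


first = get_or_build(store, 'test/my_data', expensive_build)
second = get_or_build(store, 'test/my_data', expensive_build)
print(len(calls))
```

Assuming the real DFStore methods behave as described in their docstrings, the same helper should work by passing a DFStore() instance and a build function that returns a DataFrame.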
"},{"location":"api_classes/endpoint/","title":"Endpoint","text":"Endpoint Examples
Examples of using the Endpoint class are listed at the bottom of this page Examples.
Endpoint: Manages AWS Endpoint creation and deployment. Endpoints are automatically set up and provisioned for deployment into AWS. Endpoints can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint","title":"Endpoint
","text":" Bases: EndpointCore
Endpoint: SageWorks Endpoint API Class
Common Usage
my_endpoint = Endpoint(name)\nmy_endpoint.details()\nmy_endpoint.inference(eval_df)\n
Source code in src/sageworks/api/endpoint.py
class Endpoint(EndpointCore):\n \"\"\"Endpoint: SageWorks Endpoint API Class\n\n Common Usage:\n ```python\n my_endpoint = Endpoint(name)\n my_endpoint.details()\n my_endpoint.inference(eval_df)\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n\n def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... just FAST Inference!\n \"\"\"\n return super().fast_inference(eval_df)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the Endpoint using the FeatureSet evaluation data
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| capture | bool | Capture the inference results | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: The DataFrame with predictions |

Source code in src/sageworks/api/endpoint.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.details","title":"details(**kwargs)
","text":"Endpoint Details
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the Endpoint |

Source code in src/sageworks/api/endpoint.py
def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.fast_inference","title":"fast_inference(eval_df)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| eval_df | DataFrame | The DataFrame to run predictions on | required |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: The DataFrame with predictions |

Note
There's no sanity checks or error handling... just FAST Inference!
Source code in src/sageworks/api/endpoint.py
def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... just FAST Inference!\n \"\"\"\n return super().fast_inference(eval_df)\n
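Since fast_inference skips sanity checks, callers may want a lightweight check of their own before invoking it. The following is a minimal sketch of one such check, which verifies that the expected feature columns are present. The missing_columns helper and the example feature list are hypothetical and are not part of the SageWorks API.

```python
def missing_columns(columns, required):
    # Return the required column names that are absent, preserving order
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical feature list; in practice this would come from the model's feature list
required = ['length', 'diameter', 'height']
columns = ['sex', 'length', 'height', 'whole_weight']

missing = missing_columns(columns, required)
if missing:
    print(f'Refusing to call fast_inference, missing columns: {missing}')
```

With a pandas DataFrame, the columns argument would typically be list(eval_df.columns).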
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| eval_df | DataFrame | The DataFrame to run predictions on | required |
| capture_uuid | str | The UUID of the capture to use (default: None) | None |
| id_column | str | The name of the column to use as the ID (default: None) | None |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: The DataFrame with predictions |

Source code in src/sageworks/api/endpoint.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n
"},{"location":"api_classes/endpoint/#examples","title":"Examples","text":"Run Inference on an Endpoint
endpoint_inference.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model\nfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks has full ML Pipeline provenance, so we can backtrack the inputs,\n# get a DataFrame of data (not used for training) and run inference\nmodel = Model(endpoint.get_input())\nfs = FeatureSet(model.get_input())\nathena_table = fs.view(\"training\").table\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = FALSE\")\n\n# Run inference/predictions on the Endpoint\nresults_df = endpoint.inference(df)\n\n# Run inference/predictions and capture the results\n# (capture_uuid is the documented parameter; \"my_capture\" is a placeholder value)\nresults_df = endpoint.inference(df, capture_uuid=\"my_capture\")\n\n# Run inference/predictions using the FeatureSet evaluation data\nresults_df = endpoint.auto_inference(capture=True)\n
Output
Processing...\n class_number_of_rings prediction\n0 13 11.477922\n1 12 12.316887\n2 8 7.612847\n3 8 9.663341\n4 9 9.075263\n.. ... ...\n839 8 8.069856\n840 15 14.915502\n841 11 10.977605\n842 10 10.173433\n843 7 7.297976\n
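Once predictions are back, a quick error summary is often the next step. The sketch below computes RMSE and MAE in plain Python on the first five rows of the output above. The regression_errors helper is an illustrative assumption and not a SageWorks function; SageWorks reports its own model metrics through details().

```python
import math


def regression_errors(targets, predictions):
    # Return (rmse, mae) for two equal-length sequences of numbers
    if len(targets) != len(predictions) or not targets:
        raise ValueError('targets and predictions must be non-empty and the same length')
    diffs = [t - p for t, p in zip(targets, predictions)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# First five rows of the output above
targets = [13, 12, 8, 8, 9]
predictions = [11.477922, 12.316887, 7.612847, 9.663341, 9.075263]

rmse, mae = regression_errors(targets, predictions)
print(f'RMSE={rmse:.3f} MAE={mae:.3f}')
```

With the real results DataFrame, the two lists could be taken from results_df['class_number_of_rings'].tolist() and results_df['prediction'].tolist().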
Endpoint Details
The details() method
The details() method on the Endpoint class provides a lot of useful information. All of the SageWorks classes have a details() method, so try it out!
from sageworks.api.endpoint import Endpoint\nfrom pprint import pprint\n\n# Get Endpoint and print out its details\nendpoint = Endpoint(\"abalone-regression-end\")\npprint(endpoint.details())\n
Output
{\n 'input': 'abalone-regression',\n 'instance': 'Serverless (2GB/5)',\n 'model_metrics': metric_name value\n 0 RMSE 2.190\n 1 MAE 1.544\n 2 R2 0.504,\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'modified': datetime.datetime(2023, 12, 29, 17, 48, 35, 115000, tzinfo=datetime.timezone.utc),\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n 'sageworks_tags': ['abalone', 'regression'],\n 'status': 'InService',\n 'uuid': 'abalone-regression-end',\n 'variant': 'AllTraffic'}\n
Endpoint Metrics
endpoint_metrics.py
from sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks tracks both Model performance and Endpoint Metrics\nmodel_metrics = endpoint.details()[\"model_metrics\"]\nendpoint_metrics = endpoint.endpoint_metrics()\nprint(model_metrics)\nprint(endpoint_metrics)\n
Output
metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n Invocations ModelLatency OverheadLatency ModelSetupTime Invocation5XXErrors\n29 0.0 0.00 0.00 0.00 0.0\n30 1.0 1.11 23.73 23.34 0.0\n31 0.0 0.00 0.00 0.00 0.0\n48 0.0 0.00 0.00 0.00 0.0\n49 5.0 0.45 9.64 23.57 0.0\n50 2.0 0.57 0.08 0.00 0.0\n51 0.0 0.00 0.00 0.00 0.0\n60 4.0 0.33 5.80 22.65 0.0\n61 1.0 1.11 23.35 23.10 0.0\n62 0.0 0.00 0.00 0.00 0.0\n...\n
"},{"location":"api_classes/endpoint/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint. The Endpoint artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI. SageWorks monitors the endpoint, plots invocations and latencies, and tracks error metrics.
SageWorks Dashboard: Endpoints
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/feature_set/","title":"FeatureSet","text":"FeatureSet Examples
Examples of using the FeatureSet Class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually but the SageWorks FeatureSet makes it a breeze!
FeatureSet: Manages AWS Feature Store/Group creation and management. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). FeatureSets can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet","title":"FeatureSet
","text":" Bases: FeatureSetCore
FeatureSet: SageWorks FeatureSet API Class
Common Usage
my_features = FeatureSet(name)\nmy_features.details()\nmy_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"],\n)\n
Source code in src/sageworks/api/feature_set.py
class FeatureSet(FeatureSetCore):\n \"\"\"FeatureSet: SageWorks FeatureSet API Class\n\n Common Usage:\n ```python\n my_features = FeatureSet(name)\n my_features.details()\n my_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n feature_list=[\"my\", \"best\", \"features\"])\n )\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n ) -> Union[Model, None]:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n 
model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet (or None if the Model could not be created)\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n if not Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False):\n self.log.critical(f\"Invalid Model name: {name}, not creating Model!\")\n return None\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.details","title":"details(**kwargs)
","text":"FeatureSet Details
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the FeatureSet |

Source code in src/sageworks/api/feature_set.py
def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this FeatureSet
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A DataFrame of ALL the data from this FeatureSet |

Note
Obviously this is not recommended for large datasets :)
Source code in src/sageworks/api/feature_set.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
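To illustrate what include_aws_columns controls, here is a small pandas-free sketch that applies the same column filter to plain Python records. The column names are copied from the source above. The drop_aws_columns helper and the sample row are assumptions for illustration only.

```python
# AWS-generated column names, copied from pull_dataframe above
AWS_COLS = ['write_time', 'api_invocation_time', 'is_deleted', 'event_time']


def drop_aws_columns(rows, include_aws_columns=False):
    # rows: list of dicts, one per record; AWS columns that are absent are simply ignored
    if include_aws_columns:
        return rows
    return [{k: v for k, v in row.items() if k not in AWS_COLS} for row in rows]


# Hypothetical record with two AWS-generated columns attached
rows = [{'id': 1, 'height': 0.3, 'write_time': '2024-01-01', 'is_deleted': False}]
print(drop_aws_columns(rows))
```

This mirrors the errors='ignore' behavior in the source: columns that are not present do not cause a failure.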
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.query","title":"query(query, **kwargs)
","text":"Query the AthenaSource
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The query to run against the FeatureSet | required |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: The results of the query |

Source code in src/sageworks/api/feature_set.py
def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.to_model","title":"to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)
","text":"Create a Model from the FeatureSet
Args:
model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\nmodel_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\nname (str): Set the name for the model. If not specified, a name will be generated\ntags (list): Set the tags for the model. If not specified tags will be generated.\ndescription (str): Set the description for the model. If not specified a description is generated.\nfeature_list (list): Set the feature list for the model. If not specified a feature list is generated.\ntarget_column (str): The target column for the model (use None for unsupervised model)\n
Returns:
| Name | Type | Description |
| --- | --- | --- |
| Model | Union[Model, None] | The Model created from the FeatureSet (or None if the Model could not be created) |

Source code in src/sageworks/api/feature_set.py
def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n) -> Union[Model, None]:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet (or None if the Model could not be created)\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n if not Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False):\n self.log.critical(f\"Invalid Model name: {name}, not creating Model!\")\n return None\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
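When no name is passed, the source above derives one from the FeatureSet uuid. The sketch below walks through that derivation. The first step is taken directly from the source. The second step is only an assumption standing in for Artifact.generate_valid_name, whose exact rules are not shown here, so treat the results as illustrative.

```python
def default_model_name(featureset_uuid):
    # Step 1 (from the source above): strip '_features' and append '-model'
    name = featureset_uuid.replace('_features', '') + '-model'
    # Step 2 (assumption): stand-in for Artifact.generate_valid_name(name, delimiter='-'),
    # approximated here by swapping underscores for the '-' delimiter
    return name.replace('_', '-')


print(default_model_name('abalone_features'))
print(default_model_name('wine_quality_features'))
```

Passing an explicit name to to_model avoids relying on the generated default.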
"},{"location":"api_classes/feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a FeatureSet from a Datasource
datasource_to_featureset.py
from sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\nds = DataSource('test_data')\nfs = ds.to_features(\"test_features\", id_column=\"id\")\nprint(fs.details())\n
FeatureSet EDA Statistics
featureset_eda.py
from pprint import pprint\nfrom sageworks.api.feature_set import FeatureSet\nimport pandas as pd\n\n# Grab a FeatureSet and pull some of the EDA Stats\nmy_features = FeatureSet('test_features')\n\n# Grab some of the EDA Stats\ncorr_data = my_features.correlations()\ncorr_df = pd.DataFrame(corr_data)\nprint(corr_df)\n\n# Get some outliers\noutliers = my_features.outliers()\npprint(outliers.head())\n\n# Full set of EDA Stats\neda_stats = my_features.column_stats()\npprint(eda_stats)\n
Output age food_pizza food_steak food_sushi food_tacos height id iq_score\nage NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513\nfood_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378\nfood_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477\nfood_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033\nfood_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364\nheight 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210\nid -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020\niq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN\n\n name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group\n0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low\n1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low\n2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low\n3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high\n\n<lots of EDA data and statistics>\n
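One way to read a correlation matrix like the one above is to pull out the most strongly related pair of columns. The sketch below is a stand-alone illustration: it assumes the correlation data can be treated as a nested dict keyed by column name (which is what the pd.DataFrame(corr_data) call suggests, but is not confirmed here), it uses only a subset of the printed values, and strongest_pair is a hypothetical helper.

```python
from itertools import combinations

# Subset of the correlation values printed above (self-correlations omitted)
corr = {
    'age': {'height': 0.439678, 'iq_score': -0.295513, 'food_sushi': 0.263048},
    'height': {'age': 0.439678, 'iq_score': -0.655210, 'food_sushi': 0.536396},
    'iq_score': {'age': -0.295513, 'height': -0.655210, 'food_sushi': -0.435033},
    'food_sushi': {'age': 0.263048, 'height': 0.536396, 'iq_score': -0.435033},
}


def strongest_pair(corr_data):
    # Return (col_a, col_b, value) for the pair with the largest absolute correlation
    best = None
    for a, b in combinations(sorted(corr_data), 2):
        value = corr_data[a][b]
        if best is None or abs(value) > abs(best[2]):
            best = (a, b, value)
    return best


print(strongest_pair(corr))
```

Comparing absolute values matters here because the strongest relationship in this subset is a negative one.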
Query a FeatureSet
All SageWorks FeatureSets have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.
featureset_query.py
from sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Make some queries using the Athena backend\ndf = my_features.query(\"select * from abalone_features where height > .3\")\nprint(df.head())\n\ndf = my_features.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a Model from a FeatureSet
featureset_to_model.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet('test_features')\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR, \n# UNSUPERVISED, or TRANSFORMER\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
"},{"location":"api_classes/feature_set/#sageworks-ui","title":"SageWorks UI","text":"Whenever a FeatureSet is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: FeatureSets
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/meta/","title":"Meta","text":"Meta Examples
Examples of using the Meta class are listed at the bottom of this page Examples.
Meta: A class that provides high level information and summaries of Cloud Platform Artifacts. The Meta class provides 'account' information, configuration, etc. It also provides metadata for Artifacts, such as Data Sources, Feature Sets, Models, and Endpoints.
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta","title":"Meta
","text":" Bases: CloudMeta
Meta: A class that provides metadata functionality for Cloud Platform Artifacts.
Common Usage
from sageworks.api import Meta\nmeta = Meta()\n\n# Get the AWS Account Info\nmeta.account()\nmeta.config()\n\n# These are 'list' methods\nmeta.etl_jobs()\nmeta.data_sources()\nmeta.feature_sets(details=True/False)\nmeta.models(details=True/False)\nmeta.endpoints()\nmeta.views()\n\n# These are 'describe' methods\nmeta.data_source(\"abalone_data\")\nmeta.feature_set(\"abalone_features\")\nmeta.model(\"abalone-regression\")\nmeta.endpoint(\"abalone-endpoint\")\n
Source code in src/sageworks/api/meta.py
class Meta(CloudMeta):\n \"\"\"Meta: A class that provides metadata functionality for Cloud Platform Artifacts.\n\n Common Usage:\n ```python\n from sageworks.api import Meta\n meta = Meta()\n\n # Get the AWS Account Info\n meta.account()\n meta.config()\n\n # These are 'list' methods\n meta.etl_jobs()\n meta.data_sources()\n meta.feature_sets(details=True/False)\n meta.models(details=True/False)\n meta.endpoints()\n meta.views()\n\n # These are 'describe' methods\n meta.data_source(\"abalone_data\")\n meta.feature_set(\"abalone_features\")\n meta.model(\"abalone-regression\")\n meta.endpoint(\"abalone-endpoint\")\n ```\n \"\"\"\n\n def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n\n def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n\n def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. 
Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n\n def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n\n def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n\n def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n\n def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n\n def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. 
Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(table_name=data_source_name, database=database)\n\n def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_group_name=feature_set_name)\n\n def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_group_name=model_name)\n\n def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n\n def __repr__(self):\n return f\"Meta()\\n\\t{super().__repr__()}\"\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.account","title":"account()
","text":"Cloud Platform Account Info
Returns:
dict: Cloud Platform Account Info
Source code in src/sageworks/api/meta.py
def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
dict: The current SageWorks Configuration
Source code in src/sageworks/api/meta.py
def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_source","title":"data_source(data_source_name, database='sageworks')
","text":"Get the details of a specific Data Source
Parameters:
data_source_name (str, required): The name of the Data Source
database (str, default 'sageworks'): The Glue database
Returns:
Union[dict, None]: The details of the Data Source (None if not found)
Source code in src/sageworks/api/meta.py
def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(table_name=data_source_name, database=database)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources deployed in the Cloud Platform
Returns:
pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform
Source code in src/sageworks/api/meta.py
def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoint","title":"endpoint(endpoint_name)
","text":"Get the details of a specific Endpoint
Parameters:
endpoint_name (str, required): The name of the Endpoint
Returns:
Union[dict, None]: The details of the Endpoint (None if not found)
Source code in src/sageworks/api/meta.py
def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints","title":"endpoints()
","text":"Get a summary of the Endpoints deployed in the Cloud Platform
Returns:
pd.DataFrame: A summary of the Endpoints in the Cloud Platform
Source code in src/sageworks/api/meta.py
def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.etl_jobs","title":"etl_jobs()
","text":"Get summary data about Extract, Transform, Load (ETL) Jobs
Returns:
pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform
Source code in src/sageworks/api/meta.py
def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_set","title":"feature_set(feature_set_name)
","text":"Get the details of a specific Feature Set
Parameters:
feature_set_name (str, required): The name of the Feature Set
Returns:
Union[dict, None]: The details of the Feature Set (None if not found)
Source code in src/sageworks/api/meta.py
def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_group_name=feature_set_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets","title":"feature_sets(details=False)
","text":"Get a summary of the Feature Sets deployed in the Cloud Platform
Parameters:
details (bool, default False): Include detailed information
Returns:
pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform
Source code in src/sageworks/api/meta.py
def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_job","title":"glue_job(job_name)
","text":"Get the details of a specific Glue Job
Parameters:
job_name (str, required): The name of the Glue Job
Returns:
Union[dict, None]: The details of the Glue Job (None if not found)
Source code in src/sageworks/api/meta.py
def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming raw data
Returns:
pd.DataFrame: A summary of the incoming raw data
Source code in src/sageworks/api/meta.py
def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.model","title":"model(model_name)
","text":"Get the details of a specific Model
Parameters:
model_name (str, required): The name of the Model
Returns:
Union[dict, None]: The details of the Model (None if not found)
Source code in src/sageworks/api/meta.py
def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_group_name=model_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models","title":"models(details=False)
","text":"Get a summary of the Models deployed in the Cloud Platform
Parameters:
details (bool, default False): Include detailed information
Returns:
pd.DataFrame: A summary of the Models deployed in the Cloud Platform
Source code in src/sageworks/api/meta.py
def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.views","title":"views(database='sageworks')
","text":"Get a summary of all the Views, for the given database, in AWS
Parameters:
database (str, default 'sageworks'): Glue database
Returns:
pd.DataFrame: A summary of all the Views, for the given database, in AWS
Source code in src/sageworks/api/meta.py
def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n
"},{"location":"api_classes/meta/#examples","title":"Examples","text":"These examples show how to use the Meta()
class to pull lists of artifacts from AWS: DataSources, FeatureSets, Models, Endpoints, and more. If you're building a web interface plugin, the Meta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the Meta()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
meta = Meta()\nmodel_df = meta.models()\nmodel_df\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
meta_list_models.py
from pprint import pprint\nfrom sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more details data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names:\n pprint(meta.model(name))\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
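The 'describe' methods such as meta.model() are documented as returning None when the artifact is not found, so a loop over names may want to skip missing entries rather than print or index into None. The following is a minimal sketch of that pattern. Both collect_details and fake_describe are hypothetical names used only for illustration; fake_describe stands in for a call like meta.model so the example runs without an AWS account.

```python
from typing import Callable, Dict, List, Optional


def collect_details(names: List[str], describe: Callable[[str], Optional[dict]]) -> Dict[str, dict]:
    # Call the describe function for each name and keep only the hits
    found = {}
    for name in names:
        details = describe(name)
        if details is None:
            # Describe methods return None when the artifact is not found
            continue
        found[name] = details
    return found


# Hypothetical stand-in for meta.model, backed by a plain dict
_fake_store = {'wine-classification': {'Status': 'Completed'}}


def fake_describe(name: str) -> Optional[dict]:
    return _fake_store.get(name)


result = collect_details(['wine-classification', 'missing-model'], fake_describe)
print(sorted(result))
```

With a real Meta instance the same helper could, in principle, be called as collect_details(model_names, meta.model).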
Getting Model Performance Metrics
meta_model_metrics.py
from sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more details data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names[:5]:\n model_details = meta.model(name)\n print(f\"\\n\\nModel: {name}\")\n performance_metrics = model_details[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n print(f\"\\tPerformance Metrics: {performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
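The classification metrics above come back as one dict per class, so summarizing them takes a small amount of post-processing. The sketch below copies the per-class values from the wine-classification output and computes macro and support-weighted averages. The helper names macro_average and weighted_average are hypothetical, and the list-of-dicts shape is assumed from the printed output rather than from a documented schema.

```python
# Per-class metrics copied from the wine-classification output above
metrics = [
    {'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12},
    {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14},
    {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9},
]


def macro_average(rows, key):
    # Unweighted mean across classes: every class counts equally
    return sum(row[key] for row in rows) / len(rows)


def weighted_average(rows, key):
    # Mean weighted by 'support' (the number of rows in each class)
    total = sum(row['support'] for row in rows)
    return sum(row[key] * row['support'] for row in rows) / total


total_support = sum(row['support'] for row in metrics)
print(f'Rows evaluated: {total_support}')
print(f'Macro F-score: {macro_average(metrics, "fscore")}')
print(f'Weighted F-score: {weighted_average(metrics, "fscore")}')
```

The two averages only differ when classes have different scores; with imbalanced classes the weighted form tends to track the larger classes.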
List the Endpoints in AWS
meta_list_endpoints.py
from pprint import pprint\nfrom sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Endpoints\nmeta = Meta()\nendpoint_df = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoint_df)}\")\nprint(endpoint_df)\n\n# Get more details data on the Endpoints\nendpoint_names = endpoint_df[\"Name\"].tolist()\nfor name in endpoint_names:\n pprint(meta.endpoint(name))\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\n<lots of details about endpoints>\n
Not Finding some particular AWS Data?
Some SageWorks Meta API methods, such as feature_sets() and models(), also accept a details=True
argument, so make sure to check those out.
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
Model: Manages AWS Model Package/Group creation and management.
Models are automatically set up and provisioned for deployment into AWS. Models can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/model/#sageworks.api.model.Model","title":"Model
","text":" Bases: ModelCore
Model: SageWorks Model API Class.
Common Usage
my_model = Model(name)\nmy_model.details()\nmy_model.to_endpoint()\n
Source code in src/sageworks/api/model.py
class Model(ModelCore):\n \"\"\"Model: SageWorks Model API Class.\n\n Common Usage:\n ```python\n my_model = Model(name)\n my_model.details()\n my_model.to_endpoint()\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n\n def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False)\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.details","title":"details(**kwargs)
","text":"Retrieve the Model Details.
Returns:
dict: A dictionary of details about the Model
Source code in src/sageworks/api/model.py
def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.to_endpoint","title":"to_endpoint(name=None, tags=None, serverless=True)
","text":"Create an Endpoint from the Model.
Parameters:
name (str, default None): Set the name for the endpoint. If not specified, an automatic name will be generated
tags (list, default None): Set the tags for the endpoint. If not specified, automatic tags will be generated
serverless (bool, default True): Set the endpoint to be serverless
Returns:
Endpoint: The Endpoint created from the Model
Source code in src/sageworks/api/model.py
def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False)\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a Model from a FeatureSet
featureset_to_model.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"test_features\")\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR (XGBoost is default)\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
Use a specific Scikit-Learn Model
featureset_to_knn.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Transform FeatureSet into KNN Regression Model\n# Note: model_class can be any scikit-learn model \n# \"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", etc\nmy_model = my_features.to_model(\n model_class=\"KNeighborsRegressor\",\n target_column=\"class_number_of_rings\",\n name=\"abalone-knn-reg\",\n description=\"Abalone KNN Regression\",\n tags=[\"abalone\", \"knn\"],\n train_all_data=True,\n)\npprint(my_model.details())\n
Another Scikit-Learn Example
featureset_to_rfc.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"wine_features\")\n\n# Using a Scikit-Learn Model\n# Note: model_class can be any scikit-learn model (\"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", \"Ridge\", \"Lasso\", \"SVC\", \"SVR\", etc...)\nmy_model = my_features.to_model(\n model_class=\"RandomForestClassifier\",\n target_column=\"wine_class\",\n name=\"wine-rfc-class\",\n description=\"Wine RandomForest Classification\",\n tags=[\"wine\", \"rfc\"]\n)\npprint(my_model.details())\n
Create an Endpoint from a Model
Endpoint Costs
Serverless endpoints are a great option because they incur no AWS charges when not running. A realtime endpoint has less latency (no cold start), but AWS charges an hourly fee, which can add up quickly!
model_to_endpoint.py
from sageworks.api.model import Model\n\n# Grab the abalone regression Model\nmodel = Model(\"abalone-regression\")\n\n# By default, an Endpoint is serverless, you can\n# make a realtime endpoint with serverless=False\nmodel.to_endpoint(name=\"abalone-regression-end\",\n tags=[\"abalone\", \"regression\"],\n serverless=True)\n
Model Health Check and Metrics
model_metrics.py
from sageworks.api.model import Model\n\n# Grab the abalone-regression Model\nmodel = Model(\"abalone-regression\")\n\n# Perform a health check on the model\n# Note: The health_check() method returns 'issues' if there are any\n# problems, so if there are no issues, the model is healthy\nhealth_issues = model.health_check()\nif not health_issues:\n print(\"Model is Healthy\")\nelse:\n print(\"Model has issues\")\n print(health_issues)\n\n# Get the model metrics and regression predictions\nprint(model.model_metrics())\nprint(model.regression_predictions())\n
Output
Model is Healthy\n metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n
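To make the metric names concrete, the sketch below recomputes MAE and RMSE by hand from the first five prediction rows shown above, using the standard textbook definitions. It is an illustration only: the helper names are hypothetical, whether SageWorks computes its metrics exactly this way is an assumption, and five rows will not reproduce the reported values, which cover the full set of predictions.

```python
import math

# First five (actual, prediction) rows from the regression predictions output above
rows = [(9, 8.648378), (11, 9.717787), (11, 10.933070), (10, 9.899738), (9, 10.014504)]


def mae(pairs):
    # Mean Absolute Error: average size of the miss, ignoring direction
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)


def rmse(pairs):
    # Root Mean Squared Error: squares the misses first, so large errors weigh more
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))


print(f'MAE on 5 rows:  {mae(rows):.4f}')
print(f'RMSE on 5 rows: {rmse(rows):.4f}')
```

RMSE is never smaller than MAE on the same data, which matches the ordering of the two values in the metrics table.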
"},{"location":"api_classes/model/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates an AWS Model Package Group and an AWS Model Package. These model artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
SageWorks Dashboard: Models
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/monitor/","title":"Monitor","text":"Monitor Examples
Examples of using the Monitor class are listed at the bottom of this page Examples.
Monitor: Manages AWS Endpoint Monitor creation and deployment. Endpoint Monitors are set up and provisioned for deployment into AWS. Monitors can be viewed in the AWS SageMaker interfaces or in the SageWorks Dashboard UI, which provides additional monitor details and performance metrics.
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor","title":"Monitor
","text":" Bases: MonitorCore
Monitor: SageWorks Monitor API Class
Common Usage
mon = Endpoint(name).get_monitor() # Pull from endpoint OR\nmon = Monitor(name) # Create using Endpoint Name\nmon.summary()\nmon.details()\n\n# One time setup methods\nmon.add_data_capture()\nmon.create_baseline()\nmon.create_monitoring_schedule()\n\n# Pull information from the monitor\nbaseline_df = mon.get_baseline()\nconstraints_df = mon.get_constraints()\nstats_df = mon.get_statistics()\ninput_df, output_df = mon.get_latest_data_capture()\n
Source code in src/sageworks/api/monitor.py
class Monitor(MonitorCore):\n \"\"\"Monitor: SageWorks Monitor API Class\n\n Common Usage:\n ```\n mon = Endpoint(name).get_monitor() # Pull from endpoint OR\n mon = Monitor(name) # Create using Endpoint Name\n mon.summary()\n mon.details()\n\n # One time setup methods\n mon.add_data_capture()\n mon.create_baseline()\n mon.create_monitoring_schedule()\n\n # Pull information from the monitor\n baseline_df = mon.get_baseline()\n constraints_df = mon.get_constraints()\n stats_df = mon.get_statistics()\n input_df, output_df = mon.get_latest_data_capture()\n ```\n \"\"\"\n\n def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n\n def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for this Monitor/endpoint.
Parameters:
Name Type Description Default
capture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/api/monitor.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring
Parameters:
Name Type Description Default
recreate
bool
If True, recreate the baseline even if it already exists
False
Notes
This will create/write three files to the baseline_dir:
- baseline.csv
- constraints.json
- statistics.json
Source code in src/sageworks/api/monitor.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint.
Parameters:
Name Type Description Default
schedule
str
The schedule for the monitoring job (hourly or daily, defaults to hourly).
'hourly'
recreate
bool
If True, recreate the monitoring schedule even if it already exists.
False
Source code in src/sageworks/api/monitor.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.details","title":"details()
","text":"Monitor Details
Returns:
Name Type Description
dict
dict
A dictionary of details about the Monitor
Source code in src/sageworks/api/monitor.py
def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code in src/sageworks/api/monitor.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code in src/sageworks/api/monitor.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture input and output from S3.
Returns:
Type Description
DataFrame (input), DataFrame (output)
Flattened and processed DataFrames for input and output data.
Source code in src/sageworks/api/monitor.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code in src/sageworks/api/monitor.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.summary","title":"summary()
","text":"Monitor Summary
Returns:
Name Type Description
dict
dict
A dictionary of summary information about the Monitor
Source code in src/sageworks/api/monitor.py
def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n
"},{"location":"api_classes/monitor/#examples","title":"Examples","text":"Initial Setup of the Endpoint Monitor
monitor_setup.py
from sageworks.api.monitor import Monitor\n\n# Create an Endpoint Monitor Class and perform initial Setup\nendpoint_name = \"abalone-regression-end-rt\"\nmon = Monitor(endpoint_name)\n\n# Add data capture to the endpoint\nmon.add_data_capture(capture_percentage=100)\n\n# Create a baseline for monitoring\nmon.create_baseline()\n\n# Set up the monitoring schedule\nmon.create_monitoring_schedule(schedule=\"hourly\")\n
Pulling Information from an Existing Monitor
monitor_usage.py
from sageworks.api.monitor import Monitor\nfrom sageworks.api.endpoint import Endpoint\n\n# Construct a Monitor Class in one of Two Ways\nmon = Endpoint(\"abalone-regression-end-rt\").get_monitor()\nmon = Monitor(\"abalone-regression-end-rt\")\n\n# Check the summary and details of the monitoring class\nmon.summary()\nmon.details()\n\n# Check the baseline outputs (baseline, constraints, statistics)\n# Note: these getters return None if the baseline doesn't exist yet\nbase_df = mon.get_baseline()\nbase_df.head()\n\nconstraints_df = mon.get_constraints()\nconstraints_df.head()\n\nstatistics_df = mon.get_statistics()\nstatistics_df.head()\n\n# Get the latest data capture (inputs and outputs)\ninput_df, output_df = mon.get_latest_data_capture()\ninput_df.head()\noutput_df.head()\n
"},{"location":"api_classes/monitor/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint Monitor. The Monitor status and outputs can be viewed in the SageMaker Console interfaces or in the SageWorks Dashboard UI. SageWorks will use the monitor to track various metrics including Data Quality, Model Bias, etc.
SageWorks Dashboard: Endpoints
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/overview/","title":"Overview","text":"Just Getting Started?
You're in the right place! The SageWorks API Classes are the best way to get started with SageWorks.
"},{"location":"api_classes/overview/#welcome-to-the-sageworks-api-classes","title":"Welcome to the SageWorks API Classes","text":"These classes provide high-level APIs for the SageWorks package that enable your team to build full AWS Machine Learning Pipelines. They handle all the details around updating and managing a complex set of AWS Services. Each class provides an essential component of the overall ML Pipeline. Simply combine the classes to build production-ready, AWS-powered machine learning pipelines.
from sageworks.api.data_source import DataSource\nfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model, ModelType\nfrom sageworks.api.endpoint import Endpoint\n\n# Create the abalone_data DataSource\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\")\n\n# Now create a FeatureSet\nds.to_features(\"abalone_features\")\n\n# Create the abalone_regression Model\nfs = FeatureSet(\"abalone_features\")\nfs.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n tags=[\"abalone\", \"regression\"],\n description=\"Abalone Regression Model\",\n)\n\n# Create the abalone_regression Endpoint\nmodel = Model(\"abalone-regression\")\nmodel.to_endpoint(name=\"abalone-regression-end\", tags=[\"abalone\", \"regression\"])\n\n# Now we'll run inference on the endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Get a DataFrame of data (not used to train) and run predictions\nathena_table = fs.view(\"training\").table\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = FALSE\")\nresults = endpoint.predict(df)\nprint(results[[\"class_number_of_rings\", \"prediction\"]])\n
Output
Processing...\n class_number_of_rings prediction\n0 12 10.477794\n1 11 11.11835\n2 14 13.605763\n3 12 11.744759\n4 17 15.55189\n.. ... ...\n826 7 7.981503\n827 11 11.246113\n828 9 9.592911\n829 6 6.129388\n830 8 7.628252\n
Full AWS ML Pipeline Achievement Unlocked!
Bing! You just built and deployed a full AWS Machine Learning Pipeline. You can now use the SageWorks Dashboard web interface to inspect your AWS artifacts. A comprehensive set of Exploratory Data Analysis techniques and Model Performance Metrics are available for your entire team to review, inspect and interact with.
Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Examples
Examples of using the ParameterStore class are listed at the bottom of this page Examples.
ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore","title":"ParameterStore
","text":"ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.
Common Usage
params = ParameterStore()\n\n# List Parameters\nparams.list()\n\n['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n# Add Key\nparams.upsert(\"key\", \"value\")\nvalue = params.get(\"key\")\n\n# Add any data (lists, dictionaries, etc..)\nmy_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\nparams.upsert(\"my_data\", my_data)\n\n# Retrieve data\nreturn_value = params.get(\"my_data\")\npprint(return_value)\n\n{'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n# Delete parameters\nparams.delete(\"my_data\")\n
Source code in src/sageworks/api/parameter_store.py
class ParameterStore:\n \"\"\"ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.\n\n Common Usage:\n ```python\n params = ParameterStore()\n\n # List Parameters\n params.list()\n\n ['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n # Add Key\n params.upsert(\"key\", \"value\")\n value = params.get(\"key\")\n\n # Add any data (lists, dictionaries, etc..)\n my_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\n params.upsert(\"my_data\", my_data)\n\n # Retrieve data\n return_value = params.get(\"my_data\")\n pprint(return_value)\n\n {'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n # Delete parameters\n param_store.delete(\"my_data\")\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"ParameterStore Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize a SageWorks Session (to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSSession().boto3_session\n\n # Create a Systems Manager (SSM) client for Parameter Store operations\n self.ssm_client = self.boto3_session.client(\"ssm\")\n\n def list(self) -> list:\n \"\"\"List all parameters in the AWS Parameter Store.\n\n Returns:\n list: A list of parameter names and details.\n \"\"\"\n try:\n # Set up parameters for our search\n params = {\"MaxResults\": 50}\n\n # Initialize the list to collect parameter names\n all_parameters = []\n\n # Make the initial call to describe parameters\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the initial response\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n # Continue to paginate if there's a NextToken\n while \"NextToken\" in response:\n # Update the parameters with the NextToken for subsequent calls\n params[\"NextToken\"] = response[\"NextToken\"]\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the subsequent 
responses\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n except Exception as e:\n self.log.error(f\"Failed to list parameters: {e}\")\n return []\n\n # Return the aggregated list of parameter names\n return all_parameters\n\n def get(self, name: str, warn: bool = True, decrypt: bool = True) -> Union[str, list, dict, None]:\n \"\"\"Retrieve a parameter value from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to retrieve.\n warn (bool): Whether to log a warning if the parameter is not found.\n decrypt (bool): Whether to decrypt secure string parameters.\n\n Returns:\n Union[str, list, dict, None]: The value of the parameter or None if not found.\n \"\"\"\n try:\n # Retrieve the parameter from Parameter Store\n response = self.ssm_client.get_parameter(Name=name, WithDecryption=decrypt)\n value = response[\"Parameter\"][\"Value\"]\n\n # Auto-detect and decompress if needed\n if value.startswith(\"COMPRESSED:\"):\n # Base64 decode and decompress\n self.log.important(f\"Decompressing parameter '{name}'...\")\n compressed_value = base64.b64decode(value[len(\"COMPRESSED:\") :])\n value = zlib.decompress(compressed_value).decode(\"utf-8\")\n\n # Attempt to parse the value back to its original type\n try:\n parsed_value = json.loads(value)\n return parsed_value\n except (json.JSONDecodeError, TypeError):\n # If parsing fails, return the value as is (assumed to be a simple string)\n return value\n\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] == \"ParameterNotFound\":\n if warn:\n self.log.warning(f\"Parameter '{name}' not found\")\n else:\n self.log.error(f\"Failed to get parameter '{name}': {e}\")\n return None\n\n def upsert(self, name: str, value, overwrite: bool = True):\n \"\"\"Insert or update a parameter in the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter.\n value (str | list | dict): The value of the parameter.\n overwrite (bool): Whether to overwrite an 
existing parameter (default: True)\n \"\"\"\n try:\n\n # Anything that's not a string gets converted to JSON\n if not isinstance(value, str):\n value = json.dumps(value)\n\n # Check size and compress if necessary\n if len(value) > 4096:\n self.log.warning(f\"Parameter {name} exceeds 4KB ({len(value)} Bytes) Compressing...\")\n compressed_value = zlib.compress(value.encode(\"utf-8\"), level=9)\n encoded_value = \"COMPRESSED:\" + base64.b64encode(compressed_value).decode(\"utf-8\")\n\n # Report on the size of the compressed value\n compressed_size = len(compressed_value)\n if compressed_size > 4096:\n doc_link = \"https://supercowpowers.github.io/sageworks/api_classes/df_store\"\n self.log.error(f\"Compressed size {compressed_size} bytes, cannot store > 4KB\")\n self.log.error(f\"For larger data use the DFStore() class ({doc_link})\")\n return\n\n # Insert or update the compressed parameter in Parameter Store\n try:\n # Insert or update the compressed parameter in Parameter Store\n self.ssm_client.put_parameter(Name=name, Value=encoded_value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully with compression.\")\n return\n except Exception as e:\n self.log.critical(f\"Failed to add/update compressed parameter '{name}': {e}\")\n raise\n\n # Insert or update the parameter normally if under 4KB\n self.ssm_client.put_parameter(Name=name, Value=value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully.\")\n\n except Exception as e:\n self.log.critical(f\"Failed to add/update parameter '{name}': {e}\")\n raise\n\n def delete(self, name: str):\n \"\"\"Delete a parameter from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to delete.\n \"\"\"\n try:\n # Delete the parameter from Parameter Store\n self.ssm_client.delete_parameter(Name=name)\n self.log.info(f\"Parameter '{name}' deleted successfully.\")\n except Exception as e:\n 
self.log.error(f\"Failed to delete parameter '{name}': {e}\")\n\n def __repr__(self):\n \"\"\"Return a string representation of the ParameterStore object.\"\"\"\n return \"\\n\".join(self.list())\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.__init__","title":"__init__()
","text":"ParameterStore Init Method
Source code in src/sageworks/api/parameter_store.py
def __init__(self):\n \"\"\"ParameterStore Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize a SageWorks Session (to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSSession().boto3_session\n\n # Create a Systems Manager (SSM) client for Parameter Store operations\n self.ssm_client = self.boto3_session.client(\"ssm\")\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.__repr__","title":"__repr__()
","text":"Return a string representation of the ParameterStore object.
Source code in src/sageworks/api/parameter_store.py
def __repr__(self):\n \"\"\"Return a string representation of the ParameterStore object.\"\"\"\n return \"\\n\".join(self.list())\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.delete","title":"delete(name)
","text":"Delete a parameter from the AWS Parameter Store.
Parameters:
Name Type Description Default
name
str
The name of the parameter to delete.
required
Source code in src/sageworks/api/parameter_store.py
def delete(self, name: str):\n \"\"\"Delete a parameter from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to delete.\n \"\"\"\n try:\n # Delete the parameter from Parameter Store\n self.ssm_client.delete_parameter(Name=name)\n self.log.info(f\"Parameter '{name}' deleted successfully.\")\n except Exception as e:\n self.log.error(f\"Failed to delete parameter '{name}': {e}\")\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.get","title":"get(name, warn=True, decrypt=True)
","text":"Retrieve a parameter value from the AWS Parameter Store.
Parameters:
Name Type Description Default
name
str
The name of the parameter to retrieve.
required
warn
bool
Whether to log a warning if the parameter is not found.
True
decrypt
bool
Whether to decrypt secure string parameters.
True
Returns:
Type Description
Union[str, list, dict, None]
Union[str, list, dict, None]: The value of the parameter or None if not found.
Source code in src/sageworks/api/parameter_store.py
def get(self, name: str, warn: bool = True, decrypt: bool = True) -> Union[str, list, dict, None]:\n \"\"\"Retrieve a parameter value from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to retrieve.\n warn (bool): Whether to log a warning if the parameter is not found.\n decrypt (bool): Whether to decrypt secure string parameters.\n\n Returns:\n Union[str, list, dict, None]: The value of the parameter or None if not found.\n \"\"\"\n try:\n # Retrieve the parameter from Parameter Store\n response = self.ssm_client.get_parameter(Name=name, WithDecryption=decrypt)\n value = response[\"Parameter\"][\"Value\"]\n\n # Auto-detect and decompress if needed\n if value.startswith(\"COMPRESSED:\"):\n # Base64 decode and decompress\n self.log.important(f\"Decompressing parameter '{name}'...\")\n compressed_value = base64.b64decode(value[len(\"COMPRESSED:\") :])\n value = zlib.decompress(compressed_value).decode(\"utf-8\")\n\n # Attempt to parse the value back to its original type\n try:\n parsed_value = json.loads(value)\n return parsed_value\n except (json.JSONDecodeError, TypeError):\n # If parsing fails, return the value as is (assumed to be a simple string)\n return value\n\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] == \"ParameterNotFound\":\n if warn:\n self.log.warning(f\"Parameter '{name}' not found\")\n else:\n self.log.error(f\"Failed to get parameter '{name}': {e}\")\n return None\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.list","title":"list()
","text":"List all parameters in the AWS Parameter Store.
Returns:
Name Type Description
list
list
A list of parameter names and details.
Source code in src/sageworks/api/parameter_store.py
def list(self) -> list:\n \"\"\"List all parameters in the AWS Parameter Store.\n\n Returns:\n list: A list of parameter names and details.\n \"\"\"\n try:\n # Set up parameters for our search\n params = {\"MaxResults\": 50}\n\n # Initialize the list to collect parameter names\n all_parameters = []\n\n # Make the initial call to describe parameters\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the initial response\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n # Continue to paginate if there's a NextToken\n while \"NextToken\" in response:\n # Update the parameters with the NextToken for subsequent calls\n params[\"NextToken\"] = response[\"NextToken\"]\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the subsequent responses\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n except Exception as e:\n self.log.error(f\"Failed to list parameters: {e}\")\n return []\n\n # Return the aggregated list of parameter names\n return all_parameters\n
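The `NextToken` loop in `list()` can be exercised without AWS access. The sketch below restates the same pagination pattern as a standalone function and runs it against a small in-memory stand-in. `list_parameter_names` and `FakeSSMClient` are hypothetical names introduced only for this illustration; the fake implements just enough of `describe_parameters` to return pages.

```python
def list_parameter_names(ssm_client, page_size=50):
    """Collect every parameter name by following NextToken until it is absent."""
    names = []
    kwargs = {"MaxResults": page_size}
    while True:
        response = ssm_client.describe_parameters(**kwargs)
        names.extend(param["Name"] for param in response["Parameters"])

        # A missing NextToken means this was the last page
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return names


class FakeSSMClient:
    """Hypothetical stand-in for a boto3 SSM client (local testing only)."""

    def __init__(self, names, per_page=2):
        self._names = names
        self._per_page = per_page

    def describe_parameters(self, MaxResults=50, NextToken=None):
        # The token is simply the index of the next item to return
        start = int(NextToken) if NextToken else 0
        end = start + min(MaxResults, self._per_page)
        response = {"Parameters": [{"Name": n} for n in self._names[start:end]]}
        if end < len(self._names):
            response["NextToken"] = str(end)
        return response


fake = FakeSSMClient(["/sageworks/a", "/sageworks/b", "/sageworks/c"])
names = list_parameter_names(fake)
print(names)
```

With a real client, boto3's built-in paginator (`ssm_client.get_paginator("describe_parameters")`) is another way to express the same loop.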
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.upsert","title":"upsert(name, value, overwrite=True)
","text":"Insert or update a parameter in the AWS Parameter Store.
Parameters:
Name Type Description Default
name
str
The name of the parameter.
required
value
str | list | dict
The value of the parameter.
required
overwrite
bool
Whether to overwrite an existing parameter (default: True)
True
Source code in src/sageworks/api/parameter_store.py
def upsert(self, name: str, value, overwrite: bool = True):\n \"\"\"Insert or update a parameter in the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter.\n value (str | list | dict): The value of the parameter.\n overwrite (bool): Whether to overwrite an existing parameter (default: True)\n \"\"\"\n try:\n\n # Anything that's not a string gets converted to JSON\n if not isinstance(value, str):\n value = json.dumps(value)\n\n # Check size and compress if necessary\n if len(value) > 4096:\n self.log.warning(f\"Parameter {name} exceeds 4KB ({len(value)} Bytes) Compressing...\")\n compressed_value = zlib.compress(value.encode(\"utf-8\"), level=9)\n encoded_value = \"COMPRESSED:\" + base64.b64encode(compressed_value).decode(\"utf-8\")\n\n # Report on the size of the compressed value\n compressed_size = len(compressed_value)\n if compressed_size > 4096:\n doc_link = \"https://supercowpowers.github.io/sageworks/api_classes/df_store\"\n self.log.error(f\"Compressed size {compressed_size} bytes, cannot store > 4KB\")\n self.log.error(f\"For larger data use the DFStore() class ({doc_link})\")\n return\n\n # Insert or update the compressed parameter in Parameter Store\n try:\n # Insert or update the compressed parameter in Parameter Store\n self.ssm_client.put_parameter(Name=name, Value=encoded_value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully with compression.\")\n return\n except Exception as e:\n self.log.critical(f\"Failed to add/update compressed parameter '{name}': {e}\")\n raise\n\n # Insert or update the parameter normally if under 4KB\n self.ssm_client.put_parameter(Name=name, Value=value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully.\")\n\n except Exception as e:\n self.log.critical(f\"Failed to add/update parameter '{name}': {e}\")\n raise\n
"},{"location":"api_classes/parameter_store/#bypassing-the-4k-limit","title":"Bypassing the 4k Limit","text":"AWS Parameter Store has a 4KB limit on values. The SageWorks class works around this limit by detecting large values (strings, data, whatever) and compressing them on the fly. Decompression is also handled automatically, so for larger data simply use the upsert()
and get()
methods and it will all just work. Values that are still larger than 4KB after compression are not stored; for those, use the DFStore() class.
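The compress-on-write and decompress-on-read behavior can be tried locally with only the standard library. This is a minimal sketch modeled on the `upsert()` and `get()` source above, not the SageWorks implementation itself: `encode_value` and `decode_value` are hypothetical names, and nothing is sent to AWS. One deliberate difference, flagged as an assumption: the sketch checks the length of the final base64 string rather than the compressed bytes, since base64 grows the data by roughly a third.

```python
import base64
import json
import zlib

PREFIX = "COMPRESSED:"  # marker used in the source above
LIMIT = 4096            # 4KB size limit


def encode_value(value):
    """Serialize a value, compressing it when it is over the limit."""
    if not isinstance(value, str):
        value = json.dumps(value)
    if len(value) <= LIMIT:
        return value

    compressed = zlib.compress(value.encode("utf-8"), level=9)
    encoded = PREFIX + base64.b64encode(compressed).decode("utf-8")

    # Assumption of this sketch: reject values that are still too large
    if len(encoded) > LIMIT:
        raise ValueError(f"Encoded size {len(encoded)} still exceeds {LIMIT}")
    return encoded


def decode_value(stored):
    """Reverse encode_value: decompress if marked, then try to parse JSON."""
    if stored.startswith(PREFIX):
        raw = base64.b64decode(stored[len(PREFIX):])
        stored = zlib.decompress(raw).decode("utf-8")
    try:
        return json.loads(stored)
    except (json.JSONDecodeError, TypeError):
        # Not JSON, so treat it as a plain string
        return stored


# A highly repetitive payload well over 4KB as JSON
big = {"list": [1, 2, 3] * 2000}
stored = encode_value(big)
restored = decode_value(stored)
print(len(json.dumps(big)), len(stored), restored == big)
```

Repetitive data like this compresses very well; data that does not compress under the limit is the case where the source above points to the DFStore class instead.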
These examples show how to use the ParameterStore()
class to list, add, and get parameters from the AWS Parameter Store Service.
SageWorks REPL
If you'd like to experiment with listing, adding, and getting data with the ParameterStore()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
params = ParameterStore()\n\n# List Parameters\nparams.list()\n\n['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n# Add Key\nparams.upsert(\"key\", \"value\")\nvalue = params.get(\"key\")\n\n# Add any data (lists, dictionaries, etc..)\nmy_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\nparams.upsert(\"my_data\", my_data)\n\n# Retrieve data\nreturn_value = params.get(\"my_data\")\npprint(return_value)\n\n{'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n# Delete parameters\nparams.delete(\"my_data\")\n
list()
not showing ALL parameters?
If you want access to ALL the parameters in the parameter store set prefix=None
and everything will show up.
params = ParameterStore(prefix=None)\nparams.list()\n<all the keys>\n
"},{"location":"api_classes/pipelines/","title":"Pipelines","text":"Pipeline Examples
Examples of using the Pipeline classes are listed at the bottom of this page Examples.
Pipelines store sequences of SageWorks transforms. So if you have a nightly ML workflow you can capture that as a Pipeline. Here's an example pipeline:
nightly_sol_pipeline_v1.json
{\n \"data_source\": {\n \"name\": \"nightly_data\",\n \"tags\": [\"solubility\", \"foo\"],\n \"s3_input\": \"s3://blah/blah.csv\"\n },\n \"feature_set\": {\n \"name\": \"nightly_features\",\n \"tags\": [\"blah\", \"blah\"],\n \"input\": \"nightly_data\",\n \"schema\": \"mol_descriptors_v1\"\n },\n \"model\": {\n \"name\": \"nightly_model\",\n \"tags\": [\"blah\", \"blah\"],\n \"features\": [\"col1\", \"col2\"],\n \"target\": \"sol\",\n \"input\": \"nightly_features\"\n },\n \"endpoint\": {\n ...\n }\n}\n
PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Pipeline: Manages the details around a SageWorks Pipeline, including Execution
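Because a pipeline is a chain of stages, each stage's input is expected to name the stage before it. The following is a minimal sketch of that idea using only the standard library. It assumes every stage stores its upstream name under an `input` key (as the output of `create_from_endpoint()` does); the `nightly_endpoint` name and the `check_stage_chain` helper are hypothetical and not part of SageWorks.

```python
import json

# Hypothetical pipeline document that mirrors the shape of the example above
PIPELINE_JSON = """
{
  "data_source": {"name": "nightly_data", "input": "s3://blah/blah.csv"},
  "feature_set": {"name": "nightly_features", "input": "nightly_data"},
  "model": {"name": "nightly_model", "input": "nightly_features"},
  "endpoint": {"name": "nightly_endpoint", "input": "nightly_model"}
}
"""

STAGE_ORDER = ["data_source", "feature_set", "model", "endpoint"]


def check_stage_chain(pipeline: dict) -> list[str]:
    """Return a list of problems where a stage's input is not the previous stage's name."""
    problems = []
    previous_name = None
    for stage in STAGE_ORDER:
        if stage not in pipeline:
            problems.append(f"missing stage: {stage}")
            previous_name = None
            continue
        config = pipeline[stage]
        # The first stage reads from external data, so there is nothing to compare against
        if previous_name is not None and config.get("input") != previous_name:
            problems.append(f"{stage} input {config.get('input')!r} != {previous_name!r}")
        previous_name = config.get("name")
    return problems


pipeline = json.loads(PIPELINE_JSON)
print(check_stage_chain(pipeline))
```

A check like this can catch a renamed FeatureSet or Model before any AWS resources are touched.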
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager","title":"PipelineManager
","text":"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Common Usage
my_manager = PipelineManager()\nmy_manager.list_pipelines()\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Source code in src/sageworks/api/pipeline_manager.py
class PipelineManager:\n \"\"\"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.\n\n Common Usage:\n ```python\n my_manager = PipelineManager()\n my_manager.list_pipelines()\n abalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n my_manager.save_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto3_session.client(\"s3\")\n\n def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.important(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n\n # Create a new 
Pipeline from an Endpoint\n def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n\n # Publish a Pipeline to SageWorks\n def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n\n def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n\n # Save a Pipeline to a local file\n def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n 
pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n\n def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n\n def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.__init__","title":"__init__()
","text":"Pipeline Init Method
Source code in src/sageworks/api/pipeline_manager.py
def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto3_session.client(\"s3\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.create_from_endpoint","title":"create_from_endpoint(endpoint_name)
","text":"Create a Pipeline from an Endpoint
Parameters:
Name | Type | Description | Default
endpoint_name | str | The name of the Endpoint | required
Returns:
Name | Type | Description
dict | dict | A dictionary of the Pipeline
Source code in src/sageworks/api/pipeline_manager.py
def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n
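The backtracking in `create_from_endpoint()` follows each artifact's input to the artifact that produced it. Here is a rough, self-contained sketch of that walk. The `REGISTRY` dictionary stands in for the real SageWorks artifact classes, and both it and the `backtrack` helper (plus the S3 path) are hypothetical illustrations rather than library code.

```python
# Hypothetical in-memory registry: artifact name -> (kind, input name)
REGISTRY = {
    "abalone-regression-end": ("endpoint", "abalone-regression"),
    "abalone-regression": ("model", "abalone_features"),
    "abalone_features": ("feature_set", "abalone_data"),
    "abalone_data": ("data_source", "s3://bucket/abalone.csv"),
}


def backtrack(endpoint_name: str, registry: dict) -> dict:
    """Walk input links from an endpoint back to its data source."""
    pipeline = {}
    name = endpoint_name
    for kind in ["endpoint", "model", "feature_set", "data_source"]:
        actual_kind, input_name = registry[name]
        if actual_kind != kind:
            raise ValueError(f"expected a {kind}, found {actual_kind}: {name}")
        pipeline[kind] = {"name": name, "input": input_name}
        name = input_name  # the next artifact to visit is this one's input
    # Reorder so the stages read from source to endpoint
    return {k: pipeline[k] for k in ["data_source", "feature_set", "model", "endpoint"]}


print(backtrack("abalone-regression-end", REGISTRY))
```

Walking an explicit list of stage kinds avoids the `locals()` lookup used in the source, at the cost of a second pass to restore the source-to-endpoint order.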
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.delete_pipeline","title":"delete_pipeline(name)
","text":"Delete a Pipeline from S3
Parameters:
Name | Type | Description | Default
name | str | The name of the Pipeline to delete | required
Source code in src/sageworks/api/pipeline_manager.py
def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.list_pipelines","title":"list_pipelines()
","text":"List all the Pipelines in the S3 Bucket
Returns:
Name | Type | Description
list | list | A list of Pipeline names and details
Source code in src/sageworks/api/pipeline_manager.py
def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.important(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n
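The name extraction in `list_pipelines()` can be tried without AWS access by feeding it a hand-built response. The sketch below is a variation, not the library's code: the `summarize_pipelines` helper is hypothetical, and unlike the source it skips keys that do not end in `.json` (for example a bare `pipelines/` prefix object). Note also that a single `list_objects_v2` call returns at most 1,000 keys, so a large bucket would need pagination.

```python
from datetime import datetime, timezone


def summarize_pipelines(response: dict) -> list[dict]:
    """Build name/last_modified/size summaries from a list_objects_v2-style response."""
    summaries = []
    for obj in response.get("Contents", []):
        filename = obj["Key"].split("/")[-1]
        if not filename.endswith(".json"):
            continue  # skip the bare prefix key and any non-JSON objects
        summaries.append(
            {
                "name": filename[: -len(".json")],
                "last_modified": obj["LastModified"],
                "size": obj["Size"],
            }
        )
    return summaries


# Hand-built response for illustration only
stamp = datetime(2024, 4, 16, tzinfo=timezone.utc)
fake_response = {
    "Contents": [
        {"Key": "pipelines/", "LastModified": stamp, "Size": 0},
        {"Key": "pipelines/abalone_pipeline_v1.json", "LastModified": stamp, "Size": 445},
    ]
}
print(summarize_pipelines(fake_response))
```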
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.load_pipeline_from_file","title":"load_pipeline_from_file(filepath)
","text":"Load a Pipeline from a local file
Parameters:
Name | Type | Description | Default
filepath | str | The path of the Pipeline to load | required
Returns:
Name | Type | Description
dict | dict | The Pipeline loaded from the file
Source code in src/sageworks/api/pipeline_manager.py
def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline","title":"publish_pipeline(name, pipeline)
","text":"Save a Pipeline to S3
Parameters:
Name | Type | Description | Default
name | str | The name of the Pipeline | required
pipeline | dict | The Pipeline to save | required
Source code in src/sageworks/api/pipeline_manager.py
def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline_from_file","title":"publish_pipeline_from_file(filepath)
","text":"Publish a Pipeline to SageWorks from a local file
Parameters:
Name | Type | Description | Default
filepath | str | The path of the Pipeline to publish | required
Source code in src/sageworks/api/pipeline_manager.py
def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
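The pipeline name here is derived from the file path by splitting on `/` and removing `.json`. As an alternative sketch (not what the source does), `pathlib.Path.stem` gives the same result for ordinary paths and strips only the final suffix. The `pipeline_name_from_path` helper is a hypothetical name used for illustration.

```python
from pathlib import Path


def pipeline_name_from_path(filepath: str) -> str:
    """Derive a pipeline name by dropping the directories and the final suffix."""
    return Path(filepath).stem


print(pipeline_name_from_path("pipelines/abalone_pipeline_v1.json"))  # abalone_pipeline_v1
```

The two approaches differ on unusual names: `str.replace(".json", "")` removes every occurrence of `.json` in the filename, while `stem` removes only the last extension.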
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.save_pipeline_to_file","title":"save_pipeline_to_file(pipeline, filepath)
","text":"Save a Pipeline to a local file
Parameters:
Name | Type | Description | Default
pipeline | dict | The Pipeline to save | required
filepath | str | The path to save the Pipeline | required
Source code in src/sageworks/api/pipeline_manager.py
def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n
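Saving and loading are plain JSON round trips, so they can be exercised locally without a `PipelineManager`. The following sketch re-implements the two file methods as standalone functions (`save_pipeline` and `load_pipeline` are hypothetical names) and runs them against a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def save_pipeline(pipeline: dict, filepath: str) -> str:
    """Write the pipeline as JSON, appending .json if needed; return the final path."""
    if not filepath.endswith(".json"):
        filepath += ".json"
    with open(filepath, "w") as fp:
        json.dump(pipeline, fp, indent=4)
    return filepath


def load_pipeline(filepath: str) -> dict:
    """Read a pipeline back from a JSON file."""
    with open(filepath, "r") as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as tmp_dir:
    # The path is given without an extension to show the suffix being added
    written = save_pipeline({"model": {"name": "abalone-regression"}}, str(Path(tmp_dir) / "demo"))
    round_trip = load_pipeline(written)

print(round_trip)
```

Returning the final path is a small addition over the source method, which appends `.json` silently.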
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline","title":"Pipeline
","text":"Pipeline: SageWorks Pipeline API Class
Common Usage
my_pipeline = Pipeline(\"name\")\nmy_pipeline.details()\nmy_pipeline.execute() # Execute entire pipeline\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
Source code in src/sageworks/api/pipeline.py
class Pipeline:\n \"\"\"Pipeline: SageWorks Pipeline API Class\n\n Common Usage:\n ```python\n my_pipeline = Pipeline(\"name\")\n my_pipeline.details()\n my_pipeline.execute() # Execute entire pipeline\n my_pipeline.execute_partial([\"data_source\", \"feature_set\"])\n my_pipeline.execute_partial([\"model\", \"endpoint\"])\n ```\n \"\"\"\n\n def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n self.s3_client = self.boto3_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n 
self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n\n def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n\n def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n\n def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n\n def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n wr.s3.delete_objects(self.s3_path)\n\n def _get_pipeline(self) -> dict:\n \"\"\"Internal: Get the pipeline as a JSON object from the specified S3 bucket and key.\"\"\"\n response = self.s3_client.get_object(Bucket=self.bucket, Key=self.key)\n json_object = json.loads(response[\"Body\"].read())\n return json_object\n\n def 
__repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__init__","title":"__init__(name)
","text":"Pipeline Init Method
Source code in src/sageworks/api/pipeline.py
def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n self.s3_client = self.boto3_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__repr__","title":"__repr__()
","text":"String representation of this pipeline
Returns:
Name | Type | Description
str | str | String representation of this pipeline
Source code in src/sageworks/api/pipeline.py
def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.delete","title":"delete()
","text":"Pipeline Deletion
Source code in src/sageworks/api/pipeline.py
def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n wr.s3.delete_objects(self.s3_path)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute","title":"execute()
","text":"Execute the entire Pipeline
Raises:
Type | Description
RunTimeException | If the pipeline execution fails in any way
Source code in src/sageworks/api/pipeline.py
def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute_partial","title":"execute_partial(subset)
","text":"Execute a partial Pipeline
Parameters:
Name | Type | Description | Default
subset | list | A subset of the pipeline to execute | required
Raises:
Type | Description
RunTimeException | If the pipeline execution fails in any way
Source code in src/sageworks/api/pipeline.py
def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.report_settable_fields","title":"report_settable_fields(pipeline={}, path='')
","text":"Recursively finds and prints keys with settable fields in a JSON-like dictionary.
Parameters:
Name | Type | Description | Default
pipeline | dict | The pipeline (or sub-pipeline) to process | {}
path | str | Current path to the key, used for nested dictionaries | ''
Source code in src/sageworks/api/pipeline.py
def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n
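The recursive walk above logs its findings. A variant that returns them is easier to test and reuse. The sketch below assumes, as the source does, that a settable field is a string wrapped in `<<` and `>>` and that the word `required` inside it marks a required field. The exact placeholder strings `<<required>>` and `<<optional>>` and the `find_settable_fields` helper are hypothetical.

```python
def find_settable_fields(pipeline: dict, path: str = "") -> list[tuple[str, bool]]:
    """Return (path, is_required) pairs for every <<...>> placeholder in a nested dict."""
    found = []
    for key, value in pipeline.items():
        if isinstance(value, dict):
            # Recurse into the sub-dictionary, extending the path
            found.extend(find_settable_fields(value, path + key + " -> "))
        elif isinstance(value, str) and value.startswith("<<") and value.endswith(">>"):
            found.append((path + key, "required" in value))
    return found


# Hypothetical template with two placeholder values
template = {
    "data_source": {"name": "nightly_data", "input": "<<required>>"},
    "feature_set": {"name": "nightly_features", "holdout_ids": "<<optional>>"},
}
for field_path, is_required in find_settable_fields(template):
    label = "[Required]" if is_required else "[Optional]"
    print(f"{label} Path: {field_path}")
```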
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_input","title":"set_input(input, artifact='data_source')
","text":"Set the input for the Pipeline
Parameters:
Name | Type | Description | Default
input | Union[str, DataFrame] | The input for the Pipeline | required
artifact | str | The artifact to set the input for (default: \"data_source\") | 'data_source'
Source code in src/sageworks/api/pipeline.py
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_training_holdouts","title":"set_training_holdouts(id_column, holdout_ids)
","text":"Set the input for the Pipeline
Parameters:
Name | Type | Description | Default
id_column | str | The column name of the unique identifier | required
holdout_ids | list[str] | The list of unique identifiers to hold out | required
Source code in src/sageworks/api/pipeline.py
def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
"},{"location":"api_classes/pipelines/#examples","title":"Examples","text":"Make a Pipeline
Pipelines are just JSON files (see sageworks/examples/pipelines/
). You can copy one and make changes to fit your objects/use case, or if you have a set of SageWorks artifacts created you can 'backtrack' from the Endpoint and have it create the Pipeline for you.
from pprint import pprint\n\nfrom sageworks.api.pipeline_manager import PipelineManager\n\n# Create a PipelineManager\nmy_manager = PipelineManager()\n\n# List the Pipelines\npprint(my_manager.list_pipelines())\n\n# Create a Pipeline from an Endpoint\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Publish the Pipeline\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Output
Listing Pipelines...\n[{'last_modified': datetime.datetime(2024, 4, 16, 21, 10, 6, tzinfo=tzutc()),\n 'name': 'abalone_pipeline_v1',\n 'size': 445}]\n
Pipeline Details
pipeline_details.py
from pprint import pprint\n\nfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\npprint(my_pipeline.details())\n
Output
{\n \"name\": \"abalone_pipeline_v1\",\n \"s3_path\": \"s3://sandbox/pipelines/abalone_pipeline_v1.json\",\n \"pipeline\": {\n \"data_source\": {\n \"name\": \"abalone_data\",\n \"tags\": [\n \"abalone_data\"\n ],\n \"input\": \"/Users/briford/work/sageworks/data/abalone.csv\"\n },\n \"feature_set\": {\n \"name\": \"abalone_features\",\n \"tags\": [\n \"abalone_features\"\n ],\n \"input\": \"abalone_data\"\n },\n \"model\": {\n \"name\": \"abalone-regression\",\n \"tags\": [\n \"abalone\",\n \"regression\"\n ],\n \"input\": \"abalone_features\"\n },\n ...\n }\n}\n
Pipeline Execution
Executing a Pipeline is the main reason for creating one. It gives you a reproducible way to capture, inspect, and run the same ML pipeline on different data (for example, nightly).
pipeline_execution.py
from sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\n\n# Execute the Pipeline\nmy_pipeline.execute() # Full execution\n\n# Partial executions\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
"},{"location":"api_classes/pipelines/#pipelines-advanced","title":"Pipelines Advanced","text":"As part of the flexible architecture sometimes DataSources or FeatureSets can be created with a Pandas DataFrame. To support a DataFrame as input to a pipeline we can call the set_input()
method on the pipeline object. If you'd like to hold specific rows out of training, call set_training_holdouts()
with the id column and a list of ids.
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the training holdouts for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
Running a pipeline creates and deploys a set of SageWorks Artifacts: DataSource, FeatureSet, Model, and Endpoint. These artifacts can be viewed in the SageMaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/views/","title":"Views","text":"View Examples
Examples of using the Views classes to extend the functionality of SageWorks Artifacts are in the Examples section at the bottom of this page.
Views are a powerful way to filter and augment your DataSources and FeatureSets. With Views you can subset columns, rows, and even add data to existing SageWorks Artifacts. If you want to compute outliers, run some statistics, or engineer some new features, Views are an easy way to change, modify, and add to DataSources and FeatureSets.
View: Read from a view (training, display, etc) for DataSources and FeatureSets.
"},{"location":"api_classes/views/#sageworks.core.views.view.View","title":"View
","text":"View: Read from a view (training, display, etc) for DataSources and FeatureSets.
Common Usage
# Grab the Display View for a DataSource\ndisplay_view = ds.view(\"display\")\nprint(display_view.columns)\n\n# Pull a DataFrame for the view\ndf = display_view.pull_dataframe()\n\n# Views also work with FeatureSets\ncomp_view = fs.view(\"computation\")\ncomp_df = comp_view.pull_dataframe()\n\n# Query the view with a custom SQL query\nquery = f\"SELECT * FROM {comp_view.table} WHERE age > 30\"\ndf = comp_view.query(query)\n\n# Delete the view\ncomp_view.delete()\n
Source code in src/sageworks/core/views/view.py
class View:\n \"\"\"View: Read from a view (training, display, etc) for DataSources and FeatureSets.\n\n Common Usage:\n ```python\n\n # Grab the Display View for a DataSource\n display_view = ds.view(\"display\")\n print(display_view.columns)\n\n # Pull a DataFrame for the view\n df = display_view.pull_dataframe()\n\n # Views also work with FeatureSets\n comp_view = fs.view(\"computation\")\n comp_df = comp_view.pull_dataframe()\n\n # Query the view with a custom SQL query\n query = f\"SELECT * FROM {comp_view.table} WHERE age > 30\"\n df = comp_view.query(query)\n\n # Delete the view\n comp_view.delete()\n ```\n \"\"\"\n\n # Class attributes\n log = logging.getLogger(\"sageworks\")\n meta = Meta()\n\n def __init__(self, artifact: Union[DataSource, FeatureSet], view_name: str, **kwargs):\n \"\"\"View Constructor: Retrieve a View for the given artifact\n\n Args:\n artifact (Union[DataSource, FeatureSet]): A DataSource or FeatureSet object\n view_name (str): The name of the view to retrieve (e.g. 
\"training\")\n \"\"\"\n\n # Set the view name\n self.view_name = view_name\n\n # Is this a DataSource or a FeatureSet?\n self.is_feature_set = isinstance(artifact, FeatureSetCore)\n self.auto_id_column = artifact.id_column if self.is_feature_set else None\n\n # Get the data_source from the artifact\n self.artifact_name = artifact.uuid\n self.data_source = artifact.data_source if self.is_feature_set else artifact\n self.database = self.data_source.database\n\n # Construct our base_table_name\n self.base_table_name = self.data_source.table\n\n # Check if the view should be auto created\n self.auto_created = False\n if kwargs.get(\"auto_create_view\", True) and not self.exists():\n\n # A direct double check before we auto-create\n if not self.exists(skip_cache=True):\n self.log.important(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist, attempting to auto-create...\"\n )\n self.auto_created = self._auto_create_view()\n\n # Check for failure of the auto-creation\n if not self.auto_created:\n self.log.error(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist and cannot be auto-created...\"\n )\n self.view_name = self.columns = self.column_types = self.source_table = self.base_table_name = None\n return\n\n # Now fill some details about the view\n self.columns, self.column_types, self.source_table, self.join_view = view_details(\n self.table, self.data_source.database, self.data_source.boto3_session\n )\n\n def pull_dataframe(self, limit: int = 50000, head: bool = False) -> Union[pd.DataFrame, None]:\n \"\"\"Pull a DataFrame based on the view type\n\n Args:\n limit (int): The maximum number of rows to pull (default: 50000)\n head (bool): Return just the head of the DataFrame (default: False)\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist\n \"\"\"\n\n # Pull the DataFrame\n if head:\n limit = 5\n pull_query = f'SELECT * FROM \"{self.table}\" LIMIT {limit}'\n df = 
self.data_source.query(pull_query)\n return df\n\n def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the view with a custom SQL query\n\n Args:\n query (str): The SQL query to execute\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist\n \"\"\"\n return self.data_source.query(query)\n\n def column_details(self) -> dict:\n \"\"\"Return a dictionary of the column names and types for this view\n\n Returns:\n dict: A dictionary of the column names and types\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n\n @property\n def table(self) -> str:\n \"\"\"Construct the view table name for the given view type\n\n Returns:\n str: The view table name\n \"\"\"\n if self.view_name is None:\n return None\n if self.view_name == \"base\":\n return self.base_table_name\n return f\"{self.base_table_name}_{self.view_name}\"\n\n def delete(self):\n \"\"\"Delete the database view (and supplemental data) if it exists.\"\"\"\n\n # List any supplemental tables for this data source\n supplemental_tables = list_supplemental_data_tables(self.base_table_name, self.database)\n for table in supplemental_tables:\n if self.view_name in table:\n self.log.important(f\"Deleting Supplemental Table {table}...\")\n delete_table(table, self.database, self.data_source.boto3_session)\n\n # Now drop the view\n self.log.important(f\"Dropping View {self.table}...\")\n drop_view_query = f'DROP VIEW \"{self.table}\"'\n\n # Execute the DROP VIEW query\n try:\n self.data_source.execute_statement(drop_view_query, silence_errors=True)\n except wr.exceptions.QueryFailed as e:\n if \"View not found\" in str(e):\n self.log.info(f\"View {self.table} not found, this is fine...\")\n else:\n raise\n\n # We want to do a small sleep so that AWS has time to catch up\n self.log.info(\"Sleeping for 3 seconds after dropping view to allow AWS to catch up...\")\n time.sleep(3)\n\n def exists(self, skip_cache: bool = False) -> bool:\n \"\"\"Check if 
the view exists in the database\n\n Args:\n skip_cache (bool): Skip the cache and check the database directly (default: False)\n Returns:\n bool: True if the view exists, False otherwise.\n \"\"\"\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # If we're skipping the cache, we need to check the database directly\n if skip_cache:\n return self._check_database()\n\n # Use the meta class to see if the view exists\n views_df = self.meta.views(self.database)\n\n # Check if we have ANY views\n if views_df.empty:\n return False\n\n # Check if the view exists\n return self.table in views_df[\"Name\"].values\n\n def ensure_exists(self):\n \"\"\"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it\"\"\"\n\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # Check the database directly\n if not self._check_database():\n self._auto_create_view()\n\n def _check_database(self) -> bool:\n \"\"\"Internal: Check if the view exists in the database\n\n Returns:\n bool: True if the view exists, False otherwise\n \"\"\"\n # Query to check if the table/view exists\n check_table_query = f\"\"\"\n SELECT table_name\n FROM information_schema.tables\n WHERE table_schema = '{self.database}' AND table_name = '{self.table}'\n \"\"\"\n _df = self.data_source.query(check_table_query)\n return not _df.empty\n\n def _auto_create_view(self) -> bool:\n \"\"\"Internal: Automatically create a view training, display, and computation views\n\n Returns:\n bool: True if the view was created, False otherwise\n\n Raises:\n ValueError: If the view type is not supported\n \"\"\"\n from sageworks.core.views import DisplayView, ComputationView, TrainingView\n\n # First if we're going to auto-create, we need to make sure the data source exists\n if not self.data_source.exists():\n self.log.error(f\"Data Source {self.data_source.uuid} does not exist...\")\n return False\n\n # DisplayView\n if 
self.view_name == \"display\":\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n DisplayView.create(self.data_source)\n return True\n\n # ComputationView\n if self.view_name == \"computation\":\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n ComputationView.create(self.data_source)\n return True\n\n # TrainingView\n if self.view_name == \"training\":\n # We're only going to create training views for FeatureSets\n if self.is_feature_set:\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n TrainingView.create(self.data_source, id_column=self.auto_id_column)\n return True\n else:\n self.log.warning(\"Training Views are only supported for FeatureSets...\")\n return False\n\n # If we get here, we don't support auto-creating this view\n self.log.warning(f\"Auto-Create for {self.view_name} not implemented yet...\")\n return False\n\n def __repr__(self):\n \"\"\"Return a string representation of this object\"\"\"\n\n # Set up various details that we want to print out\n auto = \"(Auto-Created)\" if self.auto_created else \"\"\n artifact = \"FeatureSet\" if self.is_feature_set else \"DataSource\"\n\n info = f'View: \"{self.view_name}\" for {artifact}(\"{self.artifact_name}\")\\n'\n info += f\" Database: {self.database}\\n\"\n info += f\" Table: {self.table}{auto}\\n\"\n info += f\" Source Table: {self.source_table}\\n\"\n info += f\" Join View: {self.join_view}\"\n return info\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.table","title":"table: str
property
","text":"Construct the view table name for the given view type
Returns:
Name Type Descriptionstr
str
The view table name
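The naming rule behind this property can be sketched as a standalone function. This is a minimal illustration that mirrors the `table` property shown in the source listing; `view_table_name` is a hypothetical helper name and not part of the SageWorks API.

```python
from typing import Optional


def view_table_name(base_table_name: str, view_name: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the View.table naming rule."""
    if view_name is None:
        return None  # a view that could not be resolved has no table
    if view_name == "base":
        return base_table_name  # the base view is the source table itself
    return f"{base_table_name}_{view_name}"


print(view_table_name("abalone_features", "training"))  # abalone_features_training
print(view_table_name("abalone_features", "base"))  # abalone_features
```

The table names used here are example values only.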
"},{"location":"api_classes/views/#sageworks.core.views.view.View.__init__","title":"__init__(artifact, view_name, **kwargs)
","text":"View Constructor: Retrieve a View for the given artifact
Parameters:
Name Type Description Defaultartifact
Union[DataSource, FeatureSet]
A DataSource or FeatureSet object
requiredview_name
str
The name of the view to retrieve (e.g. \"training\")
required Source code insrc/sageworks/core/views/view.py
def __init__(self, artifact: Union[DataSource, FeatureSet], view_name: str, **kwargs):\n \"\"\"View Constructor: Retrieve a View for the given artifact\n\n Args:\n artifact (Union[DataSource, FeatureSet]): A DataSource or FeatureSet object\n view_name (str): The name of the view to retrieve (e.g. \"training\")\n \"\"\"\n\n # Set the view name\n self.view_name = view_name\n\n # Is this a DataSource or a FeatureSet?\n self.is_feature_set = isinstance(artifact, FeatureSetCore)\n self.auto_id_column = artifact.id_column if self.is_feature_set else None\n\n # Get the data_source from the artifact\n self.artifact_name = artifact.uuid\n self.data_source = artifact.data_source if self.is_feature_set else artifact\n self.database = self.data_source.database\n\n # Construct our base_table_name\n self.base_table_name = self.data_source.table\n\n # Check if the view should be auto created\n self.auto_created = False\n if kwargs.get(\"auto_create_view\", True) and not self.exists():\n\n # A direct double check before we auto-create\n if not self.exists(skip_cache=True):\n self.log.important(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist, attempting to auto-create...\"\n )\n self.auto_created = self._auto_create_view()\n\n # Check for failure of the auto-creation\n if not self.auto_created:\n self.log.error(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist and cannot be auto-created...\"\n )\n self.view_name = self.columns = self.column_types = self.source_table = self.base_table_name = None\n return\n\n # Now fill some details about the view\n self.columns, self.column_types, self.source_table, self.join_view = view_details(\n self.table, self.data_source.database, self.data_source.boto3_session\n )\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.__repr__","title":"__repr__()
","text":"Return a string representation of this object
Source code insrc/sageworks/core/views/view.py
def __repr__(self):\n \"\"\"Return a string representation of this object\"\"\"\n\n # Set up various details that we want to print out\n auto = \"(Auto-Created)\" if self.auto_created else \"\"\n artifact = \"FeatureSet\" if self.is_feature_set else \"DataSource\"\n\n info = f'View: \"{self.view_name}\" for {artifact}(\"{self.artifact_name}\")\\n'\n info += f\" Database: {self.database}\\n\"\n info += f\" Table: {self.table}{auto}\\n\"\n info += f\" Source Table: {self.source_table}\\n\"\n info += f\" Join View: {self.join_view}\"\n return info\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.column_details","title":"column_details()
","text":"Return a dictionary of the column names and types for this view
Returns:
Name Type Descriptiondict
dict
A dictionary of the column names and types
Source code insrc/sageworks/core/views/view.py
def column_details(self) -> dict:\n \"\"\"Return a dictionary of the column names and types for this view\n\n Returns:\n dict: A dictionary of the column names and types\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n
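As a small illustration of what this method returns, the sketch below pairs two parallel lists the same way `column_details()` does. The column names and types are made-up example values, not output from a real view.

```python
# Hypothetical column names and types, for illustration only
columns = ["id", "name", "height"]
column_types = ["bigint", "string", "double"]

# Same pairing that column_details() performs on self.columns / self.column_types
details = dict(zip(columns, column_types))
print(details)  # {'id': 'bigint', 'name': 'string', 'height': 'double'}
```

Note that `zip` stops at the shorter of the two lists, so a mismatch in lengths would silently drop entries.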
"},{"location":"api_classes/views/#sageworks.core.views.view.View.delete","title":"delete()
","text":"Delete the database view (and supplemental data) if it exists.
Source code insrc/sageworks/core/views/view.py
def delete(self):\n \"\"\"Delete the database view (and supplemental data) if it exists.\"\"\"\n\n # List any supplemental tables for this data source\n supplemental_tables = list_supplemental_data_tables(self.base_table_name, self.database)\n for table in supplemental_tables:\n if self.view_name in table:\n self.log.important(f\"Deleting Supplemental Table {table}...\")\n delete_table(table, self.database, self.data_source.boto3_session)\n\n # Now drop the view\n self.log.important(f\"Dropping View {self.table}...\")\n drop_view_query = f'DROP VIEW \"{self.table}\"'\n\n # Execute the DROP VIEW query\n try:\n self.data_source.execute_statement(drop_view_query, silence_errors=True)\n except wr.exceptions.QueryFailed as e:\n if \"View not found\" in str(e):\n self.log.info(f\"View {self.table} not found, this is fine...\")\n else:\n raise\n\n # We want to do a small sleep so that AWS has time to catch up\n self.log.info(\"Sleeping for 3 seconds after dropping view to allow AWS to catch up...\")\n time.sleep(3)\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.ensure_exists","title":"ensure_exists()
","text":"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it
Source code insrc/sageworks/core/views/view.py
def ensure_exists(self):\n \"\"\"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it\"\"\"\n\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # Check the database directly\n if not self._check_database():\n self._auto_create_view()\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.exists","title":"exists(skip_cache=False)
","text":"Check if the view exists in the database
Parameters:
Name Type Description Defaultskip_cache
bool
Skip the cache and check the database directly (default: False)
False
Returns: bool: True if the view exists, False otherwise.
Source code insrc/sageworks/core/views/view.py
def exists(self, skip_cache: bool = False) -> bool:\n \"\"\"Check if the view exists in the database\n\n Args:\n skip_cache (bool): Skip the cache and check the database directly (default: False)\n Returns:\n bool: True if the view exists, False otherwise.\n \"\"\"\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # If we're skipping the cache, we need to check the database directly\n if skip_cache:\n return self._check_database()\n\n # Use the meta class to see if the view exists\n views_df = self.meta.views(self.database)\n\n # Check if we have ANY views\n if views_df.empty:\n return False\n\n # Check if the view exists\n return self.table in views_df[\"Name\"].values\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.pull_dataframe","title":"pull_dataframe(limit=50000, head=False)
","text":"Pull a DataFrame based on the view type
Parameters:
Name Type Description Defaultlimit
int
The maximum number of rows to pull (default: 50000)
50000
head
bool
Return just the head of the DataFrame (default: False)
False
Returns:
Type DescriptionUnion[DataFrame, None]
Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist
Source code insrc/sageworks/core/views/view.py
def pull_dataframe(self, limit: int = 50000, head: bool = False) -> Union[pd.DataFrame, None]:\n \"\"\"Pull a DataFrame based on the view type\n\n Args:\n limit (int): The maximum number of rows to pull (default: 50000)\n head (bool): Return just the head of the DataFrame (default: False)\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist\n \"\"\"\n\n # Pull the DataFrame\n if head:\n limit = 5\n pull_query = f'SELECT * FROM \"{self.table}\" LIMIT {limit}'\n df = self.data_source.query(pull_query)\n return df\n
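To make the effect of `limit` and `head` concrete, here is a minimal sketch of the SQL string that the method above builds before handing it to the data source. `build_pull_query` is a hypothetical helper written for this illustration; it does not run any query.

```python
def build_pull_query(table: str, limit: int = 50000, head: bool = False) -> str:
    """Hypothetical helper: return the SQL string that pull_dataframe() constructs."""
    if head:
        limit = 5  # head=True overrides whatever limit was passed in
    return f'SELECT * FROM "{table}" LIMIT {limit}'


print(build_pull_query("abalone_features_training"))
# SELECT * FROM "abalone_features_training" LIMIT 50000
print(build_pull_query("abalone_features_training", limit=100, head=True))
# SELECT * FROM "abalone_features_training" LIMIT 5
```

The table name is quoted in the query, and `head=True` takes precedence over an explicit `limit`.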
"},{"location":"api_classes/views/#sageworks.core.views.view.View.query","title":"query(query)
","text":"Query the view with a custom SQL query
Parameters:
Name Type Description Defaultquery
str
The SQL query to execute
requiredReturns:
Type DescriptionUnion[DataFrame, None]
Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist
Source code insrc/sageworks/core/views/view.py
def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the view with a custom SQL query\n\n Args:\n query (str): The SQL query to execute\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist\n \"\"\"\n return self.data_source.query(query)\n
"},{"location":"api_classes/views/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Listing Views
views.pyfrom sageworks.api.data_source import DataSource\n\n# List the views available for a Data Source\ntest_data = DataSource('test_data')\ntest_data.views()\n[\"display\", \"training\", \"computation\"]\n
Getting a Particular View
views.pyfrom sageworks.api.feature_set import FeatureSet\n\nfs = FeatureSet('test_features')\n\n# Grab the columns for the display view\ndisplay_view = fs.view(\"display\")\ndisplay_view.columns\n['id', 'name', 'height', 'weight', 'salary', ...]\n\n# Pull the dataframe for this view\ndf = display_view.pull_dataframe()\n id name height weight salary ...\n0 58 Person 58 71.781227 275.088196 162053.140625 \n
View Queries
All SageWorks Views are stored in AWS Athena, so any query that you can make with Athena is accessible through the View Query API.
view_query.pyfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet View\nfs = FeatureSet(\"abalone_features\")\nt_view = fs.view(\"training\")\n\n# Make some queries using the Athena backend\ndf = t_view.query(f\"select * from {t_view.table} where height > .3\")\nprint(df.head())\n\ndf = t_view.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Classes to construct View
The SageWorks Classes used to construct views are currently in 'Core', so you can check out the documentation for those classes here: SageWorks View Creators
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord
"},{"location":"aws_setup/aws_access_management/","title":"AWS Access Management","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord
This page gives an overview of how SageWorks sets up roles and policies in a granular way that provides 'least privilege' and also provides a unified framework for AWS access management.
"},{"location":"aws_setup/aws_access_management/#conceptual-slide-deck","title":"Conceptual Slide Deck","text":"SageWorks AWS Access Management
"},{"location":"aws_setup/aws_access_management/#aws-resources","title":"AWS Resources","text":"Follow the steps below to set up and connect using AWS Client VPN.
"},{"location":"aws_setup/aws_client_vpn/#step-1-create-a-client-vpn-endpoint-in-aws","title":"Step 1: Create a Client VPN Endpoint in AWS","text":"10.0.0.0/22
) that doesn\u2019t overlap with your VPC CIDR.0.0.0.0/0
to allow access to all resources in the VPC.Allow access
and specify the group you created or allow all users.AWS Client VPN is a straightforward, secure, and effective solution for connecting your laptop to an AWS VPC. It requires minimal setup and provides all the security controls you need, making it ideal for a single laptop and user.
"},{"location":"aws_setup/aws_setup/","title":"AWS Setup","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord
"},{"location":"aws_setup/aws_setup/#get-some-information","title":"Get some information","text":"Write these values down, you'll need them as part of this AWS setup.
"},{"location":"aws_setup/aws_setup/#install-aws-cli","title":"Install AWS CLI","text":"AWS CLI Instructions
"},{"location":"aws_setup/aws_setup/#running-the-sso-configuration","title":"Running the SSO Configuration","text":"Note: You only need to do this once! Also this will create a NEW profile, so name the profile something like aws_sso
.
aws configure sso --profile <whatever> (e.g. aws_sso)\nSSO session name (Recommended): sso-session\nSSO start URL []: <the Start URL from info above>\nSSO region []: <the Region from info above>\nSSO registration scopes [sso:account:access]: <just hit return>\n
At this point a browser window will open (or redirect) and you'll get a list of available accounts, something like the listing below. Just pick the correct account.
There are 2 AWS accounts available to you.\n> SCP_Sandbox, briford+sandbox@supercowpowers.com (XXXX40646YYY)\n SCP_Main, briford@supercowpowers.com (XXX576391YYY)\n
Now pick the role that you're going to use
There are 2 roles available to you.\n> DataScientist\n AdministratorAccess\n\nCLI default client Region [None]: <same region as above>\nCLI default output format [None]: json\n
"},{"location":"aws_setup/aws_setup/#setting-up-some-aliases-for-bashzsh","title":"Setting up some aliases for bash/zsh","text":"Edit your favorite ~/.bashrc ~/.zshrc and add these nice aliases/helper
# AWS Aliases\nalias aws_sso='export AWS_PROFILE=aws_sso'\n\n# Default AWS Profile\nexport AWS_PROFILE=aws_sso\n
"},{"location":"aws_setup/aws_setup/#testing-your-new-aws-profile","title":"Testing your new AWS Profile","text":"Make sure your profile is active/set
env | grep AWS\nAWS_PROFILE=<aws_sso or whatever>\n
Now you can list the S3 buckets in the AWS Account aws s3 ls\n
If you get some message like this... The SSO session associated with this profile has\nexpired or is otherwise invalid. To refresh this SSO\nsession run aws sso login with the corresponding\nprofile.\n
This is fine/good, a browser will open up and you can refresh your SSO Token.
After that you should get a listing of the S3 buckets without needing to refresh your token.
aws s3 ls\n\u276f aws s3 ls\n2023-03-20 20:06:53 aws-athena-query-results-XXXYYY-us-west-2\n2023-03-30 13:22:28 sagemaker-studio-XXXYYY-dbgyvq8ruka\n2023-03-24 22:05:55 sagemaker-us-west-2-XXXYYY\n2023-04-30 13:43:29 scp-sageworks-artifacts\n
"},{"location":"aws_setup/aws_setup/#back-to-initial-setup","title":"Back to Initial Setup","text":"If you're doing the initial setup of SageWorks you should now go back and finish that process: Getting Started
"},{"location":"aws_setup/aws_setup/#aws-resources","title":"AWS Resources","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord
This page tries to give helpful guidance when setting up AWS Accounts, Users, and Groups. In general AWS can be a bit tricky to set up the first time. Feel free to use any material in this guide but we're more than happy to help clients get their AWS Setup ready to go for FREE. Below are some guides for setting up a new AWS account for SageWorks and also setting up SSO Users and Groups within AWS.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-with-aws-organizations-easy","title":"New AWS Account (with AWS Organizations: easy)","text":"Email Trick
AWS will often not allow the same email to be used for different accounts. If you need a 'new' email, add a plus sign '+' and a short tag to the end of the username in your existing email (e.g. bob.smith+aws@gmail.com). With providers that support plus addressing, such as Gmail, mail sent to this address is delivered to bob.smith@gmail.com.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-without-aws-organizations-a-bit-harder","title":"New AWS Account (without AWS Organizations: a bit harder)","text":"AWS SSO (Single Sign-On) is a cloud-based service that allows users to manage access to multiple AWS accounts and business applications using a single set of credentials. It simplifies the authentication process for users and provides centralized management of permissions and access control across various AWS resources. With AWS SSO, users can log in once and access all the applications and accounts they need, streamlining the user experience and increasing productivity. AWS SSO also enables IT administrators to manage access more efficiently by providing a single point of control for managing user access, permissions, and policies, reducing the risk of unauthorized access or security breaches.
"},{"location":"aws_setup/aws_tips_and_tricks/#setting-up-sso-users","title":"Setting up SSO Users","text":"The 'Add User' setup is fairly straightforward, but here are some screenshots:
On the first panel you can fill in the user's information.
"},{"location":"aws_setup/aws_tips_and_tricks/#groups","title":"Groups","text":"On the second panel we suggest that you have at LEAST two groups:
This allows you to put most of the users into the DataScientists group that has AWS policies based on their job role. AWS uses 'permission sets' and you assign AWS Policies. This approach makes it easy to give a group of users a set of relevant policies for their tasks.
Our standard setup is to have two permission sets with the following policies:
Add Policy: arn:aws:iam::aws:policy/job-function/DataScientist
IAM Identity Center --> Permission sets --> AdministratorAccess
See: Permission Sets for more details and instructions.
Another benefit of creating groups is that you can include that group in 'Trust Policy (assume_role)' for the SageWorks-ExecutionRole (this gets deployed as part of the SageWorks AWS Stack). This means that the management of what SageWorks can do/see/read/write is completely done through the SageWorks-ExecutionRole.
"},{"location":"aws_setup/aws_tips_and_tricks/#back-to-adding-user","title":"Back to Adding User","text":"Okay now that we have our groups set up we can go back to our original goal of adding a user. So here's the second panel with the groups and now we can hit 'Next'
On the third panel just review the details and hit the 'Add User' button at the bottom. The user will get an email giving them instructions on how to log on to their AWS account.
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-console","title":"AWS Console","text":"Now when the user logs onto the AWS Console they should see something like this:
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-clisso-setup-for-command-linepython-usage","title":"AWS CLI/SSO Setup for Command Line/Python Usage","text":"Please see our AWS Setup
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-resources","title":"AWS Resources","text":"Welcome to the SageWorks AWS Setup Guide. SageWorks is deployed as an AWS Stack following the well architected system practices of AWS.
AWS Setup can be a bit complex
Setting up SageWorks with AWS can be a bit complex, but this only needs to be done ONCE for your entire company. The install uses standard CDK --> AWS Stacks and SageWorks tries to make it straightforward. If you have any trouble at all, feel free to contact us at sageworks@supercowpowers.com or on Discord and we're happy to help you with AWS for FREE.
"},{"location":"aws_setup/core_stack/#two-main-options-when-using-sageworks","title":"Two main options when using SageWorks","text":"Both of these options are fully supported, but we highly suggest a NEW account as it gives the following benefits:
If your AWS Account already has users and groups set up you can skip this, but here are our recommendations on setting up SSO Users and Groups
"},{"location":"aws_setup/core_stack/#onboarding-sageworks-to-your-aws-account","title":"Onboarding SageWorks to your AWS Account","text":"Pulling down the SageWorks Repo
git clone https://github.com/SuperCowPowers/sageworks.git\n
"},{"location":"aws_setup/core_stack/#sageworks-uses-aws-python-cdk-for-deployments","title":"SageWorks uses AWS Python CDK for Deployments","text":"If you don't have AWS CDK already installed you can do these steps:
Mac
brew install node \nnpm install -g aws-cdk\n
Linux sudo apt install nodejs\nsudo npm install -g aws-cdk\n
For more information on Linux installs see Digital Ocean NodeJS"},{"location":"aws_setup/core_stack/#create-an-s3-bucket-for-sageworks","title":"Create an S3 Bucket for SageWorks","text":"SageWorks pushes and pulls data from AWS and will use this S3 Bucket for storage and processing. You should create a NEW S3 Bucket; we suggest a name like <company_name>-sageworks
Do the initial setup/config here: Getting Started. After you've done that come back to this section. For Stack Deployment additional things need to be added to your config file. The config file will be located in your home directory ~/.sageworks/sageworks_config.json
. Edit this file and add the additional settings for the deployment. Specifically, there are two additional fields that can be added (both are optional)
\"SAGEWORKS_SSO_GROUP\": \"DataScientist\" (or whatever)\n\"SAGEWORKS_ADDITIONAL_BUCKETS\": \"bucket1, bucket2\"\n
These are optional but are set/used by most SageWorks users. AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_core\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
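If you would rather script the config edit described earlier in this section than edit the JSON by hand, the following is a minimal sketch. It assumes the config file is plain JSON and uses only the two field names given above (`SAGEWORKS_SSO_GROUP` and `SAGEWORKS_ADDITIONAL_BUCKETS`); `add_deployment_settings` is a hypothetical helper, not part of SageWorks, and the demo writes to a throwaway file rather than your real config.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def add_deployment_settings(
    config_path: Path,
    sso_group: Optional[str] = None,
    additional_buckets: Optional[str] = None,
) -> dict:
    """Hypothetical helper: add the optional deployment fields to a JSON config file."""
    config = json.loads(config_path.read_text())
    if sso_group is not None:
        config["SAGEWORKS_SSO_GROUP"] = sso_group
    if additional_buckets is not None:
        config["SAGEWORKS_ADDITIONAL_BUCKETS"] = additional_buckets
    config_path.write_text(json.dumps(config, indent=4))
    return config


# Demo against a throwaway file. For a real edit, point config_path at
# Path.home() / ".sageworks" / "sageworks_config.json"
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "sageworks_config.json"
    demo_path.write_text(json.dumps({"EXISTING_KEY": "value"}))  # placeholder content
    updated = add_deployment_settings(demo_path, "DataScientist", "bucket1, bucket2")
    print(updated)
```

Existing keys in the file are left untouched, and a field is only written when a value is passed in, which matches both fields being optional.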
"},{"location":"aws_setup/core_stack/#aws-account-setup-check","title":"AWS Account Setup Check","text":"After setting up SageWorks config/AWS Account you can run this test/checking script. If the results ends with INFO AWS Account Clamp: AOK!
you're in good shape. If not feel free to contact us on Discord and we'll get it straightened out for you :)
pip install sageworks (if not already installed)\ncd sageworks/aws_setup\npython aws_account_check.py\n<lot of print outs for various checks>\nINFO AWS Account Clamp: AOK!\n
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/dashboard_stack/","title":"Deploy the SageWorks Dashboard Stack","text":"Deploying the Dashboard Stack is reasonably straightforward; it's the same approach as the Core Stack that you've already deployed.
Please review the Stack Details section to understand all the AWS components that are included and utilized in the SageWorks Dashboard Stack.
"},{"location":"aws_setup/dashboard_stack/#deploying-the-dashboard-stack","title":"Deploying the Dashboard Stack","text":"AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_dashboard_full\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/dashboard_stack/#stack-details","title":"Stack Details","text":"AWS Questions?
There's quite a bit to unpack when deploying an AWS powered Web Service. We're happy to help walk you through the details and options. Contact us anytime for a free consultation.
AWS Costs
Deploying the SageWorks Dashboard does incur some monthly AWS costs. If you're on a tight budget you can deploy the 'lite' version of the Dashboard Stack.
cd sageworks/aws_setup/sageworks_dashboard_lite\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/domain_cert_setup/","title":"AWS Domain and Certificate Instructions","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord
This page tries to give helpful guidance when setting up a new domain and SSL Certificate in your AWS Account.
"},{"location":"aws_setup/domain_cert_setup/#new-domain","title":"New Domain","text":"You'll want the SageWorks Dashboard to have a domain for your company's internal use. Customers will typically use a domain like <company_name>-ml-dashboard.com
but you are free to choose any domain you'd like.
Domains are tied to AWS Accounts
When you create a new domain in AWS Route 53, that domain is tied to that AWS Account. You can do a cross-account setup for domains but it's a bit trickier. We recommend that each account where SageWorks gets deployed owns the domain for that Dashboard.
"},{"location":"aws_setup/domain_cert_setup/#multiple-aws-accounts","title":"Multiple AWS Accounts","text":"Many customers will have a dev/stage/prod set of AWS accounts. If that's the case, the best practice is to make a domain specific to each account. So for instance:
<company_name>-ml-dashboard-dev.com
<company_name>-ml-dashboard-prod.com
.This means that when you go to that Dashboard it's super obvious which environment you're on.
"},{"location":"aws_setup/domain_cert_setup/#register-the-domain","title":"Register the Domain","text":"Open Route 53 Console Route 53 Console
Register your New Domain
Open ACM Console: AWS Certificate Manager (ACM) Console
Request a Certificate:
Add Domain Names:
yourdomain.com
).www.yourdomain.com
).Validation Method:
Add Tags (Optional):
Review and Request:
To complete the domain validation process for your SSL/TLS certificate, you need to add the CNAME records provided by AWS Certificate Manager (ACM) to your Route 53 hosted zone. This step ensures that you own the domain and allows ACM to issue the certificate.
"},{"location":"aws_setup/domain_cert_setup/#finding-cname-record-names-and-values","title":"Finding CNAME Record Names and Values","text":"You can find the CNAME record names and values in the AWS Certificate Manager (ACM) console:
Open ACM Console: AWS Certificate Manager (ACM) Console
Select Your Certificate:
View Domains Section:
Open Route 53 Console: Route 53 Console
Select Your Hosted Zone:
yourdomain.com
).Add the First CNAME Record:
_3e8623442477e9eeec.your-domain.com
).CNAME
._0908c89646d92.sdgjtdhdhz.acm-validations.aws.
) (include the trailing dot).Add the Second CNAME Record:
_75cd9364c643caa.www.your-domain.com
).CNAME
._f72f8cff4fb20f4.sdgjhdhz.acm-validations.aws.
) (include the trailing dot).DNS Propagation and Cert Validation
After adding the CNAME records, these DNS records will propagate through the DNS system and ACM will automatically detect the validation records and validate the domain. This process can take a few minutes or up to an hour.
"},{"location":"aws_setup/domain_cert_setup/#certificate-states","title":"Certificate States","text":"After requesting a certificate, it will go through the following states:
Pending Validation: The initial state after you request a certificate and before you complete the validation process. ACM is waiting for you to prove domain ownership by adding the CNAME records.
Issued: This state indicates that the certificate has been successfully validated and issued. You can now use this certificate with your AWS resources.
Validation Timed Out: If you do not complete the validation process within a specified period (usually 72 hours), the certificate request times out and enters this state.
Revoked: This state indicates that the certificate has been revoked and is no longer valid.
Failed: If the validation process fails for any reason, the certificate enters this state.
Inactive: This state indicates that the certificate is not currently in use.
The certificate status should be Issued. If it is not, please contact the SageWorks Support Team.
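The state check can also be done in code. The sketch below is a minimal illustration that assumes the response has the shape returned by boto3's ACM `describe_certificate` call (a `Certificate` dict with `Status` and `CertificateArn` keys); `certificate_arn_if_issued` is a hypothetical helper, and the sample responses are hand-written placeholders rather than real AWS output.

```python
from typing import Optional

# ACM reports the Issued state with this status string
USABLE_STATUS = "ISSUED"


def certificate_arn_if_issued(response: dict) -> Optional[str]:
    """Hypothetical helper: return the certificate ARN only when the status is ISSUED."""
    cert = response["Certificate"]
    if cert["Status"] == USABLE_STATUS:
        return cert["CertificateArn"]
    return None  # any other state means the certificate is not usable yet


# Hand-written sample responses (placeholder ARN, not real data)
placeholder_arn = "arn:aws:acm:us-west-2:123456789012:certificate/example"
issued = {"Certificate": {"Status": "ISSUED", "CertificateArn": placeholder_arn}}
pending = {"Certificate": {"Status": "PENDING_VALIDATION", "CertificateArn": placeholder_arn}}

print(certificate_arn_if_issued(issued))  # prints the placeholder ARN
print(certificate_arn_if_issued(pending))  # None
```

With real credentials, the `response` argument would come from an ACM client call; that call is left out here so the example runs without AWS access.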
"},{"location":"aws_setup/domain_cert_setup/#retrieving-the-certificate-arn","title":"Retrieving the Certificate ARN","text":"Open ACM Console:
Check the Status:
Copy the Certificate ARN:
You now have the ARN for your certificate, which you can use in your AWS resources such as API Gateway, CloudFront, etc.
"},{"location":"aws_setup/domain_cert_setup/#aws-resources","title":"AWS Resources","text":"Now that the core SageWorks AWS Stack has been deployed, let's test out SageWorks by building an entire AWS ML Pipeline from start to finish. The script build_ml_pipeline.py
uses the SageWorks API to quickly and easily build an AWS Modeling Pipeline.
Taste the Awesome
The SageWorks \"hello world\" builds a full AWS ML Pipeline, from S3 to deployed model and endpoint. If you have any trouble at all, feel free to contact us at sageworks@supercowpowers.com or on Discord and we're happy to help you for FREE.
This script will take a LONG TIME to run. Most of that time is spent waiting on AWS to finalize FeatureGroups, train Models, or deploy Endpoints.
\u276f python build_ml_pipeline.py\n<lot of building ML pipeline outputs>\n
After the script completes you will see that it's built out an AWS ML Pipeline and testing artifacts."},{"location":"aws_setup/full_pipeline/#run-the-sageworks-dashboard-local","title":"Run the SageWorks Dashboard (Local)","text":"Dashboard AWS Stack
Deploying the Dashboard Stack is straightforward and provides a robust AWS Web Server with Load Balancer, Elastic Container Service, VPC Networks, etc. (see AWS Dashboard Stack)
For testing it's nice to run the Dashboard locally, but for long-term use the SageWorks Dashboard should be deployed as an AWS Stack. The deployed Stack allows everyone in the company to use, view, and interact with the AWS Machine Learning Artifacts created with SageWorks.
cd sageworks/application/aws_dashboard\n./dashboard\n
This will open a browser to http://localhost:8000
SageWorks Dashboard: AWS Pipelines in a Whole New Light!
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"blogs_research/","title":"SageWorks Blogs","text":"Just Getting Started?
The SageWorks Blogs are a great way to see what's possible with SageWorks. If you're ready to jump in, the API Classes will give you details on the SageWorks ML Pipeline Classes.
"},{"location":"blogs_research/#blogs","title":"Blogs","text":"Examples
All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
SageWorks EDA
The SageWorks toolkit has a set of plots that show EDA results. It also has a flexible plugin architecture to expand, enhance, or even replace the current set of Dashboard web components.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created that data is run through a full set of EDA techniques:
One of the latest EDA techniques we've added is a concept called High Target Gradients.
\\[G_{ij} = \\frac{|y_i - y_j|}{d(x_i, x_j)}\\]
where \\(d(x_i, x_j)\\) is the distance between \\(x_i\\) and \\(x_j\\) in the feature space. This equation gives you the rate of change of the target value with respect to the change in features, similar to a slope in a two-dimensional space.
\\[G_{i}^{max} = \\max_{j \\neq i} G_{ij}\\]
This gives you a scalar value for each point in your training data that represents the maximum rate of change of the target value in its local neighborhood.
Usage: You can use \\(G_{i}^{max}\\) to identify and filter areas in the feature space that have high target gradients, which may indicate potential issues with data quality or feature representation.
Visualization: Plotting the distribution of \\(G_{i}^{max}\\) values or visualizing them in the context of the feature space can help you identify regions or specific points that warrant further investigation.
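As a rough illustration of the two formulas above, the sketch below computes the maximum target gradient for every point by brute force over all pairs. It assumes Euclidean distance for the distance function and uses a small eps (a hypothetical guard, not taken from the SageWorks code) so that rows with identical features do not cause a division by zero. The function name max_target_gradients is illustrative and is not part of the SageWorks API.

```python
import math


def max_target_gradients(X, y, eps=1e-12):
    # X: list of feature vectors, y: list of target values (same length)
    n = len(X)
    result = []
    for i in range(n):
        best = 0.0
        for j in range(n):
            if i == j:
                continue  # the max is taken over j != i
            # d(x_i, x_j): Euclidean distance in feature space (an assumption)
            distance = math.dist(X[i], X[j])
            # G_ij = |y_i - y_j| / d(x_i, x_j), guarded against zero distance
            gradient = abs(y[i] - y[j]) / max(distance, eps)
            best = max(best, gradient)
        result.append(best)
    return result


features = [[0.0], [1.0], [3.0]]
targets = [0.0, 2.0, 3.0]
print(max_target_gradients(features, targets))
```

This version is quadratic in the number of rows. For larger datasets a nearest-neighbor search that only compares each point with its local neighborhood would likely be a better fit.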
Amazon SageMaker Model Monitor currently provides the following types of monitoring:
Overview and Definition: Residual analysis involves examining the differences between observed and predicted values, known as residuals, to assess the performance of a predictive model. It is a critical step in model evaluation as it helps identify patterns of errors, diagnose potential problems, and improve model performance. By understanding where and why a model's predictions deviate from actual values, we can make informed adjustments to the model or the data to enhance accuracy and robustness.
Sparse Data Regions: The observation is in a part of feature space with few or no nearby training observations, leading to poor generalization in these regions and resulting in high prediction errors.
Noisy/Inconsistent Data and Preprocessing Issues: The observation is in a part of feature space where the training data is noisy, incorrect, or has high variance in the target variable. Additionally, missing values or incorrect data transformations can introduce errors, leading to unreliable predictions and high residuals.
Feature Resolution: The current feature set may not fully resolve the compounds, leading to \u2018collisions\u2019 where different compounds are assigned identical features. These collisions cause high residuals because the features do not capture the structural or chemical differences between the compounds.
Activity Cliffs: Structurally similar compounds exhibit significantly different activities, making accurate prediction challenging due to steep changes in activity with minor structural modifications.
Feature Engineering Issues: Irrelevant or redundant features and poor feature scaling can negatively impact the model's performance and accuracy, resulting in higher residuals.
Model Overfitting or Underfitting: Overfitting occurs when the model is too complex and captures noise, while underfitting happens when the model is too simple and misses underlying patterns. Both lead to inaccurate predictions.
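A first step in any of the diagnoses above is simply computing the residuals and pulling out the observations with the largest errors. The sketch below shows one minimal way to do that. The function names and the threshold value are illustrative assumptions, not part of the SageWorks API.

```python
def residuals(observed, predicted):
    # Residual = observed value minus predicted value, one per observation
    return [obs - pred for obs, pred in zip(observed, predicted)]


def high_residual_indices(observed, predicted, threshold):
    # Indices of observations whose absolute residual exceeds the threshold
    return [
        index
        for index, residual in enumerate(residuals(observed, predicted))
        if abs(residual) > threshold
    ]


observed = [10.0, 9.0, 16.0]
predicted = [9.5, 9.0, 10.5]
print(residuals(observed, predicted))
print(high_residual_indices(observed, predicted, threshold=2.0))
```

The flagged rows can then be inspected against the causes listed above, for example by checking how many training neighbors each one has.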
"},{"location":"cached/cached_data_source/","title":"CachedDataSource","text":"Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedDataSource: Caches the method results for SageWorks DataSources
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource","title":"CachedDataSource
","text":" Bases: CachedArtifactMixin
, AthenaSource
CachedDataSource: Caches the method results for SageWorks DataSources
Note: Cached method values may lag underlying DataSource changes.
Common Usage
my_data = CachedDataSource(name)\nmy_data.details()\nmy_data.health_check()\nmy_data.sageworks_meta()\n
Source code in src/sageworks/cached/cached_data_source.py
class CachedDataSource(CachedArtifactMixin, AthenaSource):\n \"\"\"CachedDataSource: Caches the method results for SageWorks DataSources\n\n Note: Cached method values may lag underlying DataSource changes.\n\n Common Usage:\n ```python\n my_data = CachedDataSource(name)\n my_data.details()\n my_data.health_check()\n my_data.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedDataSource Initialization\"\"\"\n AthenaSource.__init__(self, data_uuid=data_uuid, database=database, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Health Check.\n\n Returns:\n dict: A dictionary of health check details for the DataSource\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this DataSource.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.__init__","title":"__init__(data_uuid, database='sageworks')
","text":"CachedDataSource Initialization
Source code in src/sageworks/cached/cached_data_source.py
def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedDataSource Initialization\"\"\"\n AthenaSource.__init__(self, data_uuid=data_uuid, database=database, use_cached_meta=True)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.details","title":"details(**kwargs)
","text":"Retrieve the DataSource Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.health_check","title":"health_check(**kwargs)
","text":"Retrieve the DataSource Health Check.
Returns:
Name | Type | Description
dict | dict | A dictionary of health check details for the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Health Check.\n\n Returns:\n dict: A dictionary of health check details for the DataSource\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the SageWorks Metadata for this DataSource.
Returns:
Type | Description
Union[dict, None] | Dictionary of SageWorks metadata for this Artifact
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.smart_sample","title":"smart_sample()
","text":"Retrieve the Smart Sample for this DataSource.
Returns:
Type | Description
DataFrame | The Smart Sample DataFrame
Source code in src/sageworks/cached/cached_data_source.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this DataSource.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.summary","title":"summary(**kwargs)
","text":"Retrieve the DataSource Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pull DataSource Details
from sageworks.cached.cached_data_source import CachedDataSource\n\n# Grab a DataSource\nds = CachedDataSource(\"abalone_data\")\n\n# Show the details\nds.details()\n\n> ds.details()\n\n{'uuid': 'abalone_data',\n 'health_tags': [],\n 'aws_arn': 'arn:aws:glue:x:table/sageworks/abalone_data',\n 'size': 0.070272,\n 'created': '2024-11-09T20:42:34.000Z',\n 'modified': '2024-11-10T19:57:52.000Z',\n 'input': 's3://sageworks-public-data/common/aBaLone.CSV',\n 'sageworks_health_tags': '',\n 'sageworks_correlations': {'length': {'diameter': 0.9868115846024996,\n
"},{"location":"cached/cached_endpoint/","title":"CachedEndpoint","text":"Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedEndpoint: Caches the method results for SageWorks Endpoints
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint","title":"CachedEndpoint
","text":" Bases: CachedArtifactMixin
, EndpointCore
CachedEndpoint: Caches the method results for SageWorks Endpoints
Note: Cached method values may lag underlying Endpoint changes.
Common Usage
my_endpoint = CachedEndpoint(name)\nmy_endpoint.details()\nmy_endpoint.health_check()\nmy_endpoint.sageworks_meta()\n
Source code in src/sageworks/cached/cached_endpoint.py
class CachedEndpoint(CachedArtifactMixin, EndpointCore):\n \"\"\"CachedEndpoint: Caches the method results for SageWorks Endpoints\n\n Note: Cached method values may lag underlying Endpoint changes.\n\n Common Usage:\n ```python\n my_endpoint = CachedEndpoint(name)\n my_endpoint.details()\n my_endpoint.health_check()\n my_endpoint.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid: str):\n \"\"\"CachedEndpoint Initialization\"\"\"\n EndpointCore.__init__(self, endpoint_uuid=endpoint_uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedEndpoint\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def endpoint_metrics(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Metrics\n\n Returns:\n str: The Endpoint Metrics\n \"\"\"\n return super().endpoint_metrics()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.__init__","title":"__init__(endpoint_uuid)
","text":"CachedEndpoint Initialization
Source code in src/sageworks/cached/cached_endpoint.py
def __init__(self, endpoint_uuid: str):\n \"\"\"CachedEndpoint Initialization\"\"\"\n EndpointCore.__init__(self, endpoint_uuid=endpoint_uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.details","title":"details(**kwargs)
","text":"Retrieve the CachedEndpoint Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.endpoint_metrics","title":"endpoint_metrics()
","text":"Retrieve the Endpoint Metrics
Returns:
Name | Type | Description
str | Union[str, None] | The Endpoint Metrics
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef endpoint_metrics(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Metrics\n\n Returns:\n str: The Endpoint Metrics\n \"\"\"\n return super().endpoint_metrics()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.health_check","title":"health_check(**kwargs)
","text":"Retrieve the CachedEndpoint Health Check.
Returns:
Name | Type | Description
dict | dict | A dictionary of health check details for the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedEndpoint\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFIER, etc).
Returns:
Name | Type | Description
str | Union[str, None] | The Enumerated Model Type
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFIER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.summary","title":"summary(**kwargs)
","text":"Retrieve the CachedEndpoint Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_endpoint/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Get Endpoint Details
from sageworks.cached.cached_endpoint import CachedEndpoint\n\n# Grab an Endpoint\nend = CachedEndpoint(\"abalone-regression\")\n\n# Get the Details\nend.details()\n\n{'uuid': 'abalone-regression-end',\n 'health_tags': [],\n 'status': 'InService',\n 'instance': 'Serverless (2GB/5)',\n 'instance_count': '-',\n 'variant': 'AllTraffic',\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'model_metrics': RMSE R2 MAPE MedAE NumRows\n 1.64 2.246 0.502 16.393 1.209 834,\n 'confusion_matrix': None,\n 'predictions': class_number_of_rings prediction id\n 0 16 10.516158 7\n 1 9 9.031365 8\n 2 10 9.264600 17\n 3 7 8.578638 18\n 4 12 10.492446 27\n .. ... ... ...\n 829 11 11.915862 4148\n 830 8 8.210898 4157\n 831 8 7.693689 4158\n 832 9 7.542521 4167\n 833 8 9.060015 4168\n
"},{"location":"cached/cached_feature_set/","title":"CachedFeatureSet","text":"Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedFeatureSet: Caches the method results for SageWorks FeatureSets
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet","title":"CachedFeatureSet
","text":" Bases: CachedArtifactMixin
, FeatureSetCore
CachedFeatureSet: Caches the method results for SageWorks FeatureSets
Note: Cached method values may lag underlying FeatureSet changes.
Common Usage
my_features = CachedFeatureSet(name)\nmy_features.details()\nmy_features.health_check()\nmy_features.sageworks_meta()\n
Source code in src/sageworks/cached/cached_feature_set.py
class CachedFeatureSet(CachedArtifactMixin, FeatureSetCore):\n \"\"\"CachedFeatureSet: Caches the method results for SageWorks FeatureSets\n\n Note: Cached method values may lag underlying FeatureSet changes.\n\n Common Usage:\n ```python\n my_features = CachedFeatureSet(name)\n my_features.details()\n my_features.health_check()\n my_features.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedFeatureSet Initialization\"\"\"\n FeatureSetCore.__init__(self, feature_set_uuid=feature_set_uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Health Check.\n\n Returns:\n dict: A dictionary of health check details for the FeatureSet\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this FeatureSet.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.__init__","title":"__init__(feature_set_uuid, database='sageworks')
","text":"CachedFeatureSet Initialization
Source code in src/sageworks/cached/cached_feature_set.py
def __init__(self, feature_set_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedFeatureSet Initialization\"\"\"\n FeatureSetCore.__init__(self, feature_set_uuid=feature_set_uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.details","title":"details(**kwargs)
","text":"Retrieve the FeatureSet Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.health_check","title":"health_check(**kwargs)
","text":"Retrieve the FeatureSet Health Check.
Returns:
Name | Type | Description
dict | dict | A dictionary of health check details for the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Health Check.\n\n Returns:\n dict: A dictionary of health check details for the FeatureSet\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the SageWorks Metadata for this DataSource.
Returns:
Type | Description
Union[str, None] | Dictionary of SageWorks metadata for this Artifact
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.smart_sample","title":"smart_sample()
","text":"Retrieve the Smart Sample for this FeatureSet.
Returns:
Type | Description
DataFrame | The Smart Sample DataFrame
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this FeatureSet.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.summary","title":"summary(**kwargs)
","text":"Retrieve the FeatureSet Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pull FeatureSet Details
from sageworks.cached.cached_feature_set import CachedFeatureSet\n\n# Grab a FeatureSet\nfs = CachedFeatureSet(\"abalone_features\")\n\n# Show the details\nfs.details()\n\n> fs.details()\n\n{'uuid': 'abalone_features',\n 'health_tags': [],\n 'aws_arn': 'arn:aws:glue:x:table/sageworks/abalone_data',\n 'size': 0.070272,\n 'created': '2024-11-09T20:42:34.000Z',\n 'modified': '2024-11-10T19:57:52.000Z',\n 'input': 's3://sageworks-public-data/common/aBaLone.CSV',\n 'sageworks_health_tags': '',\n 'sageworks_correlations': {'length': {'diameter': 0.9868115846024996,\n
"},{"location":"cached/cached_meta/","title":"CachedMeta","text":"CachedMeta Examples
Examples of using the CachedMeta class are listed in the Examples section at the bottom of this page.
CachedMeta: A class that provides caching for the Meta() class
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta","title":"CachedMeta
","text":" Bases: CloudMeta
CachedMeta: Singleton class for caching metadata functionality.
Common Usage
from sageworks.cached.cached_meta import CachedMeta\nmeta = CachedMeta()\n\n# Get the AWS Account Info\nmeta.account()\nmeta.config()\n\n# These are 'list' methods\nmeta.etl_jobs()\nmeta.data_sources()\nmeta.feature_sets(details=True)  # or details=False\nmeta.models(details=True)  # or details=False\nmeta.endpoints()\nmeta.views()\n\n# These are 'describe' methods\nmeta.data_source(\"abalone_data\")\nmeta.feature_set(\"abalone_features\")\nmeta.model(\"abalone-regression\")\nmeta.endpoint(\"abalone-endpoint\")\n
Source code in src/sageworks/cached/cached_meta.py
class CachedMeta(CloudMeta):\n \"\"\"CachedMeta: Singleton class for caching metadata functionality.\n\n Common Usage:\n ```python\n from sageworks.cached.cached_meta import CachedMeta\n meta = CachedMeta()\n\n # Get the AWS Account Info\n meta.account()\n meta.config()\n\n # These are 'list' methods\n meta.etl_jobs()\n meta.data_sources()\n meta.feature_sets(details=True/False)\n meta.models(details=True/False)\n meta.endpoints()\n meta.views()\n\n # These are 'describe' methods\n meta.data_source(\"abalone_data\")\n meta.feature_set(\"abalone_features\")\n meta.model(\"abalone-regression\")\n meta.endpoint(\"abalone-endpoint\")\n ```\n \"\"\"\n\n _instance = None # Class attribute to hold the singleton instance\n\n def __new__(cls, *args, **kwargs):\n if cls._instance is None:\n cls._instance = super(CachedMeta, cls).__new__(cls)\n return cls._instance\n\n def __init__(self):\n \"\"\"CachedMeta Initialization\"\"\"\n if hasattr(self, \"_initialized\") and self._initialized:\n return # Prevent reinitialization\n\n self.log = logging.getLogger(\"sageworks\")\n self.log.important(\"Initializing CachedMeta...\")\n super().__init__()\n\n # Create both our Meta Cache and Fresh Cache (tracks if data is stale)\n self.meta_cache = SageWorksCache(prefix=\"meta\")\n self.fresh_cache = SageWorksCache(prefix=\"meta_fresh\", expire=90) # 90-second expiration\n\n # Create a ThreadPoolExecutor for refreshing stale data\n self.thread_pool = ThreadPoolExecutor(max_workers=5)\n\n # Mark the instance as initialized\n self._initialized = True\n\n def check(self):\n \"\"\"Check if our underlying caches are working\"\"\"\n return self.meta_cache.check()\n\n def list_meta_cache(self):\n \"\"\"List the current Meta Cache\"\"\"\n return self.meta_cache.list_keys()\n\n def clear_meta_cache(self):\n \"\"\"Clear the current Meta Cache\"\"\"\n self.meta_cache.clear()\n\n @cache_result\n def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account 
Info\n \"\"\"\n return super().account()\n\n @cache_result\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n\n @cache_result\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n\n @cache_result\n def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n\n @cache_result\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n\n @cache_result\n def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n\n @cache_result\n def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n\n @cache_result\n def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n\n @cache_result\n def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n\n @cache_result\n def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n\n @cache_result\n def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(data_source_name=data_source_name, database=database)\n\n @cache_result\n def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_set_name=feature_set_name)\n\n @cache_result\n def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_name=model_name)\n\n @cache_result\n def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n 
Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n\n def _refresh_data_in_background(self, cache_key, method, *args, **kwargs):\n \"\"\"Background task to refresh AWS metadata.\"\"\"\n result = method(self, *args, **kwargs)\n self.meta_cache.set(cache_key, result)\n self.log.debug(f\"Updated Metadata for {cache_key}\")\n\n @staticmethod\n def _flatten_redis_key(method, *args, **kwargs):\n \"\"\"Flatten the args and kwargs into a single string\"\"\"\n arg_str = \"_\".join(str(arg) for arg in args)\n kwarg_str = \"_\".join(f\"{k}_{v}\" for k, v in sorted(kwargs.items()))\n return f\"{method.__name__}_{arg_str}_{kwarg_str}\".replace(\" \", \"\").replace(\"'\", \"\")\n\n def __del__(self):\n \"\"\"Destructor to shut down the thread pool gracefully.\"\"\"\n self.close()\n\n def close(self):\n \"\"\"Explicitly close the thread pool, if needed.\"\"\"\n if self.thread_pool:\n self.log.important(\"Shutting down the ThreadPoolExecutor...\")\n try:\n self.thread_pool.shutdown(wait=True) # Gracefully shutdown\n except RuntimeError as e:\n self.log.error(f\"Error during thread pool shutdown: {e}\")\n finally:\n self.thread_pool = None\n\n def __repr__(self):\n return f\"CachedMeta()\\n\\t{repr(self.meta_cache)}\\n\\t{super().__repr__()}\"\n
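To make the cache-key scheme in _flatten_redis_key concrete, here is a minimal standalone sketch of the same flattening logic. The name flatten_key and the sample calls are illustrative assumptions; unlike the method above, it takes the method name as a string instead of the method object.

```python
def flatten_key(method_name, *args, **kwargs):
    # Join positional args, then keyword args in sorted order
    arg_str = '_'.join(str(arg) for arg in args)
    kwarg_str = '_'.join(f'{k}_{v}' for k, v in sorted(kwargs.items()))
    key = f'{method_name}_{arg_str}_{kwarg_str}'
    # Strip spaces and single quotes (chr(39)) so the key stays compact
    return key.replace(' ', '').replace(chr(39), '')


print(flatten_key('models'))
print(flatten_key('models', details=False))
print(flatten_key('data_source', 'abc', database='sageworks'))
```

Under this scheme a call that passes a default explicitly, such as models(details=False), produces a different key than models(), so the two results are cached separately.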
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.__del__","title":"__del__()
","text":"Destructor to shut down the thread pool gracefully.
Source code in src/sageworks/cached/cached_meta.py
def __del__(self):\n \"\"\"Destructor to shut down the thread pool gracefully.\"\"\"\n self.close()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.__init__","title":"__init__()
","text":"CachedMeta Initialization
Source code in src/sageworks/cached/cached_meta.py
def __init__(self):\n \"\"\"CachedMeta Initialization\"\"\"\n if hasattr(self, \"_initialized\") and self._initialized:\n return # Prevent reinitialization\n\n self.log = logging.getLogger(\"sageworks\")\n self.log.important(\"Initializing CachedMeta...\")\n super().__init__()\n\n # Create both our Meta Cache and Fresh Cache (tracks if data is stale)\n self.meta_cache = SageWorksCache(prefix=\"meta\")\n self.fresh_cache = SageWorksCache(prefix=\"meta_fresh\", expire=90) # 90-second expiration\n\n # Create a ThreadPoolExecutor for refreshing stale data\n self.thread_pool = ThreadPoolExecutor(max_workers=5)\n\n # Mark the instance as initialized\n self._initialized = True\n
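The _initialized guard above only matters if the same instance is handed out more than once. The listing does not show how that happens, so the __new__ method in this sketch is an assumption, and SingletonDemo and init_count are hypothetical names used only to show the guard at work.

```python
class SingletonDemo:
    _instance = None  # Shared instance (assumed pattern, not shown in the listing)

    def __new__(cls):
        # Create the instance once, then keep returning the same object
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonDemo() call, so guard the setup
        if getattr(self, '_initialized', False):
            return  # Prevent reinitialization
        self.init_count = getattr(self, 'init_count', 0) + 1
        self._initialized = True


first = SingletonDemo()
second = SingletonDemo()
print(first is second, first.init_count)
```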
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.account","title":"account()
","text":"Cloud Platform Account Info
Returns:
Name | Type | Description
dict | dict | Cloud Platform Account Info
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.check","title":"check()
","text":"Check if our underlying caches are working
Source code in src/sageworks/cached/cached_meta.py
def check(self):\n \"\"\"Check if our underlying caches are working\"\"\"\n return self.meta_cache.check()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.clear_meta_cache","title":"clear_meta_cache()
","text":"Clear the current Meta Cache
Source code in src/sageworks/cached/cached_meta.py
def clear_meta_cache(self):\n \"\"\"Clear the current Meta Cache\"\"\"\n self.meta_cache.clear()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.close","title":"close()
","text":"Explicitly close the thread pool, if needed.
Source code in src/sageworks/cached/cached_meta.py
def close(self):\n \"\"\"Explicitly close the thread pool, if needed.\"\"\"\n if self.thread_pool:\n self.log.important(\"Shutting down the ThreadPoolExecutor...\")\n try:\n self.thread_pool.shutdown(wait=True) # Gracefully shutdown\n except RuntimeError as e:\n self.log.error(f\"Error during thread pool shutdown: {e}\")\n finally:\n self.thread_pool = None\n
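As a rough illustration of why close() sets thread_pool to None after shutdown, the sketch below shows that the method can then be called repeatedly without error. PoolOwner and its context-manager methods are hypothetical additions for this example and are not part of CachedMeta.

```python
from concurrent.futures import ThreadPoolExecutor


class PoolOwner:
    # Hypothetical stand-in for a class that owns a thread pool

    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=2)

    def close(self):
        # Safe to call more than once: the pool is dropped after shutdown
        if self.thread_pool:
            try:
                self.thread_pool.shutdown(wait=True)
            finally:
                self.thread_pool = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # Do not swallow exceptions


with PoolOwner() as owner:
    future = owner.thread_pool.submit(sum, [1, 2, 3])
    total = future.result()

owner.close()  # Second call is a no-op
```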
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
Name | Type | Description
dict | dict | The current SageWorks Configuration
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.data_source","title":"data_source(data_source_name, database='sageworks')
","text":"Get the details of a specific Data Source
Parameters:
Name | Type | Description | Default
data_source_name | str | The name of the Data Source | required
database | str | The Glue database. Defaults to 'sageworks'. | 'sageworks'
Returns:
Name | Type | Description
dict | Union[dict, None] | The details of the Data Source (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(data_source_name=data_source_name, database=database)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources deployed in the Cloud Platform
Returns:
Type | Description
DataFrame | A summary of the Data Sources deployed in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.endpoint","title":"endpoint(endpoint_name)
","text":"Get the details of a specific Endpoint
Parameters:
Name | Type | Description | Default
endpoint_name | str | The name of the Endpoint | required
Returns:
Name | Type | Description
dict | Union[dict, None] | The details of the Endpoint (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.endpoints","title":"endpoints()
","text":"Get a summary of the Endpoints deployed in the Cloud Platform
Returns:
Type | Description
DataFrame | A summary of the Endpoints in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.etl_jobs","title":"etl_jobs()
","text":"Get summary data about Extract, Transform, Load (ETL) Jobs
Returns:
Type | Description
DataFrame | A summary of the ETL Jobs deployed in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.feature_set","title":"feature_set(feature_set_name)
","text":"Get the details of a specific Feature Set
Parameters:
Name | Type | Description | Default
feature_set_name | str | The name of the Feature Set | required
Returns:
Name | Type | Description
dict | Union[dict, None] | The details of the Feature Set (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_set_name=feature_set_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.feature_sets","title":"feature_sets(details=False)
","text":"Get a summary of the Feature Sets deployed in the Cloud Platform
Parameters:
Name | Type | Description | Default
details | bool | Include detailed information. Defaults to False. | False
Returns:
Type | Description
DataFrame | A summary of the Feature Sets deployed in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.glue_job","title":"glue_job(job_name)
","text":"Get the details of a specific Glue Job
Parameters:
Name | Type | Description | Default
job_name | str | The name of the Glue Job | required
Returns:
Name | Type | Description
dict | Union[dict, None] | The details of the Glue Job (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming raw data
Returns:
Type | Description
DataFrame | A summary of the incoming raw data
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.list_meta_cache","title":"list_meta_cache()
","text":"List the current Meta Cache
Source code in src/sageworks/cached/cached_meta.py
def list_meta_cache(self):\n \"\"\"List the current Meta Cache\"\"\"\n return self.meta_cache.list_keys()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.model","title":"model(model_name)
","text":"Get the details of a specific Model
Parameters:
Name | Type | Description | Default
model_name | str | The name of the Model | required
Returns:
Name | Type | Description
dict | Union[dict, None] | The details of the Model (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_name=model_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.models","title":"models(details=False)
","text":"Get a summary of the Models deployed in the Cloud Platform
Parameters:
Name | Type | Description | Default
details | bool | Include detailed information. Defaults to False. | False
Returns:
Type | Description
DataFrame | A summary of the Models deployed in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.views","title":"views(database='sageworks')
","text":"Get a summary of the all the Views, for the given database, in AWS
Parameters:
Name | Type | Description | Default
database | str | Glue database. Defaults to 'sageworks'. | 'sageworks'
Returns:
Type | Description
DataFrame | A summary of all the Views, for the given database, in AWS
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.cache_result","title":"cache_result(method)
","text":"Decorator to cache method results in meta_cache
Source code in src/sageworks/cached/cached_meta.py
def cache_result(method):\n \"\"\"Decorator to cache method results in meta_cache\"\"\"\n\n @wraps(method)\n def wrapper(self, *args, **kwargs):\n # Create a unique cache key based on the method name and arguments\n cache_key = CachedMeta._flatten_redis_key(method, *args, **kwargs)\n\n # Check for fresh data, spawn thread to refresh if stale\n if SageWorksCache.refresh_enabled and self.fresh_cache.get(cache_key) is None:\n self.log.debug(f\"Async: Metadata for {cache_key} refresh thread started...\")\n self.fresh_cache.set(cache_key, True) # Mark as refreshed\n\n # Spawn a thread to refresh data without blocking\n self.thread_pool.submit(self._refresh_data_in_background, cache_key, method, *args, **kwargs)\n\n # Return data (fresh or stale) if available\n cached_value = self.meta_cache.get(cache_key)\n if cached_value is not None:\n return cached_value\n\n # Fall back to calling the method if no cached data found\n self.log.important(f\"Blocking: Getting Metadata for {cache_key}\")\n result = method(self, *args, **kwargs)\n self.meta_cache.set(cache_key, result)\n return result\n\n return wrapper\n
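The decorator above follows a serve-stale-then-refresh pattern: return whatever is cached right away and refresh stale entries on a background thread. In the listing, a completely cold cache both submits a background refresh and then blocks on the same call. The sketch below is one possible ordering that checks the value cache first so a cold key is fetched only once. TTLCache, Demo, pool, and _refresh are hypothetical stand-ins, not the SageWorks classes, and the key format is simplified.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


class TTLCache:
    # Tiny in-memory stand-in for a cache with optional expiration (hypothetical)
    def __init__(self, expire=None):
        self.expire = expire
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stamp = entry
        if self.expire is not None and time.monotonic() - stamp > self.expire:
            return None  # Entry has expired
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic())


def cache_result(method):
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        key = f'{method.__name__}_{args}_{sorted(kwargs.items())}'
        cached = self.meta_cache.get(key)
        if cached is None:
            # Cold cache: block once, store the value, and mark it fresh
            result = method(self, *args, **kwargs)
            self.meta_cache.set(key, result)
            self.fresh_cache.set(key, True)
            return result
        if self.fresh_cache.get(key) is None:
            # Stale: mark as refreshing, then refresh off the calling thread
            self.fresh_cache.set(key, True)
            self.pool.submit(self._refresh, key, method, *args, **kwargs)
        return cached  # Fresh or stale, returned without blocking

    return wrapper


class Demo:
    def __init__(self):
        self.meta_cache = TTLCache()  # Values never expire
        self.fresh_cache = TTLCache(expire=0.2)  # Freshness markers expire
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.calls = 0

    def _refresh(self, key, method, *args, **kwargs):
        self.meta_cache.set(key, method(self, *args, **kwargs))

    @cache_result
    def models(self):
        self.calls += 1  # Counts how often the slow call really runs
        return f'models-v{self.calls}'
```

The trade-off is the same as in the decorator above: callers may briefly see stale data, in exchange for never waiting on a refresh once a value has been cached.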
"},{"location":"cached/cached_meta/#examples","title":"Examples","text":"These example show how to use the CachedMeta()
class to pull lists of artifacts from AWS. DataSources, FeatureSets, Models, Endpoints and more. If you're building a web interface plugin, the CachedMeta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the CachedMeta()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
meta = CachedMeta()\nmodel_df = meta.models()\nmodel_df\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
from pprint import pprint\n\nfrom sageworks.cached.cached_meta import CachedMeta\n\n# Create a CachedMeta instance and get a summary of our Models\nmeta = CachedMeta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get detailed data on each Model\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names:\n    pprint(meta.model(name))\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
Getting Model Performance Metrics
from sageworks.cached.cached_meta import CachedMeta\n\n# Create a CachedMeta instance and get a summary of our Models\nmeta = CachedMeta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get detailed data on the first few Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names[:5]:\n    model_details = meta.model(name)\n    print(f\"\\n\\nModel: {name}\")\n    performance_metrics = model_details[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n    print(f\"\\tPerformance Metrics: {performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
List the Endpoints in AWS
from pprint import pprint\n\nfrom sageworks.cached.cached_meta import CachedMeta\n\n# Create a CachedMeta instance and get a summary of our Endpoints\nmeta = CachedMeta()\nendpoint_df = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoint_df)}\")\nprint(endpoint_df)\n\n# Get detailed data on each Endpoint\nendpoint_names = endpoint_df[\"Name\"].tolist()\nfor name in endpoint_names:\n    pprint(meta.endpoint(name))\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\n<lots of details about endpoints>\n
Not Finding some particular AWS Data?
Some CachedMeta methods, such as models() and feature_sets(), also accept a details=True
argument that includes more detailed information, so make sure to check it out.
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedModel: Caches the method results for SageWorks Models
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel","title":"CachedModel
","text":" Bases: CachedArtifactMixin
, ModelCore
CachedModel: Caches the method results for SageWorks Models
Note: Cached method values may lag underlying Model changes.
Common Usage
my_model = CachedModel(name)\nmy_model.details()\nmy_model.health_check()\nmy_model.sageworks_meta()\n
Source code in src/sageworks/cached/cached_model.py
class CachedModel(CachedArtifactMixin, ModelCore):\n \"\"\"CachedModel: Caches the method results for SageWorks Models\n\n Note: Cached method values may lag underlying Model changes.\n\n Common Usage:\n ```python\n my_model = CachedModel(name)\n my_model.details()\n my_model.health_check()\n my_model.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, uuid: str):\n \"\"\"CachedModel Initialization\"\"\"\n ModelCore.__init__(self, model_uuid=uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedModel\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFIER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Inference Path.\n\n Returns:\n str: The Endpoint Inference Path\n \"\"\"\n return super().get_endpoint_inference_path()\n\n @CachedArtifactMixin.cache_result\n def list_inference_runs(self) -> list[str]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Returns:\n list[str]: List of Inference Runs\n \"\"\"\n return super().list_inference_runs()\n\n @CachedArtifactMixin.cache_result\n def 
get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Metrics (might be None)\n \"\"\"\n return super().get_inference_metrics(capture_uuid=capture_uuid)\n\n @CachedArtifactMixin.cache_result\n def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n # Note: This method can generate larger dataframes, so we'll sample if needed\n df = super().get_inference_predictions(capture_uuid=capture_uuid)\n if df is not None and len(df) > 5000:\n self.log.warning(f\"{self.uuid}:{capture_uuid} Sampling Inference Predictions to 5000 rows\")\n return df.sample(5000)\n return df\n\n @CachedArtifactMixin.cache_result\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion matrix for the model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n return super().confusion_matrix(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.__init__","title":"__init__(uuid)
","text":"CachedModel Initialization
Source code in src/sageworks/cached/cached_model.py
def __init__(self, uuid: str):\n \"\"\"CachedModel Initialization\"\"\"\n ModelCore.__init__(self, model_uuid=uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion matrix for the model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid (default: latest) | 'latest'
Returns:
Type | Description
Union[DataFrame, None] | DataFrame of the Confusion Matrix (might be None)
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion matrix for the model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n return super().confusion_matrix(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.details","title":"details(**kwargs)
","text":"Retrieve the CachedModel Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the CachedModel
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Retrieve the Endpoint Inference Path.
Returns:
Name | Type | Description
str | Union[str, None] | The Endpoint Inference Path
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Inference Path.\n\n Returns:\n str: The Endpoint Inference Path\n \"\"\"\n return super().get_endpoint_inference_path()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid (default: latest) | 'latest'
Returns:
Type | Description
Union[DataFrame, None] | DataFrame of the Captured Metrics (might be None)
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Metrics (might be None)\n \"\"\"\n return super().get_inference_metrics(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_inference_predictions","title":"get_inference_predictions(capture_uuid='auto_inference')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid (default: auto_inference) | 'auto_inference'
Returns:
Type | Description
Union[DataFrame, None] | DataFrame of the Captured Predictions (might be None)
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n # Note: This method can generate larger dataframes, so we'll sample if needed\n df = super().get_inference_predictions(capture_uuid=capture_uuid)\n if df is not None and len(df) > 5000:\n self.log.warning(f\"{self.uuid}:{capture_uuid} Sampling Inference Predictions to 5000 rows\")\n return df.sample(5000)\n return df\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.health_check","title":"health_check(**kwargs)
","text":"Retrieve the CachedModel Health Check.
Returns:
Name | Type | Description
dict | dict | A dictionary of health check details for the CachedModel
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedModel\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.list_inference_runs","title":"list_inference_runs()
","text":"Retrieve the captured prediction results for this model
Returns:
Type | Description
list[str] | List of Inference Runs
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef list_inference_runs(self) -> list[str]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Returns:\n list[str]: List of Inference Runs\n \"\"\"\n return super().list_inference_runs()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFER, etc).
Returns:
Name | Type | Description
str | Union[str, None] | The Enumerated Model Type
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFIER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.summary","title":"summary(**kwargs)
","text":"Retrieve the CachedModel Details.
Returns:
Name | Type | Description
dict | dict | A dictionary of details about the CachedModel
Source code in src/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pull Inference Run
from sageworks.cached.cached_model import CachedModel\n\n# Grab a Model\nmodel = CachedModel(\"abalone-regression\")\n\n# List the inference runs\nmodel.list_inference_runs()\n['auto_inference', 'model_training']\n\n# Grab specific inference results\nmodel.get_inference_predictions(\"auto_inference\")\n class_number_of_rings prediction id\n0 16 10.516158 7\n1 9 9.031365 8\n.. ... ... ...\n831 8 7.693689 4158\n832 9 7.542521 4167\n
"},{"location":"cached/overview/","title":"Caching Overview","text":"Caching is Crazy
Yes, but it's a necessary evil for Web Interfaces. AWS APIs (boto3, SageMaker) often take multiple seconds to respond and will often throttle requests if spammed. So for quicker responses and less spamming, we use Cached Classes for any Web Interface work.
"},{"location":"cached/overview/#welcome-to-the-sageworks-cached-classes","title":"Welcome to the SageWorks Cached Classes","text":"These classes provide caching for the for the most used SageWorks classes. They transparently handle all the details around retrieving and caching results from the underlying classes.
Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord.
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines. As part of this we're including CloudWatch log forwarding/aggregation for any service using the SageWorks API (Dashboard, Glue, Lambda, Notebook, Laptop, etc).
"},{"location":"cloudwatch/#log-groups-and-streams","title":"Log Groups and Streams","text":"The SageWorks logging setup includes the addition of a CloudWatch 'Handler' that forwards all log messages to the SageWorksLogGroup
Individual Streams
Each process running SageWorks will get a unique individual stream.
Since many jobs are run nightly/often, the stream will also have a date on the end... glue/my_job/2024_08_01_17_15
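The naming scheme above can be sketched in a few lines of Python. This is only an illustration: stream_name is a hypothetical helper, not a SageWorks function, and both the timestamp format and the use of UTC are assumptions inferred from the example glue/my_job/2024_08_01_17_15.

```python
from datetime import datetime, timezone
from typing import Optional


def stream_name(base: str, when: Optional[datetime] = None) -> str:
    # Hypothetical helper (not part of SageWorks): append a
    # year_month_day_hour_minute suffix to a base stream name.
    # Defaulting to the current UTC time is an assumption.
    when = when or datetime.now(timezone.utc)
    return f'{base}/{when:%Y_%m_%d_%H_%M}'


print(stream_name('glue/my_job', datetime(2024, 8, 1, 17, 15)))
```

A suffix like this keeps each nightly run in its own stream while the shared prefix still groups runs of the same job.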
Logs in Easy Mode
The SageWorks cloud_watch
command line tool gives you access to important logs without the hassle. It automatically displays important events and the context around those events.
pip install sageworks\ncloud_watch\n
The cloud_watch
script will automatically show the interesting (WARNING and CRITICAL) messages from any source within the last hour. The script has many options; use --help
to see options and descriptions.
cloud_watch --help\n
Here are some example options:
# Show important logs from the last 12 hours (720 minutes)\ncloud_watch --start-time 720\n\n# Show a particular stream\ncloud_watch --stream glue/my_job\n\n# Show/search for a message substring\ncloud_watch --search SHAP\n\n# Show a log level (matching and above)\ncloud_watch --log-level WARNING\ncloud_watch --log-level ERROR\ncloud_watch --log-level CRITICAL\n\n# Or show all events\ncloud_watch --log-level ALL\n\n# Combine flags\ncloud_watch --log-level ERROR --search SHAP\ncloud_watch --log-level ERROR --stream Dashboard\n
These options can be combined, so try out the other options to build the perfect log search :)
"},{"location":"cloudwatch/#more-information","title":"More Information","text":"Check out our presentation on SageWorks CloudWatch
"},{"location":"cloudwatch/#access-through-aws-console","title":"Access through AWS Console","text":"Since we're leveraging AWS functionality you can always use the AWS console to look/investigate the logs. In the AWS console go to CloudWatch... Log Groups... SageWorksLogGroup
"},{"location":"cloudwatch/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/overview/","title":"Core Classes","text":"SageWorks Core Classes
These classes interact with many of the Cloud Platform services and are therefore more complex. They provide additional control and refinement over the AWS ML Pipeline. For most use cases the API Classes should be used.
Welcome to the SageWorks Core Classes
The Core Classes provide low-level APIs for the SageWorks package. These classes interface directly with the AWS SageMaker Pipeline interfaces and have a large number of methods with reasonable complexity.
The API Classes have method pass-through so just call the method on the API Class and voil\u00e0 it works the same.
"},{"location":"core_classes/overview/#artifacts","title":"Artifacts","text":"Transforms are a set of classes that transform one type of Artifact
to another type. For instance DataToFeatureSet
takes a DataSource
artifact and creates a FeatureSet
artifact.
API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on any class that inherits from the Artifact Class and voil\u00e0 it works the same.
The SageWorks Artifact class is a base/abstract class that defines the API implemented by all the child classes (DataSource, FeatureSet, Model, Endpoint).
Artifact: Abstract Base Class for all Artifact classes in SageWorks. Artifacts simply reflect and aggregate one or more AWS Services
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact","title":"Artifact
","text":" Bases: ABC
Artifact: Abstract Base Class for all Artifact classes in SageWorks
Source code in src/sageworks/core/artifacts/artifact.py
class Artifact(ABC):\n \"\"\"Artifact: Abstract Base Class for all Artifact classes in SageWorks\"\"\"\n\n # Class-level shared resources\n log = logging.getLogger(\"sageworks\")\n\n # Config Manager\n cm = ConfigManager()\n if not cm.config_okay():\n log = logging.getLogger(\"sageworks\")\n log.critical(\"SageWorks Configuration Incomplete...\")\n log.critical(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # AWS Account Clamp\n aws_account_clamp = AWSAccountClamp()\n boto3_session = aws_account_clamp.boto3_session\n sm_session = aws_account_clamp.sagemaker_session()\n sm_client = aws_account_clamp.sagemaker_client()\n aws_region = aws_account_clamp.region\n\n # Setup Bucket Paths\n sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n data_sources_s3_path = f\"s3://{sageworks_bucket}/data-sources\"\n feature_sets_s3_path = f\"s3://{sageworks_bucket}/feature-sets\"\n models_s3_path = f\"s3://{sageworks_bucket}/models\"\n endpoints_s3_path = f\"s3://{sageworks_bucket}/endpoints\"\n\n # Delimiter for storing lists in AWS Tags\n tag_delimiter = \"::\"\n\n # Grab our Dataframe Storage\n df_cache = DFStore(path_prefix=\"/sageworks/dataframe_cache\")\n\n def __init__(self, uuid: str, use_cached_meta: bool = False):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n use_cached_meta (bool): Should we use cached metadata? (default: False)\n \"\"\"\n self.uuid = uuid\n if use_cached_meta:\n self.log.info(f\"Using Cached Metadata for {self.uuid}\")\n self.meta = CachedMeta()\n else:\n self.meta = CloudMeta()\n\n def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? 
(very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity, which is fine\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n\n @classmethod\n def is_name_valid(cls, name: str, delimiter: str = \"_\", lower_case: bool = True) -> bool:\n \"\"\"Check if the name adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n bool: True if the name is valid, False otherwise.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)\n if name != valid_name:\n cls.log.warning(f\"Artifact name: '{name}' is not valid. Convert it to something like: '{valid_name}'\")\n return False\n return True\n\n @staticmethod\n def generate_valid_name(name: str, delimiter: str = \"_\", lower_case: bool = True) -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string.\n\n Args:\n name (str): The name/id string to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? 
(default: True)\n\n Returns:\n str: A generated valid name/id.\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"])\n if lower_case:\n valid_name = valid_name.lower()\n\n # Replace with the chosen delimiter\n return valid_name.replace(\"_\", delimiter).replace(\"-\", delimiter)\n\n @abstractmethod\n def exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources and Graphs, those classes need to override this method.\n \"\"\"\n return self.meta.get_aws_tags(self.arn())\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n\n @abstractmethod\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n\n def ready(self) -> bool:\n \"\"\"Is the Artifact ready? 
Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n\n @abstractmethod\n def onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n\n @abstractmethod\n def details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n\n @abstractmethod\n def size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n\n @abstractmethod\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n\n @abstractmethod\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n\n @abstractmethod\n def hash(self) -> str:\n \"\"\"Return the hash of this artifact, useful for content validation\"\"\"\n pass\n\n @abstractmethod\n def arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n\n @abstractmethod\n def delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not 
for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Adding Tags to {self.uuid}:{str(new_meta)[:50]}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n try:\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n except Exception as e:\n self.log.error(f\"Error adding metadata to {aws_arn}: {e}\")\n\n def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n\n def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n\n def set_tags(self, 
tags):\n self.upsert_sageworks_meta({\"sageworks_tags\": self.tag_delimiter.join(tags)})\n\n def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n # Syntactic sugar for health tags\n def get_health_tags(self):\n return self.get_tags(tag_type=\"health\")\n\n def set_health_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_health_tags\": self.tag_delimiter.join(tags)})\n\n def add_health_tag(self, tag):\n self.add_tag(tag, tag_type=\"health\")\n\n def remove_health_tag(self, tag):\n self.remove_sageworks_tag(tag, tag_type=\"health\")\n\n # Owner of this artifact\n def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n\n def set_owner(self, owner: str):\n \"\"\"Set the owner of this 
artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n\n def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n\n def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n\n def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n\n def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n # FIXME: Revisit AWS URL check\n # if \"unknown\" in self.aws_url():\n # health_issues.append(\"aws_url_unknown\")\n return health_issues\n\n def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. 
If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n\n def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n\n # If the artifact does not exist, return a message\n if not self.exists():\n return f\"{self.__class__.__name__}: {self.uuid} does not exist\"\n\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if 
tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__init__","title":"__init__(uuid, use_cached_meta=False)
","text":"Initialize the Artifact Base Class
Parameters:
Name | Type | Description | Default
uuid | str | The UUID of this artifact | required
use_cached_meta | bool | Should we use cached metadata? (default: False) | False
Source code in src/sageworks/core/artifacts/artifact.py
def __init__(self, uuid: str, use_cached_meta: bool = False):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n use_cached_meta (bool): Should we use cached metadata? (default: False)\n \"\"\"\n self.uuid = uuid\n if use_cached_meta:\n self.log.info(f\"Using Cached Metadata for {self.uuid}\")\n self.meta = CachedMeta()\n else:\n self.meta = CloudMeta()\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__post_init__","title":"__post_init__()
","text":"Artifact Post Initialization
Source code in src/sageworks/core/artifacts/artifact.py
def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? (very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity, which is fine\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__repr__","title":"__repr__()
","text":"String representation of this artifact
Returns:
Name | Type | Description
str | str | String representation of this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n\n # If the artifact does not exist, return a message\n if not self.exists():\n return f\"{self.__class__.__name__}: {self.uuid} does not exist\"\n\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.add_tag","title":"add_tag(tag, tag_type='user')
","text":"Add a tag for this artifact, ensuring no duplicates and maintaining order. Args: tag (str): Tag to add for this artifact tag_type (str): Type of tag to add (user or health)
Source code in src/sageworks/core/artifacts/artifact.py
def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
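The storage idea behind add_tag can be shown without AWS. The sketch below is a standalone illustration, not the SageWorks implementation: add_tag_to_string is a hypothetical helper that keeps all tags in one string joined by the '::' delimiter, the same value as Artifact.tag_delimiter.

```python
TAG_DELIMITER = '::'  # same value as Artifact.tag_delimiter


def add_tag_to_string(stored: str, tag: str) -> str:
    # Hypothetical helper: split the stored string back into a list
    # (an empty string means there are no tags yet)
    tags = stored.split(TAG_DELIMITER) if stored else []
    # Append only if missing, so order is kept and duplicates are avoided
    if tag not in tags:
        tags.append(tag)
    return TAG_DELIMITER.join(tags)


stored = add_tag_to_string('', 'abalone')
stored = add_tag_to_string(stored, 'regression')
stored = add_tag_to_string(stored, 'abalone')  # duplicate, ignored
print(stored)
```

Because the tags live in a single delimited string, a tag that itself contains '::' would be split into two tags when read back.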
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.arn","title":"arn()
abstractmethod
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_meta","title":"aws_meta()
abstractmethod
","text":"Get the full AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_url","title":"aws_url()
abstractmethod
","text":"AWS console/web interface for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.created","title":"created()
abstractmethod
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete","title":"delete()
abstractmethod
","text":"Delete this artifact including all related AWS objects
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete_metadata","title":"delete_metadata(key_to_delete)
","text":"Delete specific metadata from this artifact Args: key_to_delete (str): Metadata key to delete
Source code in src/sageworks/core/artifacts/artifact.py
def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
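The key-matching step in delete_metadata can be tried in isolation. In this minimal sketch, keys_to_delete is a hypothetical standalone function that mirrors the selection loop above, and the tag names in the example are made up for illustration.

```python
def keys_to_delete(existing_keys, key_to_delete):
    # Match the key itself plus any '<key>_chunk_*' companion keys
    prefix = f'{key_to_delete}_chunk_'
    return [k for k in existing_keys if k == key_to_delete or k.startswith(prefix)]


# Illustrative tag names only, not taken from a real artifact
existing = [
    'sageworks_status',
    'sageworks_shap',
    'sageworks_shap_chunk_0',
    'sageworks_shap_chunk_1',
    'sageworks_shapley',
]
print(keys_to_delete(existing, 'sageworks_shap'))
```

Note that a key which merely shares the prefix, such as sageworks_shapley here, is left alone. Only the exact key and its _chunk_ companions are selected.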
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.details","title":"details()
abstractmethod
","text":"Additional Details about this Artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.exists","title":"exists()
abstractmethod
","text":"Does the Artifact exist? Can we connect to it?
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Artifact when it's ready Returns: list[str]: List of expected metadata keys
Source code in src/sageworks/core/artifacts/artifact.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.generate_valid_name","title":"generate_valid_name(name, delimiter='_', lower_case=True)
staticmethod
","text":"Only allow letters and the specified delimiter, also lowercase the string.
Parameters:
Name | Type | Description | Default
name | str | The name/id string to check. | required
delimiter | str | The delimiter to use in the name/id string (default: \"_\") | '_'
lower_case | bool | Should the name be lowercased? (default: True) | True
Returns:
Name | Type | Description
str | str | A generated valid name/id.
Source code in src/sageworks/core/artifacts/artifact.py
@staticmethod\ndef generate_valid_name(name: str, delimiter: str = \"_\", lower_case: bool = True) -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string.\n\n Args:\n name (str): The name/id string to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n str: A generated valid name/id.\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"])\n if lower_case:\n valid_name = valid_name.lower()\n\n # Replace with the chosen delimiter\n return valid_name.replace(\"_\", delimiter).replace(\"-\", delimiter)\n
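To see what the method does to a few inputs, the logic can be copied into a standalone function. The sketch below follows the source listing above but is not imported from SageWorks, and the input names are made up. Two details are worth noticing: digits are kept as well as letters, and characters such as spaces are dropped instead of being replaced by the delimiter.

```python
def generate_valid_name(name, delimiter='_', lower_case=True):
    # Keep letters, digits, underscores and hyphens; drop everything else
    valid_name = ''.join(c for c in name if c.isalnum() or c in ['_', '-'])
    if lower_case:
        valid_name = valid_name.lower()
    # Normalise both separators to the chosen delimiter
    return valid_name.replace('_', delimiter).replace('-', delimiter)


print(generate_valid_name('Abalone Regression_v2'))
print(generate_valid_name('Abalone_Regression-v2', delimiter='-'))
```

The first call prints abaloneregression_v2 (the space is removed) and the second prints abalone-regression-v2 (both separators become hyphens).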
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_input","title":"get_input()
","text":"Get the input data for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_owner","title":"get_owner()
","text":"Get the owner of this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_status","title":"get_status()
","text":"Get the status for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_tags","title":"get_tags(tag_type='user')
","text":"Get the tags for this artifact Args: tag_type (str): Type of tags to return (user or health) Returns: list[str]: List of tags for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.hash","title":"hash()
abstractmethod
","text":"Return the hash of this artifact, useful for content validation
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef hash(self) -> str:\n \"\"\"Return the hash of this artifact, useful for content validation\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.health_check","title":"health_check()
","text":"Perform a health check on this artifact Returns: list[str]: List of health issues
Source code in src/sageworks/core/artifacts/artifact.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n # FIXME: Revisit AWS URL check\n # if \"unknown\" in self.aws_url():\n # health_issues.append(\"aws_url_unknown\")\n return health_issues\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.is_name_valid","title":"is_name_valid(name, delimiter='_', lower_case=True)
classmethod
","text":"Check if the name adheres to the naming conventions for this Artifact.
Parameters:
Name | Type | Description | Default
name | str | The name/id to check. | required
delimiter | str | The delimiter to use in the name/id string (default: \"_\") | '_'
lower_case | bool | Should the name be lowercased? (default: True) | True
Returns:
Name | Type | Description
bool | bool | True if the name is valid, False otherwise.
Source code in src/sageworks/core/artifacts/artifact.py
@classmethod\ndef is_name_valid(cls, name: str, delimiter: str = \"_\", lower_case: bool = True) -> bool:\n \"\"\"Check if the name adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n bool: True if the name is valid, False otherwise.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)\n if name != valid_name:\n cls.log.warning(f\"Artifact name: '{name}' is not valid. Convert it to something like: '{valid_name}'\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.modified","title":"modified()
abstractmethod
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.onboard","title":"onboard()
abstractmethod
","text":"Onboard this Artifact into SageWorks Returns: bool: True if the Artifact was successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ready","title":"ready()
","text":"Is the Artifact ready? Is initial setup complete and expected metadata populated?
Source code in src/sageworks/core/artifacts/artifact.py
def ready(self) -> bool:\n \"\"\"Is the Artifact ready? Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n
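The readiness test in ready() is a set-superset check: the artifact is ready when every expected metadata key is already present. A standalone sketch, with key names that are made up for illustration:

```python
def is_ready(existing_meta: dict, expected_meta: list) -> bool:
    # Ready only when every expected key is already present in the metadata
    return set(existing_meta.keys()).issuperset(expected_meta)


# Hypothetical key names and values, for illustration only
expected = ['sageworks_status', 'sageworks_input']
partial = {'sageworks_status': 'ready'}
complete = {
    'sageworks_status': 'ready',
    'sageworks_input': 's3://bucket/data.csv',
    'sageworks_owner': 'test',
}
```

Extra keys in the existing metadata do not affect the result; only missing expected keys make the check fail.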
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.refresh_meta","title":"refresh_meta()
abstractmethod
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_meta","title":"remove_sageworks_meta(key_to_remove)
","text":"Remove SageWorks specific metadata from this Artifact Args: key_to_remove (str): The metadata key to remove Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code in src/sageworks/core/artifacts/artifact.py
def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_tag","title":"remove_sageworks_tag(tag, tag_type='user')
","text":"Remove a tag from this artifact if it exists. Args: tag (str): Tag to remove from this artifact tag_type (str): Type of tag to remove (user or health)
Source code in src/sageworks/core/artifacts/artifact.py
def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
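Tags are stored as one delimited string, so removing a tag means split, remove, and re-join. A minimal sketch of that round trip; the delimiter value is an assumption, since the real tag_delimiter is not shown in this section:

```python
TAG_DELIMITER = '::'  # Assumption: the actual tag_delimiter value may differ


def remove_tag(combined_tags: str, tag: str, delimiter: str = TAG_DELIMITER) -> str:
    # Split the stored string into individual tags, dropping empty entries
    tags = [t for t in combined_tags.split(delimiter) if t]
    # Remove the first occurrence of the tag, if present
    if tag in tags:
        tags.remove(tag)
    # Re-join so the result can be written back as a single metadata value
    return delimiter.join(tags)
```

Removing a tag that is not present leaves the stored string unchanged, which matches the guarded remove in the method above.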
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Returns:
Type Description
Union[dict, None] Dictionary of SageWorks metadata for this Artifact
Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources and Graphs; those classes need to override this method.
Source code in src/sageworks/core/artifacts/artifact.py
def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources and Graphs, those classes need to override this method.\n \"\"\"\n return self.meta.get_aws_tags(self.arn())\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_input","title":"set_input(input_data)
","text":"Set the input data for this artifact
Parameters:
Name Type Description Default
input_data str Name of input data for this artifact required
Note: This breaks the official provenance of the artifact, so use with caution.
Source code in src/sageworks/core/artifacts/artifact.py
def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_owner","title":"set_owner(owner)
","text":"Set the owner of this artifact
Parameters:
Name Type Description Default
owner str Owner to set for this artifact required
Source code in src/sageworks/core/artifacts/artifact.py
def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_status","title":"set_status(status)
","text":"Set the status for this artifact Args: status (str): Status to set for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.size","title":"size()
abstractmethod
","text":"Return the size of this artifact in MegaBytes
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.summary","title":"summary()
","text":"This is generic summary information for all Artifacts. If you want to get more detailed information, call the details() method which is implemented by the specific Artifact class
Source code in src/sageworks/core/artifacts/artifact.py
def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n
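The final line of summary() merges two dictionaries by unpacking. Two details of that idiom are worth seeing in isolation: later mappings win on key collisions, and unpacking None raises a TypeError. The values below are made up for illustration, and the None guard is a suggestion rather than something the method does.

```python
basic = {'uuid': 'abalone_data', 'size': 0.5, 'input': 'unknown'}
# Hypothetical metadata; note the colliding input key
sageworks_meta = {'sageworks_owner': 'test', 'input': 's3://bucket/abalone.csv'}

# Later mappings win on key collisions, so sageworks_meta overrides basic
summary = {**basic, **sageworks_meta}

# sageworks_meta() is typed Union[dict, None]; falling back to an empty dict avoids a TypeError
maybe_meta = None
safe_summary = {**basic, **(maybe_meta or {})}
```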
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact Args: new_meta (dict): Dictionary of NEW metadata to add Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code in src/sageworks/core/artifacts/artifact.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Adding Tags to {self.uuid}:{str(new_meta)[:50]}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n try:\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n except Exception as e:\n self.log.error(f\"Error adding metadata to {aws_arn}: {e}\")\n
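The method above converts the metadata dictionary into AWS tags before calling add_tags, which expects a list of Key/Value pairs. A rough sketch of such a conversion follows. The helper name dict_to_tags is hypothetical, and the real dict_to_aws_tags may do more (for example, encoding or splitting large values), which is not shown in this section.

```python
import json


def dict_to_tags(meta: dict) -> list:
    # Hypothetical helper: AWS tag APIs take a list of Key/Value string pairs
    tags = []
    for key, value in meta.items():
        # Non-string values are serialized to JSON so every tag value is a string
        value_str = value if isinstance(value, str) else json.dumps(value)
        tags.append({'Key': key, 'Value': value_str})
    return tags
```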
"},{"location":"core_classes/artifacts/athena_source/","title":"AthenaSource","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the DataSource API Class and voil\u00e0 it works the same.
AthenaSource: SageWorks Data Source accessible through Athena
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource","title":"AthenaSource
","text":" Bases: DataSourceAbstract
AthenaSource: SageWorks Data Source accessible through Athena
Common Usage:
my_data = AthenaSource(data_uuid, database=\"sageworks\")\nmy_data.summary()\nmy_data.details()\ndf = my_data.query(f\"select * from {data_uuid} limit 5\")\n
Source code in src/sageworks/core/artifacts/athena_source.py
class AthenaSource(DataSourceAbstract):\n \"\"\"AthenaSource: SageWorks Data Source accessible through Athena\n\n Common Usage:\n ```python\n my_data = AthenaSource(data_uuid, database=\"sageworks\")\n my_data.summary()\n my_data.details()\n df = my_data.query(f\"select * from {data_uuid} limit 5\")\n ```\n \"\"\"\n\n def __init__(self, data_uuid, database=\"sageworks\", **kwargs):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.is_name_valid(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database, **kwargs)\n\n # Grab our metadata from the Meta class\n self.log.info(f\"Retrieving metadata for: {self.uuid}...\")\n self.data_source_meta = self.meta.data_source(data_uuid, database=database)\n if self.data_source_meta is None:\n self.log.error(f\"Unable to find {database}:{self.table} in Glue Catalogs...\")\n return\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {database}.{self.table}\")\n\n def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n self.data_source_meta = self.meta.data_source(self.uuid, database=self.database)\n\n def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # Are we able to pull AWS Metadata for this table_name?\"\"\"\n # Do we have a valid data_source_meta?\n if getattr(self, \"data_source_meta\", None) is None:\n self.log.debug(f\"AthenaSource {self.table} not found in SageWorks Metadata...\")\n return False\n return True\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = 
f\"arn:aws:glue:{region}:{account_id}:table/{self.database}/{self.table}\"\n return arn\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n if self.data_source_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.table}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Get the SageWorks Metadata from the 'Parameters' section of the DataSource Metadata\n params = self.data_source_meta.get(\"Parameters\", {})\n return {key: decode_value(value) for key, value in params.items() if \"sageworks\" in key}\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n self.log.important(f\"Upserting SageWorks Metadata {self.uuid}:{str(new_meta)[:50]}...\")\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.table}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == 
\"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n else:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n except Exception as e:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto3_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n\n def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.data_source_meta\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.data_source_meta[\"CreateTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.data_source_meta[\"UpdateTime\"]\n\n def hash(self) -> str:\n \"\"\"Get the hash for the set of Parquet files used for this Artifact\"\"\"\n s3_uri = self.s3_storage_location()\n return compute_parquet_hash(s3_uri, self.boto3_session)\n\n def table_hash(self) -> str:\n \"\"\"Get the table hash for this AthenaSource\"\"\"\n s3_scratch = f\"s3://{self.sageworks_bucket}/temp/athena_output\"\n return compute_athena_table_hash(self.database, self.table, self.boto3_session, s3_scratch)\n\n def num_rows(self) -> int:\n 
\"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(f'select count(*) AS sageworks_count from \"{self.database}\".\"{self.table}\"')\n return count_df[\"sageworks_count\"][0] if count_df is not None else 0\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.columns)\n\n @property\n def columns(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.data_source_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n @property\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.data_source_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n\n # Call internal class _query method\n return self.database_query(self.database, query)\n\n @classmethod\n def database_query(cls, database: str, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Specify the Database and Query the Athena Service\n\n Args:\n database (str): The Athena Database to query\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n cls.log.debug(f\"Executing Query: {query}...\")\n try:\n df = wr.athena.read_sql_query(\n sql=query,\n database=database,\n ctas_approach=False,\n boto3_session=cls.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n cls.log.debug(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n except wr.exceptions.QueryFailed as e:\n cls.log.critical(f\"Failed to execute query: {e}\")\n return None\n\n def execute_statement(self, query: str, silence_errors: bool = False):\n 
\"\"\"Execute a non-returning SQL statement in Athena with retries.\n\n Args:\n query (str): The query to run against the AthenaSource\n silence_errors (bool): Silence errors (default: False)\n \"\"\"\n attempt = 0\n max_retries = 3\n retry_delay = 10\n while attempt < max_retries:\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.database,\n boto3_session=self.boto3_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto3_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n break # If successful, exit the retry loop\n except wr.exceptions.QueryFailed as e:\n if \"AlreadyExistsException\" in str(e):\n self.log.warning(f\"Table already exists: {e} \\nIgnoring...\")\n break # No need to retry for this error\n elif \"ConcurrentModificationException\" in str(e):\n self.log.warning(f\"Concurrent modification detected: {e}\\nRetrying...\")\n attempt += 1\n if attempt < max_retries:\n time.sleep(retry_delay)\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement after {max_retries} attempts: {e}\")\n raise\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement: {e}\")\n raise\n\n def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.data_source_meta[\"StorageDescriptor\"][\"Location\"]\n\n def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f'select count(*) as sageworks_count from \"{self.table}\"'\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.database,\n ctas_approach=False,\n boto3_session=self.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: 
{scanned_bytes})\")\n\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict and not recompute:\n return stat_dict\n\n # Call the SQL function to compute descriptive stats\n stat_dict = sql.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n\n @cache_dataframe(\"sample\")\n def sample(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sql.sample_rows(self)\n\n @cache_dataframe(\"outliers\")\n def outliers(self, scale: float = 1.5, use_stddev=False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = sql.outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n\n @cache_dataframe(\"smart_sample\")\n def smart_sample(self, 
recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Args:\n recompute (bool): Recompute the smart sample (default: False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n\n # Compute/recompute the smart sample\n self.log.important(f\"Computing Smart Sample {self.uuid}...\")\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n\n # Return the smart_sample data\n return all_rows\n\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = sql.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): 
Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = sql.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = sql.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the 
details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n self.log.info(f\"Computing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f'select * from \"{self.database}.{self.table}\" limit 10'\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.database, boto3_session=self.boto3_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an AthenaSource that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the AthenaSource\n AthenaSource.managed_delete(self.uuid, database=self.database)\n\n @classmethod\n def managed_delete(cls, data_source_name: str, database: str = \"sageworks\"):\n \"\"\"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects\n\n Args:\n data_source_name (str): Name of DataSource (AthenaSource)\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n table = data_source_name # The table name is the same as the data_source_name\n\n 
# Check if the Glue Catalog Table exists\n if not wr.catalog.does_table_exist(database, table, boto3_session=cls.boto3_session):\n cls.log.info(f\"DataSource {table} not found in database {database}.\")\n return\n\n # Delete any views associated with this AthenaSource\n cls.delete_views(table, database)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make an AWS Query to get the S3 storage location\n s3_path = wr.catalog.get_table_location(database, table, boto3_session=cls.boto3_session)\n\n # Delete Data Catalog Table\n cls.log.info(f\"Deleting DataCatalog Table: {database}.{table}...\")\n wr.catalog.delete_table_if_exists(database, table, boto3_session=cls.boto3_session)\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n cls.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=cls.boto3_session)\n except Exception as e:\n cls.log.error(f\"Failure when trying to delete {data_source_name}: {e}\")\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(data_source_name)\n\n @classmethod\n def delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
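The retry logic in execute_statement can be hard to follow inside the full class listing. The sketch below isolates the pattern: retry only on a specific, transient error, and re-raise once the attempt budget is spent. The exception class and function names are stand-ins invented for this example, not part of SageWorks or awswrangler.

```python
import time


class ConcurrentModificationError(Exception):
    # Stand-in for the transient Athena failure that is worth retrying
    pass


def run_with_retries(action, max_retries: int = 3, retry_delay: float = 0.0):
    # Call action() until it succeeds or the retry budget is used up
    attempt = 0
    while True:
        try:
            return action()
        except ConcurrentModificationError:
            attempt += 1
            if attempt >= max_retries:
                raise  # Out of attempts: surface the last error
            time.sleep(retry_delay)


calls = {'count': 0}


def flaky_statement():
    # Fails twice, then succeeds, to exercise the retry path
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConcurrentModificationError('table is being modified')
    return 'ok'


result = run_with_retries(flaky_statement)
```

Any other exception type propagates immediately, which mirrors the non-retry branch in the method above.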
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_types","title":"column_types: list[str]
property
","text":"Return the column types of the internal AthenaSource
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.columns","title":"columns: list[str]
property
","text":"Return the column names for this Athena Table
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.__init__","title":"__init__(data_uuid, database='sageworks', **kwargs)
","text":"AthenaSource Initialization
Parameters:
Name Type Description Default
data_uuid str Name of Athena Table required
database str Athena Database Name (default: sageworks) 'sageworks'
Source code in src/sageworks/core/artifacts/athena_source.py
def __init__(self, data_uuid, database=\"sageworks\", **kwargs):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.is_name_valid(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database, **kwargs)\n\n # Grab our metadata from the Meta class\n self.log.info(f\"Retrieving metadata for: {self.uuid}...\")\n self.data_source_meta = self.meta.data_source(data_uuid, database=database)\n if self.data_source_meta is None:\n self.log.error(f\"Unable to find {database}:{self.table} in Glue Catalogs...\")\n return\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {database}.{self.table}\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.database}/{self.table}\"\n return arn\n
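The ARN built by arn() follows the Glue table layout arn:aws:glue:region:account:table/database/table. A standalone version of the same f-string, with placeholder account and region values chosen only for illustration:

```python
def glue_table_arn(region: str, account_id: str, database: str, table: str) -> str:
    # Same layout as the f-string in arn(): region, account id, then table/<database>/<table>
    return f'arn:aws:glue:{region}:{account_id}:table/{database}/{table}'


# Placeholder account id and region, for illustration only
arn = glue_table_arn('us-west-2', '123456789012', 'sageworks', 'abalone_data')
```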
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.athena_test_query","title":"athena_test_query()
","text":"Validate that Athena Queries are working
Source code in src/sageworks/core/artifacts/athena_source.py
def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f'select count(*) as sageworks_count from \"{self.table}\"'\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.database,\n ctas_approach=False,\n boto3_session=self.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_meta","title":"aws_meta()
","text":"Get the FULL AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.data_source_meta\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code in src/sageworks/core/artifacts/athena_source.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the column stats (default: False) | False
Returns:
Type | Description
dict | A dictionary of stats for each column, in this format: {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}, 'correlations': {...}}, ...}
NB: String columns will NOT have num_zeros, descriptive_stats or correlation data
Source code in src/sageworks/core/artifacts/athena_source.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = sql.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n
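The cache-or-recompute pattern used here (return the stored value unless recompute is set, otherwise compute and store) can be sketched in isolation. This is a minimal illustration, not SageWorks code: CachedStats, its in-memory _meta dict, and the compute callable are hypothetical stand-ins for the artifact, its stored metadata, and the SQL helper.

```python
class CachedStats:
    '''Hypothetical stand-in for an artifact that stores computed stats in metadata'''

    def __init__(self, compute):
        self._meta = {}          # Plays the role of the stored metadata
        self._compute = compute  # Expensive function, e.g. a SQL aggregation
        self.compute_calls = 0   # Counter so the caching behavior is observable

    def column_stats(self, recompute=False):
        # Return the stored value unless the caller forces a recompute
        cached = self._meta.get('sageworks_column_stats')
        if cached and not recompute:
            return cached

        # Compute, store, and return the fresh value
        self.compute_calls += 1
        stats = self._compute()
        self._meta['sageworks_column_stats'] = stats
        return stats


source = CachedStats(lambda: {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}})
first = source.column_stats()                 # Computed and stored
second = source.column_stats()                # Served from the stored metadata
third = source.column_stats(recompute=True)   # Forced recompute
```

One consequence of the truthiness check is that an empty result is never treated as cached, so it would be recomputed on every call.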
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.correlations","title":"correlations(recompute=False)
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the correlations (default: False) | False
Returns:
Type | Description
dict | A dictionary of correlations for each column, in this format: {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/core/artifacts/athena_source.py
def correlations(self, recompute: bool = False) -> dict[dict]:\n    \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n    Args:\n        recompute (bool): Recompute the correlations (default: False)\n\n    Returns:\n        dict(dict): A dictionary of correlations for each column in this format\n            {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n             'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n    \"\"\"\n\n    # First check if we have already computed the correlations\n    correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n    if correlations_dict and not recompute:\n        return correlations_dict\n\n    # Call the SQL function to compute correlations\n    correlations_dict = sql.correlations(self)\n\n    # Push the correlation data into our DataSource Metadata\n    self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n    # Return the correlation data\n    return correlations_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/athena_source.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.data_source_meta[\"CreateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.database_query","title":"database_query(database, query)
classmethod
","text":"Specify the Database and Query the Athena Service
Parameters:
Name | Type | Description | Default
database | str | The Athena Database to query | required
query | str | The query to run against the AthenaSource | required
Returns:
Type | Description
Union[DataFrame, None] | pd.DataFrame: The results of the query
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef database_query(cls, database: str, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Specify the Database and Query the Athena Service\n\n Args:\n database (str): The Athena Database to query\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n cls.log.debug(f\"Executing Query: {query}...\")\n try:\n df = wr.athena.read_sql_query(\n sql=query,\n database=database,\n ctas_approach=False,\n boto3_session=cls.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n cls.log.debug(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n except wr.exceptions.QueryFailed as e:\n cls.log.critical(f\"Failed to execute query: {e}\")\n return None\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete","title":"delete()
","text":"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects
Source code in src/sageworks/core/artifacts/athena_source.py
def delete(self):\n \"\"\"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an AthenaSource that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the AthenaSource\n AthenaSource.managed_delete(self.uuid, database=self.database)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete_views","title":"delete_views(table, database)
classmethod
","text":"Delete any views associated with this FeatureSet
Parameters:
Name | Type | Description | Default
table | str | Name of Athena Table | required
database | str | Athena Database Name | required
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef delete_views(cls, table: str, database: str):\n    \"\"\"Delete any views associated with this AthenaSource\n\n    Args:\n        table (str): Name of Athena Table\n        database (str): Athena Database Name\n    \"\"\"\n    from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n    delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the descriptive stats (default: False) | False
Returns:
Type | Description
dict | A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code in src/sageworks/core/artifacts/athena_source.py
def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict and not recompute:\n return stat_dict\n\n # Call the SQL function to compute descriptive stats\n stat_dict = sql.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.details","title":"details(recompute=False)
","text":"Additional Details about this AthenaSource Artifact
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the details (default: False) | False
Returns:
Type | Description
dict | A dictionary of details about this AthenaSource
Source code in src/sageworks/core/artifacts/athena_source.py
def details(self, recompute: bool = False) -> dict[dict]:\n    \"\"\"Additional Details about this AthenaSource Artifact\n\n    Args:\n        recompute (bool): Recompute the details (default: False)\n\n    Returns:\n        dict(dict): A dictionary of details about this AthenaSource\n    \"\"\"\n    self.log.info(f\"Computing DataSource Details ({self.uuid})...\")\n\n    # Get the details from the base class\n    details = super().details()\n\n    # Compute additional details\n    details[\"s3_storage_location\"] = self.s3_storage_location()\n    details[\"storage_type\"] = \"athena\"\n\n    # Compute our AWS URL (database and table are quoted as separate identifiers)\n    query = f'select * from \"{self.database}\".\"{self.table}\" limit 10'\n    query_exec_id = wr.athena.start_query_execution(\n        sql=query, database=self.database, boto3_session=self.boto3_session\n    )\n    base_url = \"https://console.aws.amazon.com/athena/home\"\n    details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n    # Push the aws_url data into our DataSource Metadata\n    # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n    # self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n    # Convert any datetime fields to ISO-8601 strings\n    details = convert_all_to_iso8601(details)\n\n    # Add the column stats\n    details[\"column_stats\"] = self.column_stats()\n\n    # Return the details data\n    return details\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.execute_statement","title":"execute_statement(query, silence_errors=False)
","text":"Execute a non-returning SQL statement in Athena with retries.
Parameters:
Name | Type | Description | Default
query | str | The query to run against the AthenaSource | required
silence_errors | bool | Silence errors (default: False) | False
Source code in src/sageworks/core/artifacts/athena_source.py
def execute_statement(self, query: str, silence_errors: bool = False):\n \"\"\"Execute a non-returning SQL statement in Athena with retries.\n\n Args:\n query (str): The query to run against the AthenaSource\n silence_errors (bool): Silence errors (default: False)\n \"\"\"\n attempt = 0\n max_retries = 3\n retry_delay = 10\n while attempt < max_retries:\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.database,\n boto3_session=self.boto3_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto3_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n break # If successful, exit the retry loop\n except wr.exceptions.QueryFailed as e:\n if \"AlreadyExistsException\" in str(e):\n self.log.warning(f\"Table already exists: {e} \\nIgnoring...\")\n break # No need to retry for this error\n elif \"ConcurrentModificationException\" in str(e):\n self.log.warning(f\"Concurrent modification detected: {e}\\nRetrying...\")\n attempt += 1\n if attempt < max_retries:\n time.sleep(retry_delay)\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement after {max_retries} attempts: {e}\")\n raise\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement: {e}\")\n raise\n
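The bounded retry loop above (retry only the transient failure, sleep a fixed delay between attempts, re-raise once attempts are exhausted) can be sketched on its own. This is an illustrative sketch under stated assumptions: TransientError and run_with_retries are hypothetical names, not part of SageWorks or awswrangler, and the injectable sleep argument exists only to keep the example fast.

```python
import time


class TransientError(Exception):
    '''Hypothetical stand-in for a retryable failure such as a concurrent modification'''


def run_with_retries(action, max_retries=3, retry_delay=0.0, sleep=time.sleep):
    '''Call action() until it succeeds or max_retries attempts have failed'''
    attempt = 0
    while True:
        try:
            return action()
        except TransientError:
            attempt += 1
            if attempt >= max_retries:
                raise  # Out of attempts: surface the last error to the caller
            sleep(retry_delay)  # Fixed delay between attempts


# Usage: fail twice, then succeed on the third attempt
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TransientError('concurrent modification')
    return 'ok'


result = run_with_retries(flaky)
```

Errors that are not TransientError propagate immediately, which mirrors the idea that only the concurrent-modification case is worth retrying.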
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.exists","title":"exists()
","text":"Validation Checks for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def exists(self) -> bool:\n    \"\"\"Validation Checks for this Data Source\"\"\"\n\n    # Are we able to pull AWS Metadata for this table_name?\n    # Do we have a valid data_source_meta?\n    if getattr(self, \"data_source_meta\", None) is None:\n        self.log.debug(f\"AthenaSource {self.table} not found in SageWorks Metadata...\")\n        return False\n    return True\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.hash","title":"hash()
","text":"Get the hash for the set of Parquet files used for this Artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def hash(self) -> str:\n \"\"\"Get the hash for the set of Parquet files used for this Artifact\"\"\"\n s3_uri = self.s3_storage_location()\n return compute_parquet_hash(s3_uri, self.boto3_session)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.managed_delete","title":"managed_delete(data_source_name, database='sageworks')
classmethod
","text":"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects
Parameters:
Name | Type | Description | Default
data_source_name | str | Name of DataSource (AthenaSource) | required
database | str | Athena Database Name (default: sageworks) | 'sageworks'
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef managed_delete(cls, data_source_name: str, database: str = \"sageworks\"):\n \"\"\"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects\n\n Args:\n data_source_name (str): Name of DataSource (AthenaSource)\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n table = data_source_name # The table name is the same as the data_source_name\n\n # Check if the Glue Catalog Table exists\n if not wr.catalog.does_table_exist(database, table, boto3_session=cls.boto3_session):\n cls.log.info(f\"DataSource {table} not found in database {database}.\")\n return\n\n # Delete any views associated with this AthenaSource\n cls.delete_views(table, database)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make an AWS Query to get the S3 storage location\n s3_path = wr.catalog.get_table_location(database, table, boto3_session=cls.boto3_session)\n\n # Delete Data Catalog Table\n cls.log.info(f\"Deleting DataCatalog Table: {database}.{table}...\")\n wr.catalog.delete_table_if_exists(database, table, boto3_session=cls.boto3_session)\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n cls.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=cls.boto3_session)\n except Exception as e:\n cls.log.error(f\"Failure when trying to delete {data_source_name}: {e}\")\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(data_source_name)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/athena_source.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.data_source_meta[\"UpdateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_columns","title":"num_columns()
","text":"Return the number of columns for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.columns)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_rows","title":"num_rows()
","text":"Return the number of rows for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(f'select count(*) AS sageworks_count from \"{self.database}\".\"{self.table}\"')\n return count_df[\"sageworks_count\"][0] if count_df is not None else 0\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.outliers","title":"outliers(scale=1.5, use_stddev=False)
","text":"Compute outliers for all the numeric columns in a DataSource
Parameters:
Name | Type | Description | Default
scale | float | The scale to use for the IQR (default: 1.5) | 1.5
use_stddev | bool | Use Standard Deviation instead of IQR (default: False) | False
Returns:
Type | Description
DataFrame | pd.DataFrame: A DataFrame of outliers from this DataSource
Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma). The scale parameter can be adjusted to change the IQR multiplier.
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"outliers\")\ndef outliers(self, scale: float = 1.5, use_stddev=False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = sql.outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n
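The relationship between the IQR multiplier and a sigma cutoff can be checked directly for normally distributed data. The sketch below is standalone and uses only the Python standard library; iqr_fence_in_sigmas is a hypothetical helper, not part of SageWorks, and the normal-distribution assumption is ours. Under that assumption the upper fence sits at Q3 + scale * IQR, which works out to roughly 2.7 sigma for a scale of 1.5 and roughly 3 sigma for a scale of 1.7.

```python
from statistics import NormalDist


def iqr_fence_in_sigmas(scale=1.5):
    '''Upper IQR fence for a standard normal distribution, in units of sigma'''
    q3 = NormalDist().inv_cdf(0.75)  # Third quartile of the standard normal
    iqr = 2 * q3                     # Q1 equals -q3 by symmetry
    return q3 + scale * iqr


fence_default = iqr_fence_in_sigmas(1.5)  # Multiplier used by default
fence_wide = iqr_fence_in_sigmas(1.7)     # Multiplier suggested for about 3 sigma
```

Real data is rarely exactly normal, so these figures are a rule of thumb for choosing scale rather than a guarantee about how many rows get flagged.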
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name | Type | Description | Default
query | str | The query to run against the AthenaSource | required
Returns:
Type | Description
Union[DataFrame, None] | pd.DataFrame: The results of the query
Source code in src/sageworks/core/artifacts/athena_source.py
def query(self, query: str) -> Union[pd.DataFrame, None]:\n    \"\"\"Query the AthenaSource\n\n    Args:\n        query (str): The query to run against the AthenaSource\n\n    Returns:\n        pd.DataFrame: The results of the query (None if the query fails)\n    \"\"\"\n\n    # Delegate to the database_query class method using our own database\n    return self.database_query(self.database, query)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.refresh_meta","title":"refresh_meta()
","text":"Refresh our internal AWS Broker catalog metadata
Source code in src/sageworks/core/artifacts/athena_source.py
def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n self.data_source_meta = self.meta.data_source(self.uuid, database=self.database)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.s3_storage_location","title":"s3_storage_location()
","text":"Get the S3 Storage Location for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.data_source_meta[\"StorageDescriptor\"][\"Location\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n if self.data_source_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.table}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Get the SageWorks Metadata from the 'Parameters' section of the DataSource Metadata\n params = self.data_source_meta.get(\"Parameters\", {})\n return {key: decode_value(value) for key, value in params.items() if \"sageworks\" in key}\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sample","title":"sample()
","text":"Pull a sample of rows from the DataSource
Returns:
Type | Description
DataFrame | pd.DataFrame: A sample DataFrame for an Athena DataSource
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"sample\")\ndef sample(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sql.sample_rows(self)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/athena_source.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto3_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.smart_sample","title":"smart_sample(recompute=False)
","text":"Get a smart sample dataframe for this DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the smart sample (default: False) | False
Returns:
Type | Description
DataFrame | pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"smart_sample\")\ndef smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Args:\n recompute (bool): Recompute the smart sample (default: False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n\n # Compute/recompute the smart sample\n self.log.important(f\"Computing Smart Sample {self.uuid}...\")\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n\n # Return the smart_sample data\n return all_rows\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.table_hash","title":"table_hash()
","text":"Get the table hash for this AthenaSource
Source code in src/sageworks/core/artifacts/athena_source.py
def table_hash(self) -> str:\n \"\"\"Get the table hash for this AthenaSource\"\"\"\n s3_scratch = f\"s3://{self.sageworks_bucket}/temp/athena_output\"\n return compute_athena_table_hash(self.database, self.table, self.boto3_session, s3_scratch)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact
Parameters:
Name | Type | Description | Default
new_meta | dict | Dictionary of new metadata to add | required
Source code in src/sageworks/core/artifacts/athena_source.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n self.log.important(f\"Upserting SageWorks Metadata {self.uuid}:{str(new_meta)[:50]}...\")\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.table}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n else:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n except Exception as e:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n
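Because table parameters are stored as strings, non-string values are serialized to JSON on the way in and decoded again when the metadata is read back. The round trip can be sketched with the standard json module. This is an assumption-laden illustration: encode_meta and decode_meta are hypothetical helpers, and the real decode_value and CustomEncoder used by SageWorks may behave differently.

```python
import json


def encode_meta(new_meta):
    '''Convert non-string values to JSON strings, leaving plain strings untouched'''
    return {key: value if isinstance(value, str) else json.dumps(value) for key, value in new_meta.items()}


def decode_meta(params):
    '''Keep only sageworks keys and decode JSON values where possible'''
    decoded = {}
    for key, value in params.items():
        if 'sageworks' not in key:
            continue  # Skip parameters that belong to AWS or other tools
        try:
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            decoded[key] = value  # Not valid JSON: keep the plain string
    return decoded


params = encode_meta({'sageworks_column_stats': {'col1': {'nulls': 12}}, 'sageworks_owner': 'test'})
params['classification'] = 'parquet'  # A non-SageWorks table parameter
meta = decode_meta(params)
```

A caveat of this simplified decoder is that a plain string which happens to be valid JSON, such as '123', comes back as a number rather than a string.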
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.value_counts","title":"value_counts(recompute=False)
","text":"Compute 'value_counts' for all the string columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the value counts (default: False) | False
Returns:
Type | Description
dict | A dictionary of value counts for each column in the form {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...}, 'col2': ...}
Source code in src/sageworks/core/artifacts/athena_source.py
def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = sql.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n
"},{"location":"core_classes/artifacts/data_source_abstract/","title":"DataSource Abstract","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the DataSource API Class and voil\u00e0, it works the same.
The DataSource Abstract class is an abstract base class that defines the API implemented by all child classes (currently just AthenaSource, but later RDSSource, FutureThing).
DataSourceAbstract: Abstract Base Class for all data sources (S3: CSV, JSONL, Parquet, RDS, etc)
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract","title":"DataSourceAbstract
","text":" Bases: Artifact
src/sageworks/core/artifacts/data_source_abstract.py
class DataSourceAbstract(Artifact):\n def __init__(self, data_uuid: str, database: str = \"sageworks\", **kwargs):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, **kwargs)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n\n def __post_init__(self):\n # Call superclass post_init\n super().__post_init__()\n\n @deprecated(version=\"0.9\")\n def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n @property\n def database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n @property\n def table(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n\n @abstractmethod\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n\n @property\n @abstractmethod\n def columns(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n\n @property\n @abstractmethod\n def column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n\n def column_details(self) -> dict:\n \"\"\"Return the column details for this Data Source\n\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n\n def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self)\n\n def view(self, view_name: str) -> \"View\":\n \"\"\"Return a DataFrame for a specific view\n Args:\n view_name (str): The name of the view 
to return\n Returns:\n pd.DataFrame: A DataFrame for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n\n def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n\n def set_computation_columns(self, computation_columns: list[str], recompute_stats: bool = True):\n \"\"\"Set the computation columns for this Data Source\n\n Args:\n computation_columns (list[str]): The computation columns for this Data Source\n recompute_stats (bool): Recomputes all the stats for this Data Source (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n if recompute_stats:\n self.recompute_stats()\n\n def _create_display_view(self):\n \"\"\"Internal: Create the Display View for this DataSource\"\"\"\n from sageworks.core.views import View\n\n View(self, \"display\")\n\n @abstractmethod\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n @abstractmethod\n def execute_statement(self, query: str):\n \"\"\"Execute an SQL statement that doesn't return a 
result\n Args:\n query(str): The SQL statement to execute\n \"\"\"\n pass\n\n @abstractmethod\n def sample(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n\n @abstractmethod\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def outliers(self, scale: float = 1.5) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n\n @abstractmethod\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n\n @abstractmethod\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n 
dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n\n @abstractmethod\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n pass\n\n def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"column_details\"] = self.column_details()\n return details\n\n def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n # FIXME: Revisit this\n # \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n\n def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # If we don't have a smart_sample we're probably not ready\n if not self.df_cache.check(f\"{self.uuid}/smart_sample\"):\n self.log.warning(f\"DataSource {self.uuid} not ready...\")\n return False\n\n # Okay so we have sample, outliers, and smart_sample so we are ready\n return True\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source 
(make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Make sure our display view actually exists\n self.view(\"display\").ensure_exists()\n\n # Recompute the stats\n self.recompute_stats()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the data source\n\n Returns:\n bool: True if the DataSource stats were recomputed successfully\n \"\"\"\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n\n # Make sure our computation view actually exists\n self.view(\"computation\").ensure_exists()\n\n # Compute the sample, column stats, outliers, and smart_sample\n self.df_cache.delete(f\"{self.uuid}/sample\")\n self.sample()\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.df_cache.delete(f\"{self.uuid}/outliers\")\n self.outliers()\n self.df_cache.delete(f\"{self.uuid}/smart_sample\")\n self.smart_sample()\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_types","title":"column_types: list[str]
abstractmethod
property
","text":"Return the column types for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.columns","title":"columns: list[str]
abstractmethod
property
","text":"Return the column names for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.database","title":"database: str
property
","text":"Get the database for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.table","title":"table: str
property
","text":"Get the base table name for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.__init__","title":"__init__(data_uuid, database='sageworks', **kwargs)
","text":"DataSourceAbstract: Abstract Base Class for all data sources Args: data_uuid(str): The UUID for this Data Source database(str): The database to use for this Data Source (default: sageworks)
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def __init__(self, data_uuid: str, database: str = \"sageworks\", **kwargs):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, **kwargs)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_details","title":"column_details()
","text":"Return the column details for this Data Source
Returns:
Name Type Description
dict
dict
The column details for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def column_details(self) -> dict:\n \"\"\"Return the column details for this Data Source\n\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n
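As a rough illustration of what column_details() returns, the sketch below pairs column names with column types using the same dict(zip(...)) idiom. The column names and type strings are hypothetical stand-ins for whatever a concrete DataSource reports, and the free function is only an assumption for demonstration, not part of SageWorks.

```python
# Hypothetical column names and types, standing in for the abstract
# `columns` and `column_types` properties of a concrete DataSource
columns = ['id', 'name', 'price']
column_types = ['bigint', 'string', 'double']


def column_details(columns: list[str], column_types: list[str]) -> dict:
    # Pair each column name with its type, in order
    return dict(zip(columns, column_types))


details = column_details(columns, column_types)
print(details)
```

Note that zip() stops at the shorter input, so a mismatch in length between the two lists silently drops the trailing entries.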
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_stats","title":"column_stats(recompute=False)
abstractmethod
","text":"Compute Column Stats for all the columns in a DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.correlations","title":"correlations(recompute=False)
abstractmethod
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
Name Type Description Default
recompute
bool
Recompute the correlations (default: False)
False
Returns:
Name Type Description
dict
dict
A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the correlations (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.descriptive_stats","title":"descriptive_stats(recompute=False)
abstractmethod
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: recompute (bool): Recompute the descriptive stats (default: False) Returns: dict(dict): A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n
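A minimal sketch of the five-number summary in the documented return format, assuming the data is already in memory as lists of numbers. The helper name and the choice of the inclusive quantile method are assumptions for illustration; concrete DataSource classes may compute these values differently (for example, in SQL).

```python
from statistics import quantiles


def descriptive_stats(data: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    stats = {}
    for name, values in data.items():
        # n=4 yields the three quartile cut points: q1, median, q3
        q1, median, q3 = quantiles(values, n=4, method='inclusive')
        stats[name] = {
            'min': min(values),
            'q1': q1,
            'median': median,
            'q3': q3,
            'max': max(values),
        }
    return stats


# Hypothetical numeric column
stats = descriptive_stats({'col1': [0, 1, 2, 3, 4]})
print(stats['col1'])
```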
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.details","title":"details()
","text":"Additional Details about this DataSourceAbstract Artifact
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"column_details\"] = self.column_details()\n return details\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.execute_statement","title":"execute_statement(query)
abstractmethod
","text":"Execute an SQL statement that doesn't return a result Args: query(str): The SQL statement to execute
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef execute_statement(self, query: str):\n \"\"\"Execute an SQL statement that doesn't return a result\n Args:\n query(str): The SQL statement to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.expected_meta","title":"expected_meta()
","text":"DataSources have quite a bit of expected Metadata for EDA displays
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n # FIXME: Revisit this\n # \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_database","title":"get_database()
","text":"Get the database for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@deprecated(version=\"0.9\")\ndef get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_columns","title":"num_columns()
abstractmethod
","text":"Return the number of columns for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_rows","title":"num_rows()
abstractmethod
","text":"Return the number of rows for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the data source (make it ready)
Returns:
Name Type Description
bool
bool
True if the DataSource was onboarded successfully
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Make sure our display view actually exists\n self.view(\"display\").ensure_exists()\n\n # Recompute the stats\n self.recompute_stats()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers","title":"outliers(scale=1.5)
abstractmethod
","text":"Return a DataFrame of outliers from this DataSource
Parameters:
Name Type Description Default
scale
float
The scale to use for the IQR (default: 1.5)
1.5
Returns:
Type Description
DataFrame
pd.DataFrame: A DataFrame of outliers from this DataSource
Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers. The scale parameter can be adjusted to change the IQR multiplier.
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef outliers(self, scale: float = 1.5) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n
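The IQR rule mentioned in the Notes can be sketched as follows: values below q1 - scale * IQR or above q3 + scale * IQR are flagged. The function name, the plain list input, and the inclusive quantile method are assumptions for illustration; the real method returns a pd.DataFrame and is implemented by concrete DataSource classes.

```python
from statistics import quantiles


def iqr_outliers(values: list[float], scale: float = 1.5) -> list[float]:
    # Quartiles of the data (q1 and q3); the median is not needed here
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    # Anything outside [lower, upper] is treated as an outlier
    return [v for v in values if v < lower or v > upper]


# Hypothetical column with one obviously extreme value
values = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(values))
print(iqr_outliers(values, scale=100.0))
```

A larger scale widens the accepted range, so fewer rows are reported as outliers.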
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.query","title":"query(query)
abstractmethod
","text":"Query the DataSourceAbstract Args: query(str): The SQL query to execute
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.ready","title":"ready()
","text":"Is the DataSource ready?
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # If we don't have a smart_sample we're probably not ready\n if not self.df_cache.check(f\"{self.uuid}/smart_sample\"):\n self.log.warning(f\"DataSource {self.uuid} not ready...\")\n return False\n\n # Okay so we have sample, outliers, and smart_sample so we are ready\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.recompute_stats","title":"recompute_stats()
","text":"This is a BLOCKING method that will recompute the stats for the data source
Returns:
Name Type Description
bool
bool
True if the DataSource stats were recomputed successfully
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the data source\n\n Returns:\n bool: True if the DataSource stats were recomputed successfully\n \"\"\"\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n\n # Make sure our computation view actually exists\n self.view(\"computation\").ensure_exists()\n\n # Compute the sample, column stats, outliers, and smart_sample\n self.df_cache.delete(f\"{self.uuid}/sample\")\n self.sample()\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.df_cache.delete(f\"{self.uuid}/outliers\")\n self.outliers()\n self.df_cache.delete(f\"{self.uuid}/smart_sample\")\n self.smart_sample()\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample","title":"sample()
abstractmethod
","text":"Return a sample DataFrame from this DataSourceAbstract
Returns:
Type Description
DataFrame
pd.DataFrame: A sample DataFrame from this DataSource
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef sample(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_computation_columns","title":"set_computation_columns(computation_columns, recompute_stats=True)
","text":"Set the computation columns for this Data Source
Parameters:
Name Type Description Default
computation_columns
list[str]
The computation columns for this Data Source
required
recompute_stats
bool
Recomputes all the stats for this Data Source (default: True)
True
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_computation_columns(self, computation_columns: list[str], recompute_stats: bool = True):\n \"\"\"Set the computation columns for this Data Source\n\n Args:\n computation_columns (list[str]): The computation columns for this Data Source\n recompute_stats (bool): Recomputes all the stats for this Data Source (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n if recompute_stats:\n self.recompute_stats()\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_display_columns","title":"set_display_columns(diplay_columns)
","text":"Set the display columns for this Data Source
Parameters:
Name Type Description Default
diplay_columns
list[str]
The display columns for this Data Source
required
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n
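The mismatch check at the top of set_display_columns() can be tried in isolation. The sketch below is a hypothetical standalone helper with made-up column names; in SageWorks the computation columns come from the computation view, and a mismatch is only logged, not rejected.

```python
def find_mismatch_columns(display_columns: list[str], computation_columns: list[str]) -> list[str]:
    # Display columns that are missing from the computation view's columns
    return [col for col in display_columns if col not in computation_columns]


# Hypothetical column lists
computation_columns = ['id', 'price', 'quantity']
display_columns = ['id', 'price', 'notes']
mismatch = find_mismatch_columns(display_columns, computation_columns)
print(mismatch)
```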
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.smart_sample","title":"smart_sample()
abstractmethod
","text":"Get a SMART sample dataframe from this DataSource Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.value_counts","title":"value_counts(recompute=False)
abstractmethod
","text":"Compute 'value_counts' for all the string columns in a DataSource Args: recompute (bool): Recompute the value counts (default: False) Returns: dict(dict): A dictionary of value counts for each column in the form {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...}, 'col2': ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n
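The documented value_counts format can be sketched with collections.Counter. The helper name and the in-memory input are assumptions for illustration only; concrete DataSource classes decide how the counts are actually computed and cached.

```python
from collections import Counter


def value_counts(data: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    # For each string column, count how often each distinct value occurs
    return {name: dict(Counter(values)) for name, values in data.items()}


# Hypothetical string columns
counts = value_counts({
    'color': ['red', 'blue', 'red'],
    'size': ['S', 'M', 'M', 'M'],
})
print(counts)
```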
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.view","title":"view(view_name)
","text":"Return a View object for a specific view Args: view_name (str): The name of the view to return Returns: View: The View object for the specified view
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def view(self, view_name: str) -> \"View\":\n \"\"\"Return a View object for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n View: The View object for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.views","title":"views()
","text":"Return the views for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self)\n
"},{"location":"core_classes/artifacts/endpoint_core/","title":"EndpointCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Endpoint API Class and voil\u00e0 it works the same.
EndpointCore: SageWorks EndpointCore Class
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore","title":"EndpointCore
","text":" Bases: Artifact
EndpointCore: SageWorks EndpointCore Class
Common Usage:
my_endpoint = EndpointCore(endpoint_uuid)\nprediction_df = my_endpoint.predict(test_df)\nmetrics = my_endpoint.regression_metrics(target_column, prediction_df)\nfor metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n
Source code in src/sageworks/core/artifacts/endpoint_core.py
class EndpointCore(Artifact):\n \"\"\"EndpointCore: SageWorks EndpointCore Class\n\n Common Usage:\n ```python\n my_endpoint = EndpointCore(endpoint_uuid)\n prediction_df = my_endpoint.predict(test_df)\n metrics = my_endpoint.regression_metrics(target_column, prediction_df)\n for metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid, **kwargs):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n self.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid, **kwargs)\n\n # Grab an Cloud Metadata object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n\n # We temporary cache the endpoint metrics\n self.temp_storage = Cache(prefix=\"temp_storage\", expire=300) # 
5 minutes\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # Does this endpoint have a config?\n # Note: This is not an authoritative check, so improve later\n if self.endpoint_meta.get(\"ProductionVariants\") is None:\n health_issues.append(\"no_config\")\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n\n def is_serverless(self) -> bool:\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in 
self.endpoint_meta[\"InstanceType\"]\n\n def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n\n def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n\n def hash(self) -> Optional[str]:\n \"\"\"Return the hash for the internal model used by this endpoint\n\n Returns:\n Optional[str]: The hash for the internal model used by this endpoint\n \"\"\"\n from sageworks.utils.endpoint_utils import get_model_data_url # Avoid circular import\n\n model_url = get_model_data_url(self.endpoint_config_name(), self.boto3_session)\n return get_s3_etag(model_url, self.boto3_session)\n\n def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return 
endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n\n def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Return the details\n return details\n\n def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. 
(default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.critical(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n\n def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # Sanity Check that we have a model\n model = ModelCore(self.get_input())\n if not model.exists():\n self.log.error(\"No model found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Now get the FeatureSet and make sure it exists\n fs = FeatureSetCore(model.get_input())\n if not fs.exists():\n self.log.error(\"No FeatureSet found for this endpoint. 
Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Grab the evaluation data from the FeatureSet\n table = fs.view(\"training\").table\n eval_df = fs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n capture_uuid = \"auto_inference\" if capture else None\n return self.inference(eval_df, capture_uuid, id_column=fs.id_column)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. 
Returning empty DataFrame.\")\n return prediction_df\n\n # Get the target column\n model = ModelCore(self.model_name)\n target_column = model.target()\n\n # Sanity Check that the target column is present\n if target_column and (target_column not in prediction_df.columns):\n self.log.important(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.important(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = model.model_type\n if model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # For other model types, we don't compute metrics\n self.log.important(f\"Model Type: {model_type} doesn't have metrics...\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n if not metrics.empty:\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(\n capture_uuid, prediction_df, target_column, model_type, metrics, description, id_column\n )\n\n # Return the prediction DataFrame\n return prediction_df\n\n def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... 
just FAST Inference!\n \"\"\"\n return fast_inference(self.uuid, eval_df, self.sm_session)\n\n def _predict(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Internal: Run prediction on the given observations in the given DataFrame\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n Returns:\n pd.DataFrame: Return the DataFrame with additional columns, prediction and any _proba columns\n \"\"\"\n\n # Sanity check: Does the DataFrame have 0 rows?\n if eval_df.empty:\n self.log.warning(\"Evaluation DataFrame has 0 rows. No predictions to run.\")\n return pd.DataFrame(columns=eval_df.columns) # Return empty DataFrame with same structure\n\n # Sanity check: Does the Model have Features?\n features = ModelCore(self.model_name).features()\n if not features:\n self.log.warning(\"Model does not have features defined, using all columns in the DataFrame\")\n else:\n # Sanity check: Does the DataFrame have the required features?\n df_columns_lower = set(col.lower() for col in eval_df.columns)\n features_lower = set(feature.lower() for feature in features)\n\n # Check if the features are a subset of the DataFrame columns (case-insensitive)\n if not features_lower.issubset(df_columns_lower):\n missing_features = features_lower - df_columns_lower\n raise ValueError(f\"DataFrame does not contain required features: {missing_features}\")\n\n # Create our Endpoint Predictor Class\n predictor = Predictor(\n self.endpoint_name,\n sagemaker_session=self.sm_session,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n )\n\n # Now split up the dataframe into 500 row chunks, send those chunks to our\n # endpoint (with error handling) and stitch all the chunks back together\n df_list = []\n for index in range(0, len(eval_df), 500):\n self.log.info(\"Processing...\")\n\n # Compute partial DataFrames, add them to a list, and concatenate at the end\n partial_df = self._endpoint_error_handling(predictor, eval_df[index : index + 
500])\n df_list.append(partial_df)\n\n # Concatenate the dataframes\n combined_df = pd.concat(df_list, ignore_index=True)\n\n # Convert data to numeric\n # Note: Since we're using CSV serializers numeric columns often get changed to generic 'object' types\n\n # Hard Conversion\n # Note: We explicitly catch exceptions for columns that cannot be converted to numeric\n converted_df = combined_df.copy()\n for column in combined_df.columns:\n try:\n converted_df[column] = pd.to_numeric(combined_df[column])\n except ValueError:\n # If a ValueError is raised, the column cannot be converted to numeric, so we keep it as is\n pass\n\n # Soft Conversion\n # Convert columns to the best possible dtype that supports the pd.NA missing value.\n converted_df = converted_df.convert_dtypes()\n\n # Report on any rows that failed\n failed_rows = converted_df[converted_df.isna().any(axis=1)]\n if not failed_rows.empty:\n self.log.warning(f\"Rows that failed:\\n{failed_rows}\")\n\n # Convert pd.NA placeholders to pd.NA\n # Note: CSV serialization converts pd.NA to blank strings, so we have to put in placeholders\n converted_df.replace(\"__NA__\", pd.NA, inplace=True)\n\n # Return the Dataframe\n return converted_df\n\n def _endpoint_error_handling(self, predictor, feature_df):\n \"\"\"Internal: Handles errors, retries, and binary search for problematic rows.\"\"\"\n\n # Convert DataFrame into a CSV buffer\n csv_buffer = StringIO()\n feature_df.to_csv(csv_buffer, index=False)\n\n try:\n # Send CSV buffer to the predictor and process results\n results = predictor.predict(csv_buffer.getvalue())\n results_df = pd.DataFrame.from_records(results[1:], columns=results[0])\n self.endpoint_return_columns = results_df.columns.tolist()\n return results_df\n\n except botocore.exceptions.ClientError as err:\n error_code = err.response[\"Error\"][\"Code\"]\n\n if error_code == \"ModelNotReadyException\":\n self.log.error(f\"Error {error_code}: {err.response.get('Message', 'No message')}\")\n 
self.log.error(\"Model not ready. Sleeping and retrying...\")\n time.sleep(60)\n return self._endpoint_error_handling(predictor, feature_df)\n\n elif error_code == \"ModelError\":\n self.log.warning(\"Model error. Bisecting the DataFrame and retrying...\")\n\n # Base case: If there is only one row, we can't binary search further\n if len(feature_df) == 1:\n if not self.endpoint_return_columns:\n raise\n return self._error_df(feature_df, self.endpoint_return_columns)\n\n # Binary search to find the problematic row(s)\n mid_point = len(feature_df) // 2\n first_half = self._endpoint_error_handling(predictor, feature_df.iloc[:mid_point])\n second_half = self._endpoint_error_handling(predictor, feature_df.iloc[mid_point:])\n return pd.concat([first_half, second_half], ignore_index=True)\n\n else:\n # Unknown ClientError, raise the exception\n self.log.critical(f\"Unexpected ClientError: {err}\")\n raise\n\n except Exception as err:\n self.log.critical(f\"Unexpected general error: {err}\")\n raise\n\n def _error_df(self, df, all_columns):\n \"\"\"Internal: Method to construct an Error DataFrame (a Pandas DataFrame with one row of NaNs)\"\"\"\n # Create a new dataframe with all NaNs\n error_df = pd.DataFrame(dict(zip(all_columns, [[np.NaN]] * len(self.endpoint_return_columns))))\n # Now set the original values for the incoming dataframe\n for column in df.columns:\n error_df[column] = df[column].values\n return error_df\n\n def _capture_inference_results(\n self,\n capture_uuid: str,\n pred_results_df: pd.DataFrame,\n target_column: str,\n model_type: ModelType,\n metrics: pd.DataFrame,\n description: str,\n id_column: str = None,\n ):\n \"\"\"Internal: Capture the inference results and metrics to S3\n\n Args:\n capture_uuid (str): UUID of the inference capture\n pred_results_df (pd.DataFrame): DataFrame with the prediction results\n target_column (str): Name of the target column\n model_type (ModelType): Type of the model (e.g. 
REGRESSOR, CLASSIFIER)\n metrics (pd.DataFrame): DataFrame with the performance metrics\n description (str): Description of the inference results\n id_column (str, optional): Name of the ID column (default=None)\n \"\"\"\n\n # Compute a dataframe hash (just use the last 8)\n data_hash = joblib.hash(pred_results_df)[:8]\n\n # Metadata for the model inference\n inference_meta = {\n \"name\": capture_uuid,\n \"data_hash\": data_hash,\n \"num_rows\": len(pred_results_df),\n \"description\": description,\n }\n\n # Create the S3 Path for the Inference Capture\n inference_capture_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Write the metadata dictionary and metrics to our S3 Model Inference Folder\n wr.s3.to_json(\n pd.DataFrame([inference_meta]),\n f\"{inference_capture_path}/inference_meta.json\",\n index=False,\n )\n self.log.info(f\"Writing metrics to {inference_capture_path}/inference_metrics.csv\")\n wr.s3.to_csv(metrics, f\"{inference_capture_path}/inference_metrics.csv\", index=False)\n\n # Grab the target column, prediction column, any _proba columns, and the ID column (if present)\n prediction_col = \"prediction\" if \"prediction\" in pred_results_df.columns else \"predictions\"\n output_columns = [target_column, prediction_col]\n\n # Add any _proba columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.endswith(\"_proba\")]\n\n # Add any quantile columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.startswith(\"q_\") or col.startswith(\"qr_\")]\n\n # Add the ID column\n if id_column and id_column in pred_results_df.columns:\n output_columns.append(id_column)\n\n # Write the predictions to our S3 Model Inference Folder\n self.log.info(f\"Writing predictions to {inference_capture_path}/inference_predictions.csv\")\n subset_df = pred_results_df[output_columns]\n wr.s3.to_csv(subset_df, f\"{inference_capture_path}/inference_predictions.csv\", index=False)\n\n # 
CLASSIFIER: Write the confusion matrix to our S3 Model Inference Folder\n if model_type == ModelType.CLASSIFIER:\n conf_mtx = self.generate_confusion_matrix(target_column, pred_results_df)\n self.log.info(f\"Writing confusion matrix to {inference_capture_path}/inference_cm.csv\")\n # Note: Unlike other dataframes here, we want to write the index (labels) to the CSV\n wr.s3.to_csv(conf_mtx, f\"{inference_capture_path}/inference_cm.csv\", index=True)\n\n # Generate SHAP values for our Prediction Dataframe\n generate_shap_values(self.endpoint_name, model_type.value, pred_results_df, inference_capture_path)\n\n # Now recompute the details for our Model\n self.log.important(f\"Recomputing Details for {self.model_name} to show latest Inference Results...\")\n model = ModelCore(self.model_name)\n model._load_inference_metrics(capture_uuid)\n model.details(recompute=True)\n\n # Recompute the details so that inference model metrics are updated\n self.log.important(f\"Recomputing Details for {self.uuid} to show latest Inference Results...\")\n self.details(recompute=True)\n\n def regression_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Sanity Check the prediction DataFrame\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. 
Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n\n def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check for classification scenario\n if not pd.api.types.is_numeric_dtype(y_true) or not pd.api.types.is_numeric_dtype(y_pred):\n self.log.warning(\"Target and Prediction columns are not numeric. 
Computing 'diffs'...\")\n prediction_df[\"residuals\"] = (y_true != y_pred).astype(int)\n prediction_df[\"residuals_abs\"] = prediction_df[\"residuals\"]\n else:\n # Compute numeric residuals for regression\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n\n return prediction_df\n\n @staticmethod\n def validate_proba_columns(prediction_df: pd.DataFrame, class_labels: list, guessing: bool = False):\n \"\"\"Ensure probability columns are correctly aligned with class labels\n\n Args:\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n class_labels (list): List of class labels\n guessing (bool, optional): Whether we're guessing the class labels. Defaults to False.\n \"\"\"\n proba_columns = [col.replace(\"_proba\", \"\") for col in prediction_df.columns if col.endswith(\"_proba\")]\n\n if sorted(class_labels) != sorted(proba_columns):\n if guessing:\n raise ValueError(f\"_proba columns {proba_columns} != GUESSED class_labels {class_labels}!\")\n else:\n raise ValueError(f\"_proba columns {proba_columns} != class_labels {class_labels}!\")\n\n def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n # Get the class labels from the model\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n self.log.warning(\n \"Class labels not found in the model. 
Guessing class labels from the prediction DataFrame.\"\n )\n class_labels = prediction_df[target_column].unique().tolist()\n self.validate_proba_columns(prediction_df, class_labels, guessing=True)\n else:\n self.validate_proba_columns(prediction_df, class_labels)\n\n # Calculate precision, recall, fscore, and support, handling zero division\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column],\n prediction_df[prediction_col],\n average=None,\n labels=class_labels,\n zero_division=0,\n )\n\n # Identify the probability columns and keep them as a Pandas DataFrame\n proba_columns = [f\"{label}_proba\" for label in class_labels]\n y_score = prediction_df[proba_columns]\n\n # One-hot encode the true labels using all class labels (fit with class_labels)\n encoder = OneHotEncoder(categories=[class_labels], sparse_output=False)\n y_true = encoder.fit_transform(prediction_df[[target_column]])\n\n # Calculate ROC AUC per label and handle exceptions for missing classes\n roc_auc_per_label = []\n for i, label in enumerate(class_labels):\n try:\n roc_auc = roc_auc_score(y_true[:, i], y_score.iloc[:, i])\n except ValueError as e:\n self.log.warning(f\"ROC AUC calculation failed for label {label}.\")\n self.log.warning(f\"{str(e)}\")\n roc_auc = 0.0\n roc_auc_per_label.append(roc_auc)\n\n # Put the scores into a DataFrame\n score_df = pd.DataFrame(\n {\n target_column: class_labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc_per_label,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n\n def generate_confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n 
prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check if our model has class labels, if not we'll use the unique labels in the prediction\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n class_labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix (sklearn confusion_matrix)\n conf_mtx = confusion_matrix(y_true, y_pred, labels=class_labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=class_labels, columns=class_labels)\n conf_mtx_df.index.name = \"labels\"\n\n # Check if our model has class labels. If so make the index and columns ordered\n model_class_labels = ModelCore(self.model_name).class_labels()\n if model_class_labels:\n self.log.important(\"Reordering the confusion matrix based on model class labels...\")\n conf_mtx_df.index = pd.Categorical(conf_mtx_df.index, categories=model_class_labels, ordered=True)\n conf_mtx_df.columns = pd.Categorical(conf_mtx_df.columns, categories=model_class_labels, ordered=True)\n conf_mtx_df = conf_mtx_df.sort_index().sort_index(axis=1)\n return conf_mtx_df\n\n def endpoint_config_name(self) -> str:\n # Grab the Endpoint Config Name from the AWS\n details = self.sm_client.describe_endpoint(EndpointName=self.endpoint_name)\n return details[\"EndpointConfigName\"]\n\n def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. 
Defaults to False.\n Note:\n We don't allow this to be used for Endpoints (unless force=True)\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete an Endpoint that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Endpoint\n EndpointCore.managed_delete(endpoint_name=self.uuid)\n\n @classmethod\n def managed_delete(cls, endpoint_name: str):\n \"\"\"Delete the Endpoint and associated resources if it exists\"\"\"\n\n # Check if the endpoint exists\n try:\n endpoint_info = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Endpoint {endpoint_name} not found!\")\n return\n raise # Re-raise unexpected errors\n\n # Delete underlying models (Endpoints store/use models internally)\n cls.delete_endpoint_models(endpoint_name)\n\n # Get Endpoint Config Name and delete if exists\n endpoint_config_name = endpoint_info[\"EndpointConfigName\"]\n try:\n cls.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n cls.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except ClientError:\n cls.log.info(f\"Endpoint Config {endpoint_config_name} not found...\")\n\n # Delete any monitoring schedules associated with the endpoint\n monitoring_schedules = cls.sm_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n \"MonitoringScheduleSummaries\"\n ]\n for schedule in 
monitoring_schedules:\n cls.log.info(f\"Deleting Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n cls.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete related S3 artifacts (inference, data capture, monitoring)\n endpoint_inference_path = cls.endpoints_s3_path + \"/inference/\" + endpoint_name\n endpoint_data_capture_path = cls.endpoints_s3_path + \"/data_capture/\" + endpoint_name\n endpoint_monitoring_path = cls.endpoints_s3_path + \"/monitoring/\" + endpoint_name\n for s3_path in [endpoint_inference_path, endpoint_data_capture_path, endpoint_monitoring_path]:\n s3_path = f\"{s3_path.rstrip('/')}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=cls.boto3_session)\n if objects:\n cls.log.info(f\"Deleting S3 Objects at {s3_path}...\")\n wr.s3.delete_objects(objects, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(endpoint_name)\n\n # Delete the endpoint\n time.sleep(2) # Allow AWS to catch up\n try:\n cls.log.info(f\"Deleting Endpoint {endpoint_name}...\")\n cls.sm_client.delete_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n cls.log.error(\"Error deleting endpoint.\")\n raise e\n\n time.sleep(5) # Final sleep for AWS to fully register deletions\n\n @classmethod\n def delete_endpoint_models(cls, endpoint_name: str):\n \"\"\"Delete the underlying Model for an Endpoint\n\n Args:\n endpoint_name (str): The name of the endpoint to delete\n \"\"\"\n\n # Grab the Endpoint Config Name from AWS\n endpoint_config_name = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointConfigName\"]\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = cls.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n cls.log.info(f\"Endpoint Config 
{endpoint_config_name} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n cls.log.info(f\"Deleting Internal Model {model_name}...\")\n try:\n cls.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n cls.log.warning(f\"Model {model_name} is still in use...\")\n else:\n cls.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.__init__","title":"__init__(endpoint_uuid, **kwargs)
","text":"EndpointCore Initialization
Parameters:
Name Type Description Default
endpoint_uuid str Name of Endpoint in SageWorks required
Source code in src/sageworks/core/artifacts/endpoint_core.py
def __init__(self, endpoint_uuid, **kwargs):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n self.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid, **kwargs)\n\n # Grab a Cloud Metadata object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n\n # We temporarily cache the endpoint metrics\n self.temp_storage = Cache(prefix=\"temp_storage\", expire=300) # 5 minutes\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.add_data_capture","title":"add_data_capture()
","text":"Add data capture to the endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the endpoint using FeatureSet data
Parameters:
Name Type Description Default
capture bool Capture the inference results and metrics (default=False) False
Source code in src/sageworks/core/artifacts/endpoint_core.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # Sanity Check that we have a model\n model = ModelCore(self.get_input())\n if not model.exists():\n self.log.error(\"No model found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Now get the FeatureSet and make sure it exists\n fs = FeatureSetCore(model.get_input())\n if not fs.exists():\n self.log.error(\"No FeatureSet found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Grab the evaluation data from the FeatureSet\n table = fs.view(\"training\").table\n eval_df = fs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n capture_uuid = \"auto_inference\" if capture else None\n return self.inference(eval_df, capture_uuid, id_column=fs.id_column)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code in src/sageworks/core/artifacts/endpoint_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.classification_metrics","title":"classification_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint
Parameters:
Name Type Description Default
target_column str Name of the target column required
prediction_df DataFrame DataFrame with the prediction results required
Returns:
Type Description
DataFrame pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n # Get the class labels from the model\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n self.log.warning(\n \"Class labels not found in the model. Guessing class labels from the prediction DataFrame.\"\n )\n class_labels = prediction_df[target_column].unique().tolist()\n self.validate_proba_columns(prediction_df, class_labels, guessing=True)\n else:\n self.validate_proba_columns(prediction_df, class_labels)\n\n # Calculate precision, recall, fscore, and support, handling zero division\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column],\n prediction_df[prediction_col],\n average=None,\n labels=class_labels,\n zero_division=0,\n )\n\n # Identify the probability columns and keep them as a Pandas DataFrame\n proba_columns = [f\"{label}_proba\" for label in class_labels]\n y_score = prediction_df[proba_columns]\n\n # One-hot encode the true labels using all class labels (fit with class_labels)\n encoder = OneHotEncoder(categories=[class_labels], sparse_output=False)\n y_true = encoder.fit_transform(prediction_df[[target_column]])\n\n # Calculate ROC AUC per label and handle exceptions for missing classes\n roc_auc_per_label = []\n for i, label in enumerate(class_labels):\n try:\n roc_auc = roc_auc_score(y_true[:, i], y_score.iloc[:, i])\n except ValueError as e:\n self.log.warning(f\"ROC AUC calculation failed for label {label}.\")\n self.log.warning(f\"{str(e)}\")\n roc_auc = 0.0\n roc_auc_per_label.append(roc_auc)\n\n # Put the scores into a DataFrame\n 
score_df = pd.DataFrame(\n {\n target_column: class_labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc_per_label,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/endpoint_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete","title":"delete()
","text":"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete an Endpoint that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Endpoint\n EndpointCore.managed_delete(endpoint_name=self.uuid)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete_endpoint_models","title":"delete_endpoint_models(endpoint_name)
classmethod
","text":"Delete the underlying Model for an Endpoint
Parameters:
Name Type Description Default
endpoint_name
str
The name of the endpoint to delete
required
Source code in src/sageworks/core/artifacts/endpoint_core.py
@classmethod\ndef delete_endpoint_models(cls, endpoint_name: str):\n \"\"\"Delete the underlying Model for an Endpoint\n\n Args:\n endpoint_name (str): The name of the endpoint to delete\n \"\"\"\n\n # Grab the Endpoint Config Name from AWS\n endpoint_config_name = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointConfigName\"]\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = cls.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n cls.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n cls.log.info(f\"Deleting Internal Model {model_name}...\")\n try:\n cls.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n cls.log.warning(f\"Model {model_name} is still in use...\")\n else:\n cls.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.details","title":"details(recompute=False)
","text":"Additional Details about this Endpoint Args: recompute (bool): Recompute the details (default: False) Returns: dict(dict): A dictionary of details about this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.endpoint_metrics","title":"endpoint_metrics()
","text":"Return the metrics for this endpoint
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)
Source code in src/sageworks/core/artifacts/endpoint_core.py
def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.exists","title":"exists()
","text":"Does the endpoint_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/endpoint_core.py
def exists(self) -> bool:\n \"\"\"Does the endpoint_name exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.fast_inference","title":"fast_inference(eval_df)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
Name Type Description Default
eval_df
DataFrame
The DataFrame to run predictions on
required
Returns:
Type Description
DataFrame
pd.DataFrame: The DataFrame with predictions
Note: There are no sanity checks or error handling... just FAST Inference!
Source code in src/sageworks/core/artifacts/endpoint_core.py
def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There are no sanity checks or error handling... just FAST Inference!\n \"\"\"\n return fast_inference(self.uuid, eval_df, self.sm_session)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.generate_confusion_matrix","title":"generate_confusion_matrix(target_column, prediction_df)
","text":"Compute the confusion matrix for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the confusion matrix
Source code in src/sageworks/core/artifacts/endpoint_core.py
def generate_confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check if our model has class labels, if not we'll use the unique labels in the prediction\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n class_labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix (sklearn confusion_matrix)\n conf_mtx = confusion_matrix(y_true, y_pred, labels=class_labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=class_labels, columns=class_labels)\n conf_mtx_df.index.name = \"labels\"\n\n # Check if our model has class labels. If so make the index and columns ordered\n model_class_labels = ModelCore(self.model_name).class_labels()\n if model_class_labels:\n self.log.important(\"Reordering the confusion matrix based on model class labels...\")\n conf_mtx_df.index = pd.Categorical(conf_mtx_df.index, categories=model_class_labels, ordered=True)\n conf_mtx_df.columns = pd.Categorical(conf_mtx_df.columns, categories=model_class_labels, ordered=True)\n conf_mtx_df = conf_mtx_df.sort_index().sort_index(axis=1)\n return conf_mtx_df\n
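The reordering step at the end of the method above can be tried in isolation. The following is only a minimal sketch, not the SageWorks implementation: the labels, the counts, and the model_order list are invented for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Alphabetical label order, as sorted() would produce when labels are guessed
labels = ['high', 'low', 'medium']
# Hypothetical class label order reported by the model
model_order = ['low', 'medium', 'high']

# A made-up 3x3 confusion matrix (rows = true labels, columns = predicted labels)
conf_mtx_df = pd.DataFrame([[5, 0, 1], [0, 7, 2], [1, 1, 6]], index=labels, columns=labels)
conf_mtx_df.index.name = 'labels'

# Ordered categoricals make sort_index() follow the model order instead of the alphabet
conf_mtx_df.index = pd.Categorical(conf_mtx_df.index, categories=model_order, ordered=True)
conf_mtx_df.columns = pd.Categorical(conf_mtx_df.columns, categories=model_order, ordered=True)
conf_mtx_df = conf_mtx_df.sort_index().sort_index(axis=1)

row_order = list(conf_mtx_df.index)
col_order = list(conf_mtx_df.columns)
```

After the two sorts, both axes follow model_order, so the diagonal still lines up true and predicted counts for the same label.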
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.get_monitor","title":"get_monitor()
","text":"Get the MonitorCore class for this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.hash","title":"hash()
","text":"Return the hash for the internal model used by this endpoint
Returns:
Type Description
Optional[str]
Optional[str]: The hash for the internal model used by this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def hash(self) -> Optional[str]:\n \"\"\"Return the hash for the internal model used by this endpoint\n\n Returns:\n Optional[str]: The hash for the internal model used by this endpoint\n \"\"\"\n from sageworks.utils.endpoint_utils import get_model_data_url # Avoid circular import\n\n model_url = get_model_data_url(self.endpoint_config_name(), self.boto3_session)\n return get_s3_etag(model_url, self.boto3_session)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.health_check","title":"health_check()
","text":"Perform a health check on this endpoint
Returns:
Type Description
list[str]
list[str]: List of health issues
Source code in src/sageworks/core/artifacts/endpoint_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this endpoint\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # Does this endpoint have a config?\n # Note: This is not an authoritative check, so improve later\n if self.endpoint_meta.get(\"ProductionVariants\") is None:\n health_issues.append(\"no_config\")\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference and compute performance metrics with optional capture
Parameters:
Name Type Description Default
eval_df
DataFrame
DataFrame to run predictions on (must have superset of features)
required
capture_uuid
str
UUID of the inference capture (default=None)
None
id_column
str
Name of the ID column (default=None)
None
Returns:
Type Description
DataFrame
pd.DataFrame: DataFrame with the inference results
Note: If capture_uuid is provided, inference/performance metrics are written to the S3 Endpoint Inference Folder
Source code in src/sageworks/core/artifacts/endpoint_core.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. Returning empty DataFrame.\")\n return prediction_df\n\n # Get the target column\n model = ModelCore(self.model_name)\n target_column = model.target()\n\n # Sanity Check that the target column is present\n if target_column and (target_column not in prediction_df.columns):\n self.log.important(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.important(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = model.model_type\n if model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # For other model types, we don't compute metrics\n self.log.important(f\"Model Type: {model_type} doesn't have metrics...\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n if not metrics.empty:\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and 
metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(\n capture_uuid, prediction_df, target_column, model_type, metrics, description, id_column\n )\n\n # Return the prediction DataFrame\n return prediction_df\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.is_serverless","title":"is_serverless()
","text":"Check if the current endpoint is serverless.
Returns:
Name Type Description
bool
bool
True if the endpoint is serverless, False otherwise.
Source code in src/sageworks/core/artifacts/endpoint_core.py
def is_serverless(self) -> bool:\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.managed_delete","title":"managed_delete(endpoint_name)
classmethod
","text":"Delete the Endpoint and associated resources if it exists
Source code in src/sageworks/core/artifacts/endpoint_core.py
@classmethod\ndef managed_delete(cls, endpoint_name: str):\n \"\"\"Delete the Endpoint and associated resources if it exists\"\"\"\n\n # Check if the endpoint exists\n try:\n endpoint_info = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Endpoint {endpoint_name} not found!\")\n return\n raise # Re-raise unexpected errors\n\n # Delete underlying models (Endpoints store/use models internally)\n cls.delete_endpoint_models(endpoint_name)\n\n # Get Endpoint Config Name and delete if exists\n endpoint_config_name = endpoint_info[\"EndpointConfigName\"]\n try:\n cls.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n cls.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except ClientError:\n cls.log.info(f\"Endpoint Config {endpoint_config_name} not found...\")\n\n # Delete any monitoring schedules associated with the endpoint\n monitoring_schedules = cls.sm_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n \"MonitoringScheduleSummaries\"\n ]\n for schedule in monitoring_schedules:\n cls.log.info(f\"Deleting Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n cls.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete related S3 artifacts (inference, data capture, monitoring)\n endpoint_inference_path = cls.endpoints_s3_path + \"/inference/\" + endpoint_name\n endpoint_data_capture_path = cls.endpoints_s3_path + \"/data_capture/\" + endpoint_name\n endpoint_monitoring_path = cls.endpoints_s3_path + \"/monitoring/\" + endpoint_name\n for s3_path in [endpoint_inference_path, endpoint_data_capture_path, endpoint_monitoring_path]:\n s3_path = f\"{s3_path.rstrip('/')}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=cls.boto3_session)\n if objects:\n cls.log.info(f\"Deleting S3 Objects at 
{s3_path}...\")\n wr.s3.delete_objects(objects, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(endpoint_name)\n\n # Delete the endpoint\n time.sleep(2) # Allow AWS to catch up\n try:\n cls.log.info(f\"Deleting Endpoint {endpoint_name}...\")\n cls.sm_client.delete_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n cls.log.error(\"Error deleting endpoint.\")\n raise e\n\n time.sleep(5) # Final sleep for AWS to fully register deletions\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/endpoint_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard","title":"onboard(interactive=False)
","text":"This is a BLOCKING method that will onboard the Endpoint (make it ready) Args: interactive (bool, optional): If True, will prompt the user for information. (default: False) Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.critical(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard_with_args","title":"onboard_with_args(input_model)
","text":"Onboard the Endpoint with the given arguments
Parameters:
Name Type Description Default
input_model
str
The input model for this endpoint
required
Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/endpoint_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.regression_metrics","title":"regression_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
def regression_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Sanity Check the prediction DataFrame\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n
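The MAPE line in the method above switches to the plain absolute error wherever the target is zero. Here is a small worked example of that rule, written as a sketch rather than as the library code: the function name mape_with_zero_fallback and the sample numbers are made up, and the extra guard on the divisor is an addition of this sketch so that zero targets never reach the division.

```python
import numpy as np


def mape_with_zero_fallback(y_true, y_pred):
    # Percentage error where the target is non-zero, plain absolute error where it is zero
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    # Replace zero targets with 1.0 in the divisor only, so no division by zero occurs
    safe_true = np.where(y_true != 0, np.abs(y_true), 1.0)
    per_row = np.where(y_true != 0, abs_err / safe_true, abs_err)
    return float(np.mean(per_row) * 100)


# Rows contribute 10/100 = 0.1, |0 - 0.5| = 0.5, and 5/50 = 0.1
mape = mape_with_zero_fallback([100.0, 0.0, 50.0], [110.0, 0.5, 45.0])
```

The three per-row values average to 0.7 / 3, so the result is roughly 23.33 (percent). Note that the zero-target row contributes an absolute error, not a percentage, which can dominate the mean when targets are small.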
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.residuals","title":"residuals(target_column, prediction_df)
","text":"Add the residuals to the prediction DataFrame Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'
Source code in src/sageworks/core/artifacts/endpoint_core.py
def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check for classification scenario\n if not pd.api.types.is_numeric_dtype(y_true) or not pd.api.types.is_numeric_dtype(y_pred):\n self.log.warning(\"Target and Prediction columns are not numeric. Computing 'diffs'...\")\n prediction_df[\"residuals\"] = (y_true != y_pred).astype(int)\n prediction_df[\"residuals_abs\"] = prediction_df[\"residuals\"]\n else:\n # Compute numeric residuals for regression\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n\n return prediction_df\n
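To make the two branches above concrete, here is a short sketch on toy data. It is a standalone illustration under assumptions: add_residuals is a hypothetical helper name, the column names and values are invented, and pandas and numpy are assumed to be installed.

```python
import numpy as np
import pandas as pd


def add_residuals(df, target_column, prediction_col='prediction'):
    # Numeric columns get signed and absolute residuals; anything else gets 0/1 mismatch flags
    y_true, y_pred = df[target_column], df[prediction_col]
    numeric = pd.api.types.is_numeric_dtype(y_true) and pd.api.types.is_numeric_dtype(y_pred)
    if numeric:
        df['residuals'] = y_true - y_pred
        df['residuals_abs'] = np.abs(df['residuals'])
    else:
        df['residuals'] = (y_true != y_pred).astype(int)
        df['residuals_abs'] = df['residuals']
    return df


# Regression-style data: residual = target - prediction
reg_df = add_residuals(pd.DataFrame({'target': [1.0, 2.0, 3.0], 'prediction': [1.5, 2.0, 2.0]}), 'target')

# Classification-style data: 1 marks a wrong prediction, 0 a correct one
cls_df = add_residuals(pd.DataFrame({'target': ['a', 'b', 'a'], 'prediction': ['a', 'a', 'a']}), 'target')
```

Like the method above, the helper modifies the DataFrame it receives, so pass a copy if the original must stay unchanged.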
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Default
input
str
Name of input for this artifact
required
force
bool
Force the input to be set. Defaults to False.
False
Note: Endpoints do not allow manual override of the input unless force=True
Source code in src/sageworks/core/artifacts/endpoint_core.py
def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. Defaults to False.\n Note:\n Endpoints do not allow manual override of the input unless force=True\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.validate_proba_columns","title":"validate_proba_columns(prediction_df, class_labels, guessing=False)
staticmethod
","text":"Ensure probability columns are correctly aligned with class labels
Parameters:
Name Type Description Default
prediction_df
DataFrame
DataFrame with the prediction results
required
class_labels
list
List of class labels
required
guessing
bool
Whether we're guessing the class labels. Defaults to False.
False
Source code in src/sageworks/core/artifacts/endpoint_core.py
@staticmethod\ndef validate_proba_columns(prediction_df: pd.DataFrame, class_labels: list, guessing: bool = False):\n \"\"\"Ensure probability columns are correctly aligned with class labels\n\n Args:\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n class_labels (list): List of class labels\n guessing (bool, optional): Whether we're guessing the class labels. Defaults to False.\n \"\"\"\n proba_columns = [col.replace(\"_proba\", \"\") for col in prediction_df.columns if col.endswith(\"_proba\")]\n\n if sorted(class_labels) != sorted(proba_columns):\n if guessing:\n raise ValueError(f\"_proba columns {proba_columns} != GUESSED class_labels {class_labels}!\")\n else:\n raise ValueError(f\"_proba columns {proba_columns} != class_labels {class_labels}!\")\n
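The check above boils down to comparing two sorted lists of names. A minimal sketch of the same idea on plain column names follows; check_proba_columns and the sample columns are hypothetical and only mirror the logic, they are not part of the SageWorks API.

```python
def check_proba_columns(columns, class_labels):
    # Strip the '_proba' suffix from every probability column name
    proba_labels = [col.replace('_proba', '') for col in columns if col.endswith('_proba')]
    # Order does not matter, but the two sets of names must match exactly
    if sorted(class_labels) != sorted(proba_labels):
        raise ValueError(f'_proba columns {proba_labels} != class_labels {class_labels}!')
    return True


columns = ['id', 'prediction', 'low_proba', 'medium_proba', 'high_proba']

# Matching labels pass the check
ok = check_proba_columns(columns, ['low', 'medium', 'high'])

# A missing label is reported as a ValueError
try:
    check_proba_columns(columns, ['low', 'high'])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Because both sides are sorted together, all labels need to be of a mutually comparable type (for example all strings).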
"},{"location":"core_classes/artifacts/feature_set_core/","title":"FeatureSetCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the FeatureSet API Class and voil\u00e0 it works the same.
FeatureSet: SageWorks Feature Set accessible through Athena
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore","title":"FeatureSetCore
","text":" Bases: Artifact
FeatureSetCore: SageWorks FeatureSetCore Class
Common Usage
my_features = FeatureSetCore(feature_uuid)\nmy_features.summary()\nmy_features.details()\n
Source code in src/sageworks/core/artifacts/feature_set_core.py
class FeatureSetCore(Artifact):\n \"\"\"FeatureSetCore: SageWorks FeatureSetCore Class\n\n Common Usage:\n ```python\n my_features = FeatureSetCore(feature_uuid)\n my_features.summary()\n my_features.details()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, **kwargs):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.is_name_valid(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid, **kwargs)\n\n # Get our FeatureSet metadata\n self.feature_meta = self.meta.feature_set(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.warning(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.id_column = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_table = self.feature_meta[\"sageworks_meta\"][\"athena_table\"]\n self.athena_database = self.feature_meta[\"sageworks_meta\"][\"athena_database\"]\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}...\")\n\n @property\n def table(self) -> str:\n \"\"\"Get the base table name for this FeatureSet\"\"\"\n return self.data_source.table\n\n def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n\n 
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this FeatureSet\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n\n def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n\n @property\n def columns(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n\n @property\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the Feature Set\"\"\"\n return list(self.column_details().values())\n\n def column_details(self) -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in 
self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details()\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self.data_source)\n\n def view(self, view_name: str) -> \"View\":\n \"\"\"Return a DataFrame for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n pd.DataFrame: A DataFrame for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n\n def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n\n def set_computation_columns(self, computation_columns: list[str], reset_display: bool = True):\n \"\"\"Set the computation columns for this FeatureSet\n\n Args:\n computation_columns (list[str]): The computation columns for this FeatureSet\n reset_display (bool): Also reset the display columns to match (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n 
# Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n self.recompute_stats()\n\n # Reset the display columns to match the computation columns\n if reset_display:\n self.set_display_columns(computation_columns)\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.columns)\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n\n def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n sageworks_details = self.data_source.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n\n def hash(self) -> str:\n \"\"\"Return the hash for the set of Parquet files for this artifact\"\"\"\n return self.data_source.hash()\n\n def table_hash(self) -> str:\n \"\"\"Return the hash for the Athena table\"\"\"\n return self.data_source.table_hash()\n\n def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n\n def 
get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n\n def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Make the query\n table = self.view(\"training\").table\n query = f'SELECT * FROM \"{table}\"'\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n\n def get_training_data(self) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n from sageworks.core.views.view import View\n\n return View(self, \"training\").pull_dataframe()\n\n def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The 
Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.id_column} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n self.log.info(f\"Computing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Return the 
details data\n return details\n\n def delete(self):\n \"\"\"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an FeatureSet that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the FeatureSet\n FeatureSetCore.managed_delete(self.uuid)\n\n @classmethod\n def managed_delete(cls, feature_set_name: str):\n \"\"\"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\n\n Args:\n feature_set_name (str): The Name of the FeatureSet to delete\n \"\"\"\n\n # See if the FeatureSet exists\n try:\n response = cls.sm_client.describe_feature_group(FeatureGroupName=feature_set_name)\n except cls.sm_client.exceptions.ResourceNotFound:\n cls.log.info(f\"FeatureSet {feature_set_name} not found!\")\n return\n\n # Extract database and table information from the response\n offline_config = response.get(\"OfflineStoreConfig\", {})\n database = offline_config.get(\"DataCatalogConfig\", {}).get(\"Database\")\n offline_table = offline_config.get(\"DataCatalogConfig\", {}).get(\"TableName\")\n data_source_uuid = offline_table # Our offline storage IS a DataSource\n\n # Delete the Feature Group and ensure that it gets deleted\n cls.log.important(f\"Deleting FeatureSet {feature_set_name}...\")\n remove_fg = cls.aws_feature_group_delete(feature_set_name)\n cls.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n AthenaSource.managed_delete(data_source_uuid, database=database)\n\n # Delete any views associated with this FeatureSet\n cls.delete_views(offline_table, database)\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = cls.feature_sets_s3_path + f\"/{feature_set_name}/\"\n cls.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n 
wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(feature_set_name)\n\n @classmethod\n @aws_throttle\n def aws_feature_group_delete(cls, feature_set_name):\n remove_fg = FeatureGroup(name=feature_set_name, sagemaker_session=cls.sm_session)\n remove_fg.delete()\n return remove_fg\n\n @classmethod\n def ensure_feature_group_deleted(cls, feature_group):\n status = \"Deleting\"\n while status == \"Deleting\":\n cls.log.debug(\"FeatureSet being Deleted...\")\n try:\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n break\n else:\n raise error\n time.sleep(1)\n cls.log.info(f\"FeatureSet {feature_group.name} successfully deleted\")\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for the training view for this FeatureSet\n\n Args:\n id_column (str): The name of the id column.\n holdout_ids (list[str]): The list of holdout ids.\n \"\"\"\n from sageworks.core.views import TrainingView\n\n # Create a NEW training view\n self.log.important(f\"Setting Training Holdouts: {len(holdout_ids)} ids...\")\n TrainingView.create(self, id_column=id_column, holdout_ids=holdout_ids)\n\n @classmethod\n def delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n\n def descriptive_stats(self, recompute: bool = False) -> 
dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n\n def smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n\n Args:\n recompute (bool): Recompute the smart sample (default=False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample(recompute=recompute)\n\n def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = 
np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n\n def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n\n def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n\n def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? 
Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the FeatureSet\"\"\"\n\n # Call our underlying DataSource recompute stats method\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n self.data_source.recompute_stats()\n self.details(recompute=True)\n return True\n
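The metadata check in ready(), near the end of the class source above, comes down to a set comparison. The sketch below isolates that step. The helper name has_expected_meta and the metadata keys are hypothetical and chosen only for illustration; the real expected keys come from expected_meta().

```python
def has_expected_meta(existing_meta: dict, expected_meta: list) -> bool:
    # True only when every expected key is present in the existing metadata
    return set(existing_meta.keys()).issuperset(expected_meta)


# Hypothetical keys, for illustration only
expected = ['sageworks_status', 'sageworks_input']
complete = {'sageworks_status': 'ready', 'sageworks_input': 'abalone_data', 'extra': 1}
partial = {'sageworks_status': 'ready'}

print(has_expected_meta(complete, expected))
print(has_expected_meta(partial, expected))
```

Extra keys do not affect the result; only a missing expected key makes the check fail.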
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_types","title":"column_types: list[str]
property
","text":"Return the column types of the Feature Set
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.columns","title":"columns: list[str]
property
","text":"Return the column names of the Feature Set
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.table","title":"table: str
property
","text":"Get the base table name for this FeatureSet
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.__init__","title":"__init__(feature_set_uuid, **kwargs)
","text":"FeatureSetCore Initialization
Parameters:
Name | Type | Description | Default
feature_set_uuid | str | Name of Feature Set | required
Source code in src/sageworks/core/artifacts/feature_set_core.py
def __init__(self, feature_set_uuid: str, **kwargs):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.is_name_valid(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid, **kwargs)\n\n # Get our FeatureSet metadata\n self.feature_meta = self.meta.feature_set(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.warning(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.id_column = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_table = self.feature_meta[\"sageworks_meta\"][\"athena_table\"]\n self.athena_database = self.feature_meta[\"sageworks_meta\"][\"athena_database\"]\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}...\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.anomalies","title":"anomalies()
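","text":"Get a set of anomalous data from the underlying DataSource. Note: this is currently mocked (see the FIXME in the source); it returns a copy of sample() with random anomaly_score, cluster, x, and y columns.
Returns: pd.DataFrame: A dataframe of anomalies from the underlying DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py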
def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying the underlying data source
Source code in src/sageworks/core/artifacts/feature_set_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n sageworks_details = self.data_source.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_details","title":"column_details()
","text":"Return the column details of the Feature Set
Returns:
Name | Type | Description
dict | dict | The column details of the Feature Set
Notes: We can't just call self.data_source.column_details() because FeatureSets have their own feature types, so we overlay that type information on top of the DataSource type information.
Source code in src/sageworks/core/artifacts/feature_set_core.py
def column_details(self) -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details()\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n
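The type overlay in column_details() can be illustrated without AWS. This is a minimal sketch, assuming metadata shaped like the FeatureDefinitions list in the source; the column names, the Athena-style types, and the helper name overlay_types are hypothetical. Unlike the method above, the helper returns a new dict rather than updating ds_details in place.

```python
def overlay_types(feature_definitions: list, ds_details: dict) -> dict:
    # Map feature name -> FeatureSet type (same comprehension as the source)
    fs_details = {item['FeatureName']: item['FeatureType'] for item in feature_definitions}
    # Prefer the FeatureSet type; fall back to the DataSource type
    return {col: fs_details.get(col, dtype) for col, dtype in ds_details.items()}


# Hypothetical inputs, for illustration only
feature_definitions = [
    {'FeatureName': 'id', 'FeatureType': 'Integral'},
    {'FeatureName': 'price', 'FeatureType': 'Fractional'},
]
ds_details = {'id': 'bigint', 'price': 'double', 'write_time': 'timestamp'}

merged = overlay_types(feature_definitions, ds_details)
print(merged)
```

Columns with no feature definition, such as write_time here, keep their DataSource type.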
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in the FeatureSet's underlying DataSource.
Args: recompute (bool): Recompute the column stats (default: False)
Returns: dict(dict): A dictionary of stats for each column in the format below. Each column entry also gets an fs_dtype key holding its FeatureSet type ('unknown' if the column is not in column_details()).
NB: String columns will NOT have num_zeros and descriptive_stats
{'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code in src/sageworks/core/artifacts/feature_set_core.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSet's underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column in this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n
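To make the shape of the result concrete, the sketch below applies the same fs_dtype annotation to a small hand-written stats dictionary. The helper name add_fs_dtypes and the sample values are hypothetical; in the real method both inputs come from the DataSource and from column_details().

```python
def add_fs_dtypes(ds_column_stats: dict, fs_type_mapper: dict) -> dict:
    # Annotate each column's stats with its FeatureSet type (in place)
    for col, details in ds_column_stats.items():
        details['fs_dtype'] = fs_type_mapper.get(col, 'unknown')
    return ds_column_stats


# Hypothetical inputs, for illustration only
stats = {
    'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},
    'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100},
}
type_mapper = {'col1': 'String'}

annotated = add_fs_dtypes(stats, type_mapper)
print(annotated['col1']['fs_dtype'])
print(annotated['col2']['fs_dtype'])
```

Here col2 has no entry in the mapper, so it falls back to 'unknown'.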
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.correlations","title":"correlations(recompute=False)
","text":"Get the correlations for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the correlations (default=False) Returns: dict: A dictionary of correlations for the numeric columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the correlations (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_s3_training_data","title":"create_s3_training_data()
","text":"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want additional options/features use the get_feature_store() method and see AWS docs for all the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html Returns: str: The full path/file for the CSV file created by Feature Store create_dataset()
Source code in src/sageworks/core/artifacts/feature_set_core.py
def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Make the query\n table = self.view(\"training\").table\n query = f'SELECT * FROM \"{table}\"'\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n
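The returned CSV path is assembled from three pieces: the feature-set S3 prefix, a UTC timestamp, and the Athena query execution id. The sketch below reproduces only that string construction. The helper name training_data_path, the bucket prefix, the FeatureSet name, and the execution id are hypothetical; the optional now argument is added here only so the result is reproducible.

```python
from datetime import datetime, timezone


def training_data_path(base_path: str, uuid: str, query_execution_id: str, now: datetime = None) -> str:
    # Timestamp the output folder in UTC, using the same format as the source
    now = now or datetime.now(timezone.utc)
    date_time = now.strftime('%Y-%m-%d_%H:%M:%S')
    s3_output_path = f'{base_path}/{uuid}/datasets/all_{date_time}'
    # Athena names the result file after the query execution id
    return f'{s3_output_path}/{query_execution_id}.csv'


# Hypothetical values, for illustration only
fixed_time = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
path = training_data_path('s3://my-bucket/feature-sets', 'abalone_features', 'abc-123', now=fixed_time)
print(path)
```

Because the timestamp is part of the prefix, each call writes to a new location rather than overwriting an earlier dataset.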
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/feature_set_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete","title":"delete()
","text":"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Source code in src/sageworks/core/artifacts/feature_set_core.py
def delete(self):\n \"\"\"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n # Make sure the FeatureSet exists\n if not self.exists():\n self.log.warning(f\"Trying to delete a FeatureSet that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the FeatureSet\n FeatureSetCore.managed_delete(self.uuid)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete_views","title":"delete_views(table, database)
classmethod
","text":"Delete any views associated with this FeatureSet
Parameters:
Name | Type | Description | Default
table | str | Name of Athena Table | required
database | str | Athena Database Name | required
Source code in src/sageworks/core/artifacts/feature_set_core.py
@classmethod\ndef delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Get the descriptive stats for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the descriptive stats (default=False) Returns: dict: A dictionary of descriptive stats for the numeric columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.details","title":"details(recompute=False)
","text":"Additional Details about this FeatureSet Artifact
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the details (default: False) | False
Returns:
Name | Type | Description
dict | dict | A dictionary of details about this FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n self.log.info(f\"Computing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Return the details data\n return details\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/feature_set_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_data_source","title":"get_data_source()
","text":"Return the underlying DataSource object
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_feature_store","title":"get_feature_store()
","text":"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage with create_dataset(), such as joins, time ranges, and a host of other options. See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data","title":"get_training_data()
","text":"Get the training data for this FeatureSet
Returns:
pd.DataFrame: The training data for this FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_training_data(self) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n from sageworks.core.views.view import View\n\n return View(self, \"training\").pull_dataframe()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.hash","title":"hash()
","text":"Return the hash for the set of Parquet files for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def hash(self) -> str:\n \"\"\"Return the hash for the set of Parquet files for this artifact\"\"\"\n return self.data_source.hash()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.health_check","title":"health_check()
","text":"Perform a health check on this FeatureSet
Returns:
list[str]: List of health issues
Source code in src/sageworks/core/artifacts/feature_set_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this FeatureSet\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.managed_delete","title":"managed_delete(feature_set_name)
classmethod
","text":"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Parameters:
feature_set_name (str): The name of the FeatureSet to delete (required)
Source code in src/sageworks/core/artifacts/feature_set_core.py
@classmethod\ndef managed_delete(cls, feature_set_name: str):\n \"\"\"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\n\n Args:\n feature_set_name (str): The Name of the FeatureSet to delete\n \"\"\"\n\n # See if the FeatureSet exists\n try:\n response = cls.sm_client.describe_feature_group(FeatureGroupName=feature_set_name)\n except cls.sm_client.exceptions.ResourceNotFound:\n cls.log.info(f\"FeatureSet {feature_set_name} not found!\")\n return\n\n # Extract database and table information from the response\n offline_config = response.get(\"OfflineStoreConfig\", {})\n database = offline_config.get(\"DataCatalogConfig\", {}).get(\"Database\")\n offline_table = offline_config.get(\"DataCatalogConfig\", {}).get(\"TableName\")\n data_source_uuid = offline_table # Our offline storage IS a DataSource\n\n # Delete the Feature Group and ensure that it gets deleted\n cls.log.important(f\"Deleting FeatureSet {feature_set_name}...\")\n remove_fg = cls.aws_feature_group_delete(feature_set_name)\n cls.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n AthenaSource.managed_delete(data_source_uuid, database=database)\n\n # Delete any views associated with this FeatureSet\n cls.delete_views(offline_table, database)\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = cls.feature_sets_s3_path + f\"/{feature_set_name}/\"\n cls.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(feature_set_name)\n
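The database and table lookup in managed_delete() relies on chained `.get()` calls with empty-dict defaults, so a response without an offline store yields None values instead of raising a KeyError. A minimal sketch of that pattern follows; `extract_offline_table` and `sample_response` are hypothetical names with made-up values, and only the key names come from the code above.

```python
# Sketch only: sample_response is a made-up stand-in for the
# describe_feature_group() response; only the key names mirror the code above.
def extract_offline_table(response: dict) -> tuple:
    # Chained .get() calls with {} defaults never raise on missing keys
    offline_config = response.get('OfflineStoreConfig', {})
    catalog = offline_config.get('DataCatalogConfig', {})
    return catalog.get('Database'), catalog.get('TableName')


sample_response = {
    'OfflineStoreConfig': {
        'DataCatalogConfig': {'Database': 'example_db', 'TableName': 'example_table'}
    }
}
database, table = extract_offline_table(sample_response)
missing = extract_offline_table({})  # no offline store configured
```

Note that when the keys are missing, both values come back as None, so a caller would probably want to check them before passing them on to a delete call.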
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/feature_set_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to get this from AWS Metadata, so we return the creation time\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_columns","title":"num_columns()
","text":"Return the number of columns of the Feature Set
Source code in src/sageworks/core/artifacts/feature_set_core.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_rows","title":"num_rows()
","text":"Return the number of rows of the internal DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the FeatureSet (make it ready)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource
Parameters:
scale (float): The scale to use for the IQR (default: 1.5)
recompute (bool): Recompute the outliers (default: False)
Returns:
pd.DataFrame: A DataFrame of outliers from this DataSource
Notes:
Uses the IQR * 1.5 (~= 2.7 Sigma) method to compute outliers. The scale parameter can be adjusted to change the IQR multiplier.
Source code in src/sageworks/core/artifacts/feature_set_core.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.7 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n
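The IQR rule named in the notes can be illustrated on a plain list of numbers. This is a minimal sketch of the general technique, not the DataSource implementation (which is not shown here); `iqr_outliers` and `readings` are hypothetical names, and the quartiles come from the standard library with its default method.

```python
from statistics import quantiles


def iqr_outliers(values: list, scale: float = 1.5) -> list:
    # Quartile cut points: q1, median, q3 (default 'exclusive' method)
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - scale * iqr, q3 + scale * iqr
    # Anything outside [lower, upper] is flagged as an outlier
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 100]
flagged = iqr_outliers(readings)            # default 1.5 * IQR fences
relaxed = iqr_outliers(readings, scale=50)  # much wider fences
```

With the default scale only the value 100 falls outside the fences; raising `scale` widens the fences until nothing is flagged, which is the effect the scale parameter is described as having.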
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.query","title":"query(query, overwrite=True)
","text":"Query the internal DataSource
Parameters:
query (str): The query to run against the DataSource (required)
overwrite (bool): Overwrite the table name in the query (default: True)
Returns:
pd.DataFrame: The results of the query
Source code in src/sageworks/core/artifacts/feature_set_core.py
def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n
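The overwrite step is a plain string replacement that requires a space on both sides of the FeatureSet name. The sketch below isolates that one call to show what it does and does not match; `overwrite_table_name` and the table names are hypothetical, and only the replace() expression is taken from the code above.

```python
# Sketch only: the names below are hypothetical; the replace() call mirrors
# the one in query() above.
def overwrite_table_name(query: str, uuid: str, athena_table: str) -> str:
    # Only occurrences with a space on BOTH sides are rewritten
    return query.replace(' ' + uuid + ' ', ' ' + athena_table + ' ')


rewritten = overwrite_table_name(
    'SELECT * FROM my_features WHERE x > 1', 'my_features', 'my_features_table'
)
untouched = overwrite_table_name(
    'SELECT * FROM my_features', 'my_features', 'my_features_table'
)
```

As written, a name at the very end of the query (no trailing space), or one wrapped in quotes, appears to be left unchanged, so queries may need a trailing clause or space for the overwrite to apply.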
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.ready","title":"ready()
","text":"Is the FeatureSet ready? Is initial setup complete and expected metadata populated? Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to check both to see if the FeatureSet is ready.
Source code in src/sageworks/core/artifacts/feature_set_core.py
def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n
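The first half of ready() is a set comparison: every expected metadata key must be present, and extra keys are ignored. A small sketch of that check, assuming hypothetical key lists (the real return value of expected_meta() is not shown in this section):

```python
# Sketch only: the expected keys here are illustrative assumptions.
def has_expected_meta(existing_meta: dict, expected_keys: list) -> bool:
    # True when every expected key is present; extra keys are fine
    return set(existing_meta.keys()).issuperset(expected_keys)


expected = ['sageworks_status', 'sageworks_input']
complete = has_expected_meta(
    {'sageworks_status': 'ready', 'sageworks_input': 'abc', 'extra': 1}, expected
)
partial = has_expected_meta({'sageworks_status': 'ready'}, expected)
```

Only key presence is tested, so a key holding an empty or stale value would still count as present under this check.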
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.recompute_stats","title":"recompute_stats()
","text":"This is a BLOCKING method that will recompute the stats for the FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the FeatureSet\"\"\"\n\n # Call our underlying DataSource recompute stats method\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n self.data_source.recompute_stats()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.refresh_meta","title":"refresh_meta()
","text":"Internal: Refresh our internal AWS Feature Store metadata
Source code in src/sageworks/core/artifacts/feature_set_core.py
def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.sample","title":"sample(recompute=False)
","text":"Get a sample of the data from the underlying DataSource
Parameters:
recompute (bool): Recompute the sample (default=False)
Returns:
pd.DataFrame: A sample of the data from the underlying DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_computation_columns","title":"set_computation_columns(computation_columns, reset_display=True)
","text":"Set the computation columns for this FeatureSet
Parameters:
computation_columns (list[str]): The computation columns for this FeatureSet (required)
reset_display (bool): Also reset the display columns to match (default: True)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_computation_columns(self, computation_columns: list[str], reset_display: bool = True):\n \"\"\"Set the computation columns for this FeatureSet\n\n Args:\n computation_columns (list[str]): The computation columns for this FeatureSet\n reset_display (bool): Also reset the display columns to match (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n self.recompute_stats()\n\n # Reset the display columns to match the computation columns\n if reset_display:\n self.set_display_columns(computation_columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_display_columns","title":"set_display_columns(diplay_columns)
","text":"Set the display columns for this Data Source
Parameters:
diplay_columns (list[str]): The display columns for this Data Source (required)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_training_holdouts","title":"set_training_holdouts(id_column, holdout_ids)
","text":"Set the hold out ids for the training view for this FeatureSet
Parameters:
id_column (str): The name of the id column (required)
holdout_ids (list[str]): The list of holdout ids (required)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for the training view for this FeatureSet\n\n Args:\n id_column (str): The name of the id column.\n holdout_ids (list[str]): The list of holdout ids.\n \"\"\"\n from sageworks.core.views import TrainingView\n\n # Create a NEW training view\n self.log.important(f\"Setting Training Holdouts: {len(holdout_ids)} ids...\")\n TrainingView.create(self, id_column=id_column, holdout_ids=holdout_ids)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.size","title":"size()
","text":"Return the size of the internal DataSource in MegaBytes
Source code in src/sageworks/core/artifacts/feature_set_core.py
def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.smart_sample","title":"smart_sample(recompute=False)
","text":"Get a SMART sample dataframe from this FeatureSet
Parameters:
recompute (bool): Recompute the smart sample (default=False)
Returns:
pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/feature_set_core.py
def smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n\n Args:\n recompute (bool): Recompute the smart sample (default=False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample(recompute=recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.snapshot_query","title":"snapshot_query(table_name=None)
","text":"An Athena query to get the latest snapshot of features
Parameters:
table_name (str): The name of the table to query (default: None)
Returns:
str: The Athena query to get the latest snapshot of features
Source code in src/sageworks/core/artifacts/feature_set_core.py
def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.id_column} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n
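The generated SQL keeps one row per id (the newest by event time, then api_invocation_time, then write_time) and drops ids whose newest row is marked deleted. The same selection logic can be sketched in plain Python over a list of dicts; `latest_snapshot` and the sample rows are hypothetical, and unlike the SQL this sketch does not strip the FeatureGroup metadata columns from the output.

```python
# Sketch only: mirrors the row_number() / row_num = 1 logic of the query above
# on in-memory rows; the function name and sample data are illustrative.
def latest_snapshot(rows: list, id_column: str, event_time: str) -> list:
    # Sort newest-first using the same tie-break order as the ORDER BY clause
    ordered = sorted(
        rows,
        key=lambda r: (r[event_time], r['api_invocation_time'], r['write_time']),
        reverse=True,
    )
    latest = {}
    for row in ordered:
        latest.setdefault(row[id_column], row)  # first seen is row_num 1
    # Drop ids whose newest row is a delete marker
    return [r for r in latest.values() if not r['is_deleted']]


rows = [
    {'id': 1, 'event_time': 1, 'api_invocation_time': 1, 'write_time': 1, 'is_deleted': False, 'value': 'old'},
    {'id': 1, 'event_time': 2, 'api_invocation_time': 2, 'write_time': 2, 'is_deleted': False, 'value': 'new'},
    {'id': 2, 'event_time': 1, 'api_invocation_time': 1, 'write_time': 1, 'is_deleted': False, 'value': 'kept'},
    {'id': 2, 'event_time': 3, 'api_invocation_time': 3, 'write_time': 3, 'is_deleted': True, 'value': 'gone'},
]
snapshot = latest_snapshot(rows, id_column='id', event_time='event_time')
```

Here id 1 resolves to its newer row, while id 2 disappears entirely because its most recent row is a delete marker, even though an older non-deleted row exists.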
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.table_hash","title":"table_hash()
","text":"Return the hash for the Athena table
Source code in src/sageworks/core/artifacts/feature_set_core.py
def table_hash(self) -> str:\n \"\"\"Return the hash for the Athena table\"\"\"\n return self.data_source.table_hash()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.value_counts","title":"value_counts(recompute=False)
","text":"Get the value counts for the string columns of the underlying DataSource
Parameters:
recompute (bool): Recompute the value counts (default=False)
Returns:
dict: A dictionary of value counts for the string columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.view","title":"view(view_name)
","text":"Return a View object for a specific view
Parameters:
view_name (str): The name of the view to return
Returns:
View: The View object for the specified view
Source code in src/sageworks/core/artifacts/feature_set_core.py
def view(self, view_name: str) -> \"View\":\n \"\"\"Return a View object for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n View: The View object for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.views","title":"views()
","text":"Return the views for this Data Source
Source code in src/sageworks/core/artifacts/feature_set_core.py
def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self.data_source)\n
"},{"location":"core_classes/artifacts/model_core/","title":"ModelCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Model API Class and voil\u00e0 it works the same.
ModelCore: SageWorks ModelCore Class
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.InferenceImage","title":"InferenceImage
","text":"Class for retrieving locked Scikit-Learn inference images
Source code in src/sageworks/core/artifacts/model_core.py
class InferenceImage:\n \"\"\"Class for retrieving locked Scikit-Learn inference images\"\"\"\n\n image_uris = {\n (\"us-east-1\", \"sklearn\", \"1.2.1\"): (\n \"683313688378.dkr.ecr.us-east-1.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-east-2\", \"sklearn\", \"1.2.1\"): (\n \"257758044811.dkr.ecr.us-east-2.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-west-1\", \"sklearn\", \"1.2.1\"): (\n \"746614075791.dkr.ecr.us-west-1.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-west-2\", \"sklearn\", \"1.2.1\"): (\n \"246618743249.dkr.ecr.us-west-2.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n }\n\n @classmethod\n def get_image_uri(cls, region, framework, version):\n key = (region, framework, version)\n if key in cls.image_uris:\n return cls.image_uris[key]\n else:\n raise ValueError(\n f\"No matching image found for region: {region}, framework: {framework}, version: {version}\"\n )\n
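The lookup table above pins one sklearn version in four US regions, so any other region, framework, or version raises a ValueError. The sketch below shows the same tuple-keyed lookup and one way a caller might handle the miss; `ImageLookup`, `safe_image_uri`, and the placeholder URI are hypothetical and not part of SageWorks.

```python
class ImageLookup:
    # Hypothetical placeholder entry, NOT a real ECR image URI
    image_uris = {
        ('us-west-2', 'sklearn', '1.2.1'): 'example-registry/sklearn@sha256:placeholder',
    }

    @classmethod
    def get_image_uri(cls, region: str, framework: str, version: str) -> str:
        key = (region, framework, version)
        if key not in cls.image_uris:
            raise ValueError(f'No matching image found for {key}')
        return cls.image_uris[key]


def safe_image_uri(region: str, framework: str, version: str):
    # Callers can turn the ValueError into None (or a default image)
    try:
        return ImageLookup.get_image_uri(region, framework, version)
    except ValueError:
        return None


found = safe_image_uri('us-west-2', 'sklearn', '1.2.1')
not_pinned = safe_image_uri('eu-west-1', 'sklearn', '1.2.1')
```

Whether a caller should fall back to a default image or fail loudly depends on the deployment; raising keeps an unpinned image from being used silently.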
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore","title":"ModelCore
","text":" Bases: Artifact
ModelCore: SageWorks ModelCore Class
Common Usage:
my_model = ModelCore(model_uuid)\nmy_model.summary()\nmy_model.details()\n
Source code in src/sageworks/core/artifacts/model_core.py
class ModelCore(Artifact):\n \"\"\"ModelCore: SageWorks ModelCore Class\n\n Common Usage:\n ```python\n my_model = ModelCore(model_uuid)\n my_model.summary()\n my_model.details()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, model_type: ModelType = None, **kwargs):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n **kwargs: Additional keyword arguments\n \"\"\"\n\n # Make sure the model name is valid\n self.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(model_uuid, **kwargs)\n\n # Initialize our class attributes\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n self.model_training_path = None\n self.endpoint_inference_path = None\n\n # Grab an Cloud Platform Meta object and pull information for this Model\n self.model_name = model_uuid\n self.model_meta = self.meta.model(self.model_name)\n if self.model_meta is None:\n self.log.warning(f\"Could not find model {self.model_name} within current visibility scope\")\n return\n else:\n # Is this a model package group without any models?\n if len(self.model_meta[\"ModelPackageList\"]) == 0:\n self.log.warning(f\"Model Group {self.model_name} has no Model Packages!\")\n self.latest_model = None\n self.add_health_tag(\"model_not_found\")\n return\n try:\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. 
Delete and recreate it!\")\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.meta.model(self.model_name)\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n\n def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Check if the model exists\n if self.latest_model is None:\n health_issues.append(\"model_not_found\")\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n needs_metrics = self.model_type in {ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR, ModelType.CLASSIFIER}\n if needs_metrics and self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n\n # Endpoint\n if not self.endpoints():\n health_issues.append(\"no_endpoint\")\n else:\n self.remove_health_tag(\"no_endpoint\")\n return health_issues\n\n def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return 
the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.container_image()\n )\n\n def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference runs\n \"\"\"\n\n # Check if we have a model (if not return empty list)\n if self.latest_model is None:\n return []\n\n # Check if we have model training metrics in our metadata\n have_model_training = True if self.sageworks_meta().get(\"sageworks_training_metrics\") else False\n\n # Now grab the list of directories from our inference path\n inference_runs = []\n if self.endpoint_inference_path:\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the model training to the end of the list\n if have_model_training:\n inference_runs.append(\"model_training\")\n return inference_runs\n\n def delete_inference_run(self, inference_run_uuid: str):\n \"\"\"Delete the inference run for this model\n\n Args:\n inference_run_uuid (str): UUID of the inference run\n \"\"\"\n if inference_run_uuid == \"model_training\":\n self.log.warning(\"Cannot delete model training data!\")\n return\n\n if self.endpoint_inference_path:\n full_path = f\"{self.endpoint_inference_path}/{inference_run_uuid}\"\n # Check if there are any objects at the path\n if wr.s3.list_objects(full_path):\n wr.s3.delete_objects(path=full_path)\n self.log.important(f\"Deleted inference run {inference_run_uuid} for {self.model_name}\")\n else:\n self.log.warning(f\"Inference run {inference_run_uuid} not found for {self.model_name}!\")\n else:\n self.log.warning(f\"No inference data found for {self.model_name}!\")\n\n def get_inference_metrics(self, capture_uuid: str 
= \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"auto_inference\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). 
Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.confusion_matrix(\"auto_inference\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! 
It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.model_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n\n def group_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.model_meta[\"ModelPackageGroupArn\"] if self.model_meta else None\n\n def model_package_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)\"\"\"\n if self.latest_model is None:\n return None\n return self.latest_model[\"ModelPackageArn\"]\n\n def container_info(self) -> Union[dict, None]:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"InferenceSpecification\"][\"Containers\"][0] if self.latest_model else None\n\n def container_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.container_info()[\"Image\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this model\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n\n def hash(self) -> Optional[str]:\n \"\"\"Return the hash for this artifact\n\n Returns:\n Optional[str]: The hash for this artifact\n \"\"\"\n model_url = self.get_model_data_url()\n return 
get_s3_etag(model_url, self.boto3_session)\n\n def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # Remove any health tags\n self.remove_health_tag(\"no_endpoint\")\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n def remove_endpoint(self, endpoint_name: str):\n \"\"\"Remove this endpoint from the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Removing Endpoint {endpoint_name} from Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.discard(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # If we have NO endpoints, then set a health tag\n if not registered_endpoints:\n self.add_health_tag(\"no_endpoint\")\n self.details(recompute=True)\n\n # Give the AWS Metadata a chance to update\n time.sleep(2)\n\n def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n\n def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Get the S3 Path for the Inference Data\n\n Returns:\n str: S3 Path for the Inference Data (or None if not found)\n 
\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n inference_path = newest_path(endpoint_inference_paths, self.sm_session)\n if inference_path is None:\n self.log.important(f\"No inference data found for {self.model_name}!\")\n self.log.important(f\"Returning default inference path for {registered_endpoints[0]}...\")\n self.log.important(f\"{endpoint_inference_paths[0]}\")\n return endpoint_inference_paths[0]\n else:\n return inference_path\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n\n def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n\n def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n\n def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n\n def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n\n def class_labels(self) -> Union[list[str], None]:\n 
\"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n\n def set_class_labels(self, labels: list[str]):\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n\n def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n self.log.info(\"Computing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n\n # Sanity check is we have models in the group\n if self.latest_model is None:\n self.log.warning(f\"Model Package Group {self.model_name} has no models!\")\n return details\n\n # Grab the Model Details\n details[\"description\"] = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = self.latest_model[\"ModelPackageVersion\"]\n details[\"status\"] = self.latest_model[\"ModelPackageStatus\"]\n details[\"approval_status\"] = self.latest_model.get(\"ModelApprovalStatus\", \"unknown\")\n details[\"image\"] = self.container_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n inference_spec = self.latest_model[\"InferenceSpecification\"]\n container_info = self.container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n 
details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n elif self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = None\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Return the details\n return details\n\n # Pipeline for this model\n def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n\n def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n\n def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n\n def _determine_model_type(self):\n \"\"\"Internal: Determine the Model Type\"\"\"\n model_type = input(\"Model Type? 
(classifier, regressor, quantile_regressor, unsupervised, transformer): \")\n if model_type == \"classifier\":\n self._set_model_type(ModelType.CLASSIFIER)\n elif model_type == \"regressor\":\n self._set_model_type(ModelType.REGRESSOR)\n elif model_type == \"quantile_regressor\":\n self._set_model_type(ModelType.QUANTILE_REGRESSOR)\n elif model_type == \"unsupervised\":\n self._set_model_type(ModelType.UNSUPERVISED)\n elif model_type == \"transformer\":\n self._set_model_type(ModelType.TRANSFORMER)\n else:\n self.log.warning(f\"Unknown Model Type {model_type}!\")\n self._set_model_type(ModelType.UNKNOWN)\n\n def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? 
(use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? (use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Model Class Labels (if it's a classifier)\n if self.model_type == ModelType.CLASSIFIER:\n class_labels = self.class_labels()\n if class_labels is None or ask_everything:\n class_labels = input(\"Class Labels? (use commas): \")\n class_labels = [e.strip() for e in class_labels.split(\",\")]\n if class_labels in [[\"None\"], [\"none\"], [\"\"]]:\n class_labels = None\n self.set_class_labels(class_labels)\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n\n def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n ) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. 
Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def get_model_data_url(self) -> Optional[str]:\n \"\"\"Retrieve the ModelDataUrl from the model's AWS metadata.\n\n Returns:\n Optional[str]: The ModelDataUrl if available, otherwise None.\n \"\"\"\n meta = self.aws_meta()\n try:\n return meta[\"ModelPackageList\"][0][\"InferenceSpecification\"][\"Containers\"][0][\"ModelDataUrl\"]\n except (KeyError, IndexError, TypeError):\n return None\n\n def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete a Model that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Model Group\n ModelCore.managed_delete(model_group_name=self.uuid)\n\n @classmethod\n def managed_delete(cls, model_group_name: str):\n \"\"\"Delete the Model Packages, Model Group, and S3 Storage Objects\n\n Args:\n model_group_name (str): The name of the Model Group to delete\n \"\"\"\n # Check if the model group exists in SageMaker\n try:\n cls.sm_client.describe_model_package_group(ModelPackageGroupName=model_group_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] 
in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Model Group {model_group_name} not found!\")\n return\n else:\n raise # Re-raise unexpected errors\n\n # Delete Model Packages within the Model Group\n try:\n paginator = cls.sm_client.get_paginator(\"list_model_packages\")\n for page in paginator.paginate(ModelPackageGroupName=model_group_name):\n for model_package in page[\"ModelPackageSummaryList\"]:\n package_arn = model_package[\"ModelPackageArn\"]\n cls.log.info(f\"Deleting Model Package {package_arn}...\")\n cls.sm_client.delete_model_package(ModelPackageName=package_arn)\n except ClientError as e:\n cls.log.error(f\"Error while deleting model packages: {e}\")\n raise\n\n # Delete the Model Package Group\n cls.log.info(f\"Deleting Model Group {model_group_name}...\")\n cls.sm_client.delete_model_package_group(ModelPackageGroupName=model_group_name)\n\n # Delete S3 training artifacts\n s3_delete_path = f\"{cls.models_s3_path}/training/{model_group_name}/\"\n cls.log.info(f\"Deleting S3 Objects at {s3_delete_path}...\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(model_group_name)\n\n def _set_model_type(self, model_type: ModelType):\n \"\"\"Internal: Set the Model Type for this Model\"\"\"\n self.model_type = model_type\n self.upsert_sageworks_meta({\"sageworks_model_type\": self.model_type.value})\n self.remove_health_tag(\"model_type_unknown\")\n\n def _get_model_type(self) -> ModelType:\n \"\"\"Internal: Query the SageWorks Metadata to get the model type\n Returns:\n ModelType: The ModelType of this Model\n Notes:\n This is an internal method that should not be called directly\n Use the model_type attribute instead\n \"\"\"\n model_type = self.sageworks_meta().get(\"sageworks_model_type\")\n try:\n return ModelType(model_type)\n except ValueError:\n 
self.log.warning(f\"Could not determine model type for {self.model_name}!\")\n return ModelType.UNKNOWN\n\n def _load_training_metrics(self):\n \"\"\"Internal: Retrieve the training metrics and Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Notes:\n This may or may not exist based on whether we have access to TrainingJobAnalytics\n \"\"\"\n try:\n df = TrainingJobAnalytics(training_job_name=self.training_job_name).dataframe()\n if df.empty:\n self.log.important(f\"No training job metrics found for {self.training_job_name}\")\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n if \"timestamp\" in df.columns:\n df = df.drop(columns=[\"timestamp\"])\n\n # We're going to pivot the DataFrame to get the desired structure\n reg_metrics_df = df.set_index(\"metric_name\").T\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": reg_metrics_df.to_dict(), \"sageworks_training_cm\": None}\n )\n return\n\n except (KeyError, botocore.exceptions.ClientError):\n self.log.important(f\"No training job metrics found for {self.training_job_name}\")\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n\n # We need additional processing for classification metrics\n if self.model_type == ModelType.CLASSIFIER:\n metrics_df, cm_df = self._process_classification_metrics(df)\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": metrics_df.to_dict(), \"sageworks_training_cm\": cm_df.to_dict()}\n )\n\n def _load_inference_metrics(self, capture_uuid: str = \"auto_inference\"):\n \"\"\"Internal: Retrieve the inference model metrics for this model\n and load the data 
into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n inference_metrics = pull_s3_data(s3_path)\n\n # Store data into the SageWorks Metadata\n metrics_storage = None if inference_metrics is None else inference_metrics.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_metrics\": metrics_storage})\n\n def get_inference_metadata(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n\n def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: 
DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Sanity check that the model should have predictions\n has_predictions = self.model_type in [ModelType.CLASSIFIER, ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]\n if not has_predictions:\n self.log.warning(f\"No Predictions for {self.model_name}...\")\n return None\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n\n def _get_validation_predictions(self) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Retrieve the captured prediction results for this model\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Validation Predictions (might be None)\n \"\"\"\n # Sanity check the training path (which may or may not exist)\n if self.model_training_path is None:\n self.log.warning(f\"No Validation Predictions for {self.model_name}...\")\n return None\n self.log.important(f\"Grabbing Validation Predictions for {self.model_name}...\")\n s3_path = f\"{self.model_training_path}/validation_predictions.csv\"\n df = pull_s3_data(s3_path)\n return df\n\n def _extract_training_job_name(self) -> Union[str, None]:\n \"\"\"Internal: Extract the training job name from the ModelDataUrl\"\"\"\n try:\n model_data_url = self.container_info()[\"ModelDataUrl\"]\n parsed_url = urllib.parse.urlparse(model_data_url)\n training_job_name = parsed_url.path.lstrip(\"/\").split(\"/\")[0]\n return training_job_name\n except KeyError:\n self.log.warning(f\"Could not extract training job name from {model_data_url}\")\n return None\n\n @staticmethod\n def _process_classification_metrics(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"Internal: Process classification metrics into a 
more reasonable format\n Args:\n df (pd.DataFrame): DataFrame of training metrics\n Returns:\n (pd.DataFrame, pd.DataFrame): Tuple of DataFrames. Metrics and confusion matrix\n \"\"\"\n # Split into two DataFrames based on 'metric_name'\n metrics_df = df[df[\"metric_name\"].str.startswith(\"Metrics:\")].copy()\n cm_df = df[df[\"metric_name\"].str.startswith(\"ConfusionMatrix:\")].copy()\n\n # Split the 'metric_name' into different parts\n metrics_df[\"class\"] = metrics_df[\"metric_name\"].str.split(\":\").str[1]\n metrics_df[\"metric_type\"] = metrics_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to get the desired structure\n metrics_df = metrics_df.pivot(index=\"class\", columns=\"metric_type\", values=\"value\").reset_index()\n metrics_df = metrics_df.rename_axis(None, axis=1)\n\n # Now process the confusion matrix\n cm_df[\"row_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[1]\n cm_df[\"col_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to create a form suitable for the heatmap\n cm_df = cm_df.pivot(index=\"row_class\", columns=\"col_class\", values=\"value\")\n\n # Convert the values in cm_df to integers\n cm_df = cm_df.astype(int)\n\n return metrics_df, cm_df\n\n def shapley_values(self, capture_uuid: str = \"auto_inference\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapley values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: Dataframe(s) of the shapley values or None if not found\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == 
ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
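The `_process_classification_metrics` helper in the listing above splits metric names of the form 'Metrics:class:metric' and 'ConfusionMatrix:row:col' and then pivots them. The same reshaping can be sketched without pandas. This is only an illustration under assumptions: `pivot_metric_rows` and the sample rows are hypothetical and are not part of SageWorks.

```python
from collections import defaultdict


def pivot_metric_rows(rows, prefix):
    # rows: iterable of (metric_name, value) pairs; keep names starting with prefix
    table = defaultdict(dict)
    for name, value in rows:
        if not name.startswith(prefix):
            continue
        # 'Metrics:cat:precision' -> ['Metrics', 'cat', 'precision']
        _, row_key, col_key = name.split(':')
        table[row_key][col_key] = value
    return dict(table)


# Hypothetical training-job metrics in the flat name/value layout
rows = [
    ('Metrics:cat:precision', 0.9),
    ('Metrics:cat:recall', 0.8),
    ('Metrics:dog:precision', 0.7),
    ('ConfusionMatrix:cat:cat', 8),
    ('ConfusionMatrix:cat:dog', 2),
]
metrics = pivot_metric_rows(rows, 'Metrics:')
cm = pivot_metric_rows(rows, 'ConfusionMatrix:')
print(metrics)
print(cm)
```

The sketch assumes every name has exactly three colon-separated parts, so a class label that itself contains a colon would need different handling.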
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.__init__","title":"__init__(model_uuid, model_type=None, **kwargs)
","text":"ModelCore Initialization Args: model_uuid (str): Name of Model in SageWorks. model_type (ModelType, optional): Set this for newly created Models. Defaults to None. **kwargs: Additional keyword arguments
Source code in src/sageworks/core/artifacts/model_core.py
def __init__(self, model_uuid: str, model_type: ModelType = None, **kwargs):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n **kwargs: Additional keyword arguments\n \"\"\"\n\n # Make sure the model name is valid\n self.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(model_uuid, **kwargs)\n\n # Initialize our class attributes\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n self.model_training_path = None\n self.endpoint_inference_path = None\n\n # Grab a Cloud Platform Meta object and pull information for this Model\n self.model_name = model_uuid\n self.model_meta = self.meta.model(self.model_name)\n if self.model_meta is None:\n self.log.warning(f\"Could not find model {self.model_name} within current visibility scope\")\n return\n else:\n # Is this a model package group without any models?\n if len(self.model_meta[\"ModelPackageList\"]) == 0:\n self.log.warning(f\"Model Group {self.model_name} has no Model Packages!\")\n self.latest_model = None\n self.add_health_tag(\"model_not_found\")\n return\n try:\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. 
Delete and recreate it!\")\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code in src/sageworks/core/artifacts/model_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/model_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.model_meta\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this model
Source code in src/sageworks/core/artifacts/model_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this model\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.class_labels","title":"class_labels()
","text":"Return the class labels for this Model (if it's a classifier)
Returns: Union[list[str], None]: List of class labels
Source code in src/sageworks/core/artifacts/model_core.py
def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion_matrix for this model
Parameters: capture_uuid (str, optional): Specific capture_uuid or \"model_training\" (default: \"latest\")
Returns: pd.DataFrame: DataFrame of the Confusion Matrix (might be None)
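The source listing for this method resolves 'latest' by trying the 'auto_inference' capture first and falling back to 'model_training' only when that lookup returns None. A minimal sketch of that order, in isolation: `resolve_latest` and the `captures` dictionary are hypothetical stand-ins for the S3 and metadata lookups, not SageWorks APIs.

```python
from typing import Callable, Optional


def resolve_latest(fetch: Callable[[str], Optional[dict]]) -> Optional[dict]:
    # Try the automatic inference capture first
    result = fetch('auto_inference')
    if result is not None:
        return result
    # Fall back to the training-time capture (which may also be None)
    return fetch('model_training')


# Hypothetical store where only a training-time confusion matrix exists
captures = {'model_training': {'a': {'a': 5, 'b': 1}, 'b': {'a': 0, 'b': 7}}}
cm = resolve_latest(captures.get)
print(cm)
```

The check is against None and not truthiness, so an empty but present result from the first lookup is still returned as-is.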
Source code in src/sageworks/core/artifacts/model_core.py
def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"model_training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.confusion_matrix(\"auto_inference\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.container_image","title":"container_image()
","text":"Container Image for the Latest Model Package
Source code in src/sageworks/core/artifacts/model_core.py
def container_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.container_info()[\"Image\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.container_info","title":"container_info()
","text":"Container Info for the Latest Model Package
Source code in src/sageworks/core/artifacts/model_core.py
def container_info(self) -> Union[dict, None]:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"InferenceSpecification\"][\"Containers\"][0] if self.latest_model else None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/model_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete","title":"delete()
","text":"Delete the Model Packages and the Model Group
Source code in src/sageworks/core/artifacts/model_core.py
def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete a Model that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Model Group\n ModelCore.managed_delete(model_group_name=self.uuid)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete_inference_run","title":"delete_inference_run(inference_run_uuid)
","text":"Delete the inference run for this model
Parameters:
Name | Type | Description | Default
inference_run_uuid | str | UUID of the inference run | required
Source code in src/sageworks/core/artifacts/model_core.py
def delete_inference_run(self, inference_run_uuid: str):\n \"\"\"Delete the inference run for this model\n\n Args:\n inference_run_uuid (str): UUID of the inference run\n \"\"\"\n if inference_run_uuid == \"model_training\":\n self.log.warning(\"Cannot delete model training data!\")\n return\n\n if self.endpoint_inference_path:\n full_path = f\"{self.endpoint_inference_path}/{inference_run_uuid}\"\n # Check if there are any objects at the path\n if wr.s3.list_objects(full_path):\n wr.s3.delete_objects(path=full_path)\n self.log.important(f\"Deleted inference run {inference_run_uuid} for {self.model_name}\")\n else:\n self.log.warning(f\"Inference run {inference_run_uuid} not found for {self.model_name}!\")\n else:\n self.log.warning(f\"No inference data found for {self.model_name}!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.details","title":"details(recompute=False)
","text":"Additional Details about this Model Args: recompute (bool, optional): Recompute the details (default: False) Returns: dict: Dictionary of details about this Model
Source code in src/sageworks/core/artifacts/model_core.py
def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n self.log.info(\"Computing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n\n # Sanity check if we have models in the group\n if self.latest_model is None:\n self.log.warning(f\"Model Package Group {self.model_name} has no models!\")\n return details\n\n # Grab the Model Details\n details[\"description\"] = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = self.latest_model[\"ModelPackageVersion\"]\n details[\"status\"] = self.latest_model[\"ModelPackageStatus\"]\n details[\"approval_status\"] = self.latest_model.get(\"ModelApprovalStatus\", \"unknown\")\n details[\"image\"] = self.container_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n inference_spec = self.latest_model[\"InferenceSpecification\"]\n container_info = self.container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n elif self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = None\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.endpoints","title":"endpoints()
","text":"Get the list of registered endpoints for this Model
Returns:
Type | Description
list[str] | List of registered endpoints
Source code in src/sageworks/core/artifacts/model_core.py
def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.exists","title":"exists()
","text":"Does the model metadata exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/model_core.py
def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Model when it's ready Returns: list[str]: List of expected metadata keys
Source code in src/sageworks/core/artifacts/model_core.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.features","title":"features()
","text":"Return a list of features used for this Model
Returns:
Type | Description
Union[list[str], None] | List of features used for this Model
Source code in src/sageworks/core/artifacts/model_core.py
def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Get the S3 Path for the Inference Data
Returns:
Type | Description
Union[str, None] | S3 Path for the Inference Data (or None if not found)
Source code in src/sageworks/core/artifacts/model_core.py
def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Get the S3 Path for the Inference Data\n\n Returns:\n str: S3 Path for the Inference Data (or None if not found)\n \"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n inference_path = newest_path(endpoint_inference_paths, self.sm_session)\n if inference_path is None:\n self.log.important(f\"No inference data found for {self.model_name}!\")\n self.log.important(f\"Returning default inference path for {registered_endpoints[0]}...\")\n self.log.important(f\"{endpoint_inference_paths[0]}\")\n return endpoint_inference_paths[0]\n else:\n return inference_path\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metadata","title":"get_inference_metadata(capture_uuid='auto_inference')
","text":"Retrieve the inference metadata for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | A specific capture_uuid (default: \"auto_inference\") | 'auto_inference'
Returns:
Type | Description
Union[DataFrame, None] | DataFrame of the inference metadata (might be None)
Notes: The metadata records when Endpoint inference was run, the name of the dataset, the MD5 hash, etc.
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_metadata(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n dict: Dictionary of the inference metadata (might be None)\n Notes:\n Basically when Endpoint inference was run, name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the inference performance metrics for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid or \"model_training\" (default: \"latest\") | 'latest'
Returns: pd.DataFrame: DataFrame of the Model Metrics
Note: If a capture_uuid isn't specified, this tries the \"auto_inference\" capture first and then falls back to the \"model_training\" metrics.
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"model_training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try the 'auto_inference' capture first, then fall back to the training metrics\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"auto_inference\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_predictions","title":"get_inference_predictions(capture_uuid='auto_inference')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid (default: \"auto_inference\") | 'auto_inference'
Returns:
Type | Description
Union[DataFrame, None] | DataFrame of the Captured Predictions (might be None)
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Sanity check that the model should have predictions\n has_predictions = self.model_type in [ModelType.CLASSIFIER, ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]\n if not has_predictions:\n self.log.warning(f\"No Predictions for {self.model_name}...\")\n return None\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_model_data_url","title":"get_model_data_url()
","text":"Retrieve the ModelDataUrl from the model's AWS metadata.
Returns:
Type | Description
Optional[str] | The ModelDataUrl if available, otherwise None.
Source code in src/sageworks/core/artifacts/model_core.py
def get_model_data_url(self) -> Optional[str]:\n \"\"\"Retrieve the ModelDataUrl from the model's AWS metadata.\n\n Returns:\n Optional[str]: The ModelDataUrl if available, otherwise None.\n \"\"\"\n meta = self.aws_meta()\n try:\n return meta[\"ModelPackageList\"][0][\"InferenceSpecification\"][\"Containers\"][0][\"ModelDataUrl\"]\n except (KeyError, IndexError, TypeError):\n return None\n
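The defensive nested lookup in get_model_data_url can be tried against plain dictionaries. This is a hedged sketch: model_data_url is a hypothetical free function, and the sample metadata and S3 URL are made up for illustration.

```python
from typing import Optional


def model_data_url(meta: Optional[dict]) -> Optional[str]:
    # Walk the nested structure; any missing key, empty list, or None yields None
    try:
        return meta['ModelPackageList'][0]['InferenceSpecification']['Containers'][0]['ModelDataUrl']
    except (KeyError, IndexError, TypeError):
        return None


# Made-up metadata shaped like the structure the lookup expects
sample = {
    'ModelPackageList': [
        {'InferenceSpecification': {'Containers': [{'ModelDataUrl': 's3://example-bucket/model.tar.gz'}]}}
    ]
}
found = model_data_url(sample)
empty_group = model_data_url({'ModelPackageList': []})
no_meta = model_data_url(None)
```

Catching TypeError covers the case where the metadata itself is None, KeyError covers a missing field, and IndexError covers an empty package or container list.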
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_pipeline","title":"get_pipeline()
","text":"Get the pipeline for this model
Source code in src/sageworks/core/artifacts/model_core.py
def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.group_arn","title":"group_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code in src/sageworks/core/artifacts/model_core.py
def group_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.model_meta[\"ModelPackageGroupArn\"] if self.model_meta else None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.hash","title":"hash()
","text":"Return the hash for this artifact
Returns:
Type | Description
Optional[str] | The hash for this artifact
Source code in src/sageworks/core/artifacts/model_core.py
def hash(self) -> Optional[str]:\n \"\"\"Return the hash for this artifact\n\n Returns:\n Optional[str]: The hash for this artifact\n \"\"\"\n model_url = self.get_model_data_url()\n return get_s3_etag(model_url, self.boto3_session)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.health_check","title":"health_check()
","text":"Perform a health check on this model Returns: list[str]: List of health issues
Source code in src/sageworks/core/artifacts/model_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Check if the model exists\n if self.latest_model is None:\n health_issues.append(\"model_not_found\")\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n needs_metrics = self.model_type in {ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR, ModelType.CLASSIFIER}\n if needs_metrics and self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n\n # Endpoint\n if not self.endpoints():\n health_issues.append(\"no_endpoint\")\n else:\n self.remove_health_tag(\"no_endpoint\")\n return health_issues\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.is_model_unknown","title":"is_model_unknown()
","text":"Is the Model Type unknown?
Source code in src/sageworks/core/artifacts/model_core.py
def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.latest_model_object","title":"latest_model_object()
","text":"Return the latest AWS Sagemaker Model object for this SageWorks Model
Returns:
Type | Description
Model | sagemaker.model.Model: AWS Sagemaker Model object
Source code in src/sageworks/core/artifacts/model_core.py
def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.container_image()\n )\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.list_inference_runs","title":"list_inference_runs()
","text":"List the inference runs for this model
Returns:
Type | Description
list[str] | List of inference runs
Source code in src/sageworks/core/artifacts/model_core.py
def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference runs\n \"\"\"\n\n # Check if we have a model (if not return empty list)\n if self.latest_model is None:\n return []\n\n # Check if we have model training metrics in our metadata\n have_model_training = True if self.sageworks_meta().get(\"sageworks_training_metrics\") else False\n\n # Now grab the list of directories from our inference path\n inference_runs = []\n if self.endpoint_inference_path:\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the model training to the end of the list\n if have_model_training:\n inference_runs.append(\"model_training\")\n return inference_runs\n
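The run names in list_inference_runs come from the second-to-last path segment of each S3 directory URI. A minimal sketch of that parsing step, assuming directory URIs end with a trailing slash; run_names and the bucket and endpoint names are hypothetical.

```python
from urllib.parse import urlparse


def run_names(directories: list[str]) -> list[str]:
    # Each URI ends with '/', so split('/') leaves an empty last element
    # and the run name sits at index -2
    return [urlparse(d).path.split('/')[-2] for d in directories]


# Made-up directory listing for illustration
dirs = [
    's3://example-bucket/endpoints/inference/my-endpoint/auto_inference/',
    's3://example-bucket/endpoints/inference/my-endpoint/holdout_run/',
]
names = run_names(dirs)
```

If a URI lacked the trailing slash, index -2 would return the parent segment instead, which is why the source appends a slash to the inference path before listing.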
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.managed_delete","title":"managed_delete(model_group_name)
classmethod
","text":"Delete the Model Packages, Model Group, and S3 Storage Objects
Parameters:
Name | Type | Description | Default
model_group_name | str | The name of the Model Group to delete | required
Source code in src/sageworks/core/artifacts/model_core.py
@classmethod\ndef managed_delete(cls, model_group_name: str):\n \"\"\"Delete the Model Packages, Model Group, and S3 Storage Objects\n\n Args:\n model_group_name (str): The name of the Model Group to delete\n \"\"\"\n # Check if the model group exists in SageMaker\n try:\n cls.sm_client.describe_model_package_group(ModelPackageGroupName=model_group_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Model Group {model_group_name} not found!\")\n return\n else:\n raise # Re-raise unexpected errors\n\n # Delete Model Packages within the Model Group\n try:\n paginator = cls.sm_client.get_paginator(\"list_model_packages\")\n for page in paginator.paginate(ModelPackageGroupName=model_group_name):\n for model_package in page[\"ModelPackageSummaryList\"]:\n package_arn = model_package[\"ModelPackageArn\"]\n cls.log.info(f\"Deleting Model Package {package_arn}...\")\n cls.sm_client.delete_model_package(ModelPackageName=package_arn)\n except ClientError as e:\n cls.log.error(f\"Error while deleting model packages: {e}\")\n raise\n\n # Delete the Model Package Group\n cls.log.info(f\"Deleting Model Group {model_group_name}...\")\n cls.sm_client.delete_model_package_group(ModelPackageGroupName=model_group_name)\n\n # Delete S3 training artifacts\n s3_delete_path = f\"{cls.models_s3_path}/training/{model_group_name}/\"\n cls.log.info(f\"Deleting S3 Objects at {s3_delete_path}...\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(model_group_name)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_package_arn","title":"model_package_arn()
","text":"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)
Source code in src/sageworks/core/artifacts/model_core.py
def model_package_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)\"\"\"\n if self.latest_model is None:\n return None\n return self.latest_model[\"ModelPackageArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/model_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard","title":"onboard(ask_everything=False)
","text":"This is an interactive method that will onboard the Model (make it ready)
Parameters:
Name | Type | Description | Default
ask_everything | bool | Ask for all the details. Defaults to False. | False
Returns:
Type | Description
bool | True if the Model is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/model_core.py
def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? (use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Model Class Labels (if it's a classifier)\n if self.model_type == ModelType.CLASSIFIER:\n class_labels = self.class_labels()\n if class_labels is None or ask_everything:\n class_labels = input(\"Class Labels? (use commas): \")\n class_labels = [e.strip() for e in class_labels.split(\",\")]\n if class_labels in [[\"None\"], [\"none\"], [\"\"]]:\n class_labels = None\n self.set_class_labels(class_labels)\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n
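The onboard prompts for features, endpoints, and class labels all parse a comma-separated answer the same way. A small sketch of that shared step, assuming a hypothetical helper named parse_list_answer (the source repeats the logic inline rather than using a helper):

```python
from typing import Optional


def parse_list_answer(raw: str) -> Optional[list]:
    # Split on commas and trim whitespace around each entry
    items = [e.strip() for e in raw.split(',')]
    # A lone 'None', 'none', or empty answer means no value was given
    if items in [['None'], ['none'], ['']]:
        return None
    return items


features = parse_list_answer('length, width ,height')
skipped = parse_list_answer('none')
blank = parse_list_answer('')
```

Note that only an answer consisting of a single sentinel is treated as empty; an answer such as 'a,,b' keeps the empty string as an entry.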
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard_with_args","title":"onboard_with_args(model_type, target_column=None, feature_list=None, endpoints=None, owner=None)
","text":"Onboard the Model with the given arguments
Parameters:
Name | Type | Description | Default
model_type | ModelType | Model Type | required
target_column | str | Target Column | None
feature_list | list | List of Feature Columns | None
endpoints | list | List of Endpoints. Defaults to None. | None
owner | str | Model Owner. Defaults to None. | None
Returns: bool: True if the Model is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/model_core.py
def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/model_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.meta.model(self.model_name)\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.register_endpoint","title":"register_endpoint(endpoint_name)
","text":"Add this endpoint to the set of registered endpoints for the model
Parameters:
Name | Type | Description | Default
endpoint_name | str | Name of the endpoint | required
Source code in src/sageworks/core/artifacts/model_core.py
def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # Remove any health tags\n self.remove_health_tag(\"no_endpoint\")\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n
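Because register_endpoint round-trips the stored list through a set, registering the same endpoint twice is harmless. A minimal sketch of that behavior on a plain dictionary; this free function and its meta argument are hypothetical stand-ins for the SageWorks metadata calls, and the sorting is an addition made here for deterministic output.

```python
def register_endpoint(meta: dict, endpoint_name: str) -> dict:
    # Load the current registrations into a set so duplicates collapse
    endpoints = set(meta.get('sageworks_registered_endpoints', []))
    endpoints.add(endpoint_name)
    # Sorting is an assumption for stable ordering; the source stores list(set)
    return {**meta, 'sageworks_registered_endpoints': sorted(endpoints)}


meta = register_endpoint({}, 'ep-a')
meta = register_endpoint(meta, 'ep-a')  # registering again changes nothing
meta = register_endpoint(meta, 'ep-b')
```

Since a set has no defined order, callers of the real method should not rely on the order of the registered endpoints list.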
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.remove_endpoint","title":"remove_endpoint(endpoint_name)
","text":"Remove this endpoint from the set of registered endpoints for the model
Parameters:
Name | Type | Description | Default
endpoint_name | str | Name of the endpoint | required
Source code in src/sageworks/core/artifacts/model_core.py
def remove_endpoint(self, endpoint_name: str):\n \"\"\"Remove this endpoint from the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Removing Endpoint {endpoint_name} from Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.discard(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # If we have NO endpoints, then set a health tag\n if not registered_endpoints:\n self.add_health_tag(\"no_endpoint\")\n self.details(recompute=True)\n\n # Give the AWS Metadata a chance to update\n time.sleep(2)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_class_labels","title":"set_class_labels(labels)
","text":"Return the class labels for this Model (if it's a classifier)
Parameters:
Name | Type | Description | Default
labels | list[str] | List of class labels | required
Source code in src/sageworks/core/artifacts/model_core.py
def set_class_labels(self, labels: list[str]):\n \"\"\"Set the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_features","title":"set_features(feature_columns)
","text":"Set the features for this Model
Parameters:
Name | Type | Description | Default
feature_columns | list[str] | List of feature columns | required
Source code in src/sageworks/core/artifacts/model_core.py
def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name | Type | Description | Default
input | str | Name of input for this artifact | required
force | bool | Force the input to be set | False
Note: Models do not allow manual override of the input unless force=True.
Source code in src/sageworks/core/artifacts/model_core.py
def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n Models do not allow manual override of the input unless force=True\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_pipeline","title":"set_pipeline(pipeline)
","text":"Set the pipeline for this model
Parameters:
Name | Type | Description | Default
pipeline | str | Pipeline that was used to create this model | required
Source code in src/sageworks/core/artifacts/model_core.py
def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_target","title":"set_target(target_column)
","text":"Set the target for this Model
Parameters:
Name | Type | Description | Default
target_column | str | Target column for this Model | required
Source code in src/sageworks/core/artifacts/model_core.py
def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.shapley_values","title":"shapley_values(capture_uuid='auto_inference')
","text":"Retrieve the Shapley values for this model
Parameters:
Name | Type | Description | Default
capture_uuid | str | Specific capture_uuid | 'auto_inference'
Returns:
Type | Description
Union[list[DataFrame], DataFrame, None] | DataFrame(s) of the Shapley values, or None if not found
Notes: This may or may not exist, depending on whether an Endpoint ran Shapley
Source code in src/sageworks/core/artifacts/model_core.py
def shapley_values(self, capture_uuid: str = \"auto_inference\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapley values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: Dataframe(s) of the shapley values or None if not found\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/model_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.target","title":"target()
","text":"Return the target for this Model (if supervised, else None)
Returns:
Name | Type | Description
str | Union[str, None] | Target column for this Model (if supervised, else None)
Source code in src/sageworks/core/artifacts/model_core.py
def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelType","title":"ModelType
","text":" Bases: Enum
Enumerated Types for SageWorks Model Types
Source code in src/sageworks/core/artifacts/model_core.py
class ModelType(Enum):\n \"\"\"Enumerated Types for SageWorks Model Types\"\"\"\n\n CLASSIFIER = \"classifier\"\n REGRESSOR = \"regressor\"\n CLUSTERER = \"clusterer\"\n TRANSFORMER = \"transformer\"\n PROJECTION = \"projection\"\n UNSUPERVISED = \"unsupervised\"\n QUANTILE_REGRESSOR = \"quantile_regressor\"\n DETECTOR = \"detector\"\n UNKNOWN = \"unknown\"\n
"},{"location":"core_classes/artifacts/monitor_core/","title":"MonitorCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Monitor API Class and voil\u00e0 it works the same.
MonitorCore class for monitoring SageMaker endpoints
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore","title":"MonitorCore
","text":"Source code in src/sageworks/core/artifacts/monitor_core.py
class MonitorCore:\n def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"ExtractModelArtifact Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role_arn = AWSAccountClamp().aws_session.get_sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role_arn, instance_type=self.instance_type)\n\n def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": 
self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n\n def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n\n def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a 
baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n\n def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n 
data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n\n @staticmethod\n def process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and 
actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n\n def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n 
self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n\n def _get_monitor_json_data(self, s3_path: str) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Convert the JSON monitoring data into a DataFrame\n Args:\n s3_path(str): The S3 path to the monitoring data\n Returns:\n pd.DataFrame: Monitoring data in DataFrame form (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=s3_path):\n self.log.warning(\"Monitoring data does not exist in S3.\")\n return None\n else:\n raw_json = read_s3_file(s3_path=s3_path)\n monitoring_data = json.loads(raw_json)\n monitoring_df = 
pd.json_normalize(monitoring_data[\"features\"])\n return monitoring_df\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n\n def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring 
results\"\"\"\n pass\n\n def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__init__","title":"__init__(endpoint_name, instance_type='ml.t3.large')
","text":"MonitorCore class for monitoring SageMaker endpoints.
Args: endpoint_name (str): Name of the endpoint to set up monitoring for. instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\". Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...
Source code in src/sageworks/core/artifacts/monitor_core.py
def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"MonitorCore Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role_arn = AWSAccountClamp().aws_session.get_sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role_arn, instance_type=self.instance_type)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__repr__","title":"__repr__()
","text":"String representation of this MonitorCore object
Returns:
Name | Type | Description
str | str | String representation of this MonitorCore object
Source code in src/sageworks/core/artifacts/monitor_core.py
def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for the SageMaker endpoint.
Parameters:
Name | Type | Description | Default
capture_percentage | int | Percentage of data to capture | 100
Source code in src/sageworks/core/artifacts/monitor_core.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.baseline_exists","title":"baseline_exists()
","text":"Check if baseline files exist in S3.
Returns:
Name | Type | Description
bool | bool | True if all files exist, False otherwise.
Source code in src/sageworks/core/artifacts/monitor_core.py
def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_baseline","title":"create_baseline(recreate=False)
","text":"Create a baseline for monitoring.
Args: recreate (bool): If True, recreate the baseline even if it already exists.
Notes: This will create/write three files to the baseline_dir: baseline.csv, constraints.json, statistics.json.
Source code in src/sageworks/core/artifacts/monitor_core.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint.
Args: schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly). recreate (bool): If True, recreate the monitoring schedule even if it already exists.
Source code in src/sageworks/core/artifacts/monitor_core.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.details","title":"details()
","text":"Return the details of the monitoring for the endpoint
Returns: dict: The details of the monitoring for the endpoint
Source code in src/sageworks/core/artifacts/monitor_core.py
def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns: Union[DataFrame, None]: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns: Union[DataFrame, None]: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture from S3.
Returns: DataFrame (input), DataFrame (output): Flattened and processed DataFrames for input and output data.
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n
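The listing reads files[-1] and assumes that the last listed object is the most recent one. A more explicit selection could look like the sketch below. It rests on an assumption that is not stated in the source: that the capture keys embed zero-padded date and hour segments, so that the largest string is also the newest. The helper name most_recent and the bucket paths are hypothetical.

```python
def most_recent(files):
    # Assumption: keys embed zero-padded yyyy/mm/dd/hh, so the largest string is the newest
    return max(files) if files else None


files = [
    's3://bucket/capture/2024/01/02/10/b.jsonl',
    's3://bucket/capture/2024/01/02/09/a.jsonl',
]
print(most_recent(files))
```

If the key layout differs, sorting on the objects' last-modified timestamps would be the safer choice.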
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns: Union[DataFrame, None]: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.is_data_capture_configured","title":"is_data_capture_configured(capture_percentage)
","text":"Check if data capture is already configured on the endpoint. Args: capture_percentage (int): Expected data capture percentage. Returns: bool: True if data capture is already configured, False otherwise.
Source code in src/sageworks/core/artifacts/monitor_core.py
def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n
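The check requires both that capture is enabled and that the sampling percentage matches exactly. The sketch below isolates that logic as a pure function so it can be tried without AWS access. capture_matches and sample_config are hypothetical names; the dictionary keys are the ones read in the listing above.

```python
def capture_matches(endpoint_config, capture_percentage):
    # Hypothetical pure-function version of the check in is_data_capture_configured()
    data_capture_config = endpoint_config.get('DataCaptureConfig', {})
    is_enabled = data_capture_config.get('EnableCapture', False)
    current_percentage = data_capture_config.get('InitialSamplingPercentage', 0)
    return is_enabled and current_percentage == capture_percentage


sample_config = {'DataCaptureConfig': {'EnableCapture': True, 'InitialSamplingPercentage': 50}}
print(capture_matches(sample_config, 100))  # capture is enabled, but at 50 percent
print(capture_matches(sample_config, 50))
```

Because details() and summary() both call the method with capture_percentage=100, an endpoint that captures at any other percentage is reported as not configured.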
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.last_run_details","title":"last_run_details()
","text":"Return the details of the last monitoring run for the endpoint
Returns: Union[dict, None]: The details of the last monitoring run for the endpoint (None if no monitoring schedule)
Source code in src/sageworks/core/artifacts/monitor_core.py
def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n
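The extraction step can be exercised on a plain dictionary. This is a minimal sketch with a hypothetical function name, summarize_last_run. It reads the same keys as the listing above and shows what comes back when a schedule exists but has not run yet.

```python
def summarize_last_run(schedule_details):
    # Hypothetical stand-alone version of the extraction in last_run_details()
    last_run = schedule_details.get('LastMonitoringExecutionSummary', {})
    return {
        'last_run_status': last_run.get('MonitoringExecutionStatus'),
        'last_run_time': str(last_run.get('ScheduledTime')),
        'failure_reason': last_run.get('FailureReason'),
    }


print(summarize_last_run({}))
```

Because the time is passed through str(), a missing ScheduledTime comes back as the string 'None' rather than the None object, which callers may want to account for.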
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.monitoring_schedule_exists","title":"monitoring_schedule_exists()
","text":"Code to figure out if a monitoring schedule already exists for this endpoint
Source code in src/sageworks/core/artifacts/monitor_core.py
def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
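The listing inspects a single page of up to 100 schedules. In an account with more schedules than that, a paginated lookup along the following lines may be needed. This is a sketch under assumptions: schedule_exists is a hypothetical helper, and FakeSageMakerClient is a stub that imitates the NextToken behavior of the list_monitoring_schedules call so the example runs without credentials.

```python
def schedule_exists(sagemaker_client, schedule_name):
    # Follow NextToken so that every page of schedules is scanned
    kwargs = {'MaxResults': 100}
    while True:
        response = sagemaker_client.list_monitoring_schedules(**kwargs)
        for summary in response.get('MonitoringScheduleSummaries', []):
            if summary['MonitoringScheduleName'] == schedule_name:
                return True
        next_token = response.get('NextToken')
        if not next_token:
            return False
        kwargs['NextToken'] = next_token


class FakeSageMakerClient:
    # Hypothetical two-page stub standing in for a boto3 SageMaker client
    pages = {
        None: {'MonitoringScheduleSummaries': [{'MonitoringScheduleName': 'a'}], 'NextToken': 'page2'},
        'page2': {'MonitoringScheduleSummaries': [{'MonitoringScheduleName': 'b'}]},
    }

    def list_monitoring_schedules(self, MaxResults=100, NextToken=None):
        return self.pages[NextToken]


print(schedule_exists(FakeSageMakerClient(), 'b'))
```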
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.process_captured_data","title":"process_captured_data(df)
staticmethod
","text":"Process the captured data DataFrame to extract and flatten the nested data.
Args: df (DataFrame): DataFrame with captured data. Returns: DataFrame (input), DataFrame (output): Flattened and processed DataFrames for input and output data.
Source code in src/sageworks/core/artifacts/monitor_core.py
@staticmethod\ndef process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n
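To make the record shape concrete, the sketch below builds one capture record with the keys the listing reads (captureData, endpointInput, endpointOutput, data) and flattens it. It is illustrative only: it uses the standard library csv module in place of pandas, the column names and values are invented, and it assumes each CSV payload carries a header row.

```python
import csv

NEWLINE = chr(10)  # newline character used to join the CSV lines

# Hypothetical capture record in the shape process_captured_data() reads
record = {
    'captureData': {
        'endpointInput': {
            'observedContentType': 'text/csv',
            'encoding': 'CSV',
            'data': NEWLINE.join(['length,diameter', '0.455,0.365']),
        },
        'endpointOutput': {
            'observedContentType': 'text/csv',
            'encoding': 'CSV',
            'data': NEWLINE.join(['prediction', '9.7']),
        },
    }
}


def flatten(record):
    # Parse the input and output CSV payloads into lists of row dictionaries
    capture = record['captureData']
    inputs = list(csv.DictReader(capture['endpointInput']['data'].splitlines()))
    outputs = list(csv.DictReader(capture['endpointOutput']['data'].splitlines()))
    return inputs, outputs


input_rows, output_rows = flatten(record)
print(input_rows)
print(output_rows)
```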
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.setup_alerts","title":"setup_alerts()
","text":"Code to set up alerts based on monitoring results
Source code in src/sageworks/core/artifacts/monitor_core.py
def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.summary","title":"summary()
","text":"Return the summary of information about the endpoint monitor
Returns: dict: Summary of information about the endpoint monitor
Source code in src/sageworks/core/artifacts/monitor_core.py
def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n
"},{"location":"core_classes/artifacts/overview/","title":"SageWorks Artifacts","text":"API Classes
For most users, the API Classes will provide all the general functionality needed to create a full AWS ML Pipeline.
"},{"location":"core_classes/artifacts/overview/#welcome-to-the-sageworks-core-artifact-classes","title":"Welcome to the SageWorks Core Artifact Classes","text":"These classes provide low-level APIs for the SageWorks package, they interact more directly with AWS Services and are therefore more complex with a fairly large number of methods.
These DataLoader Classes are intended to load larger datasets into AWS. Large data requires AWS Glue Jobs/Batch Jobs, so in general the process is a bit more complicated and has fewer features.
If you have smaller data please see DataLoaders Light
Welcome to the SageWorks DataLoaders Heavy Classes
These classes provide low-level APIs for loading larger data into AWS services
S3HeavyToDataSource
","text":"Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
class S3HeavyToDataSource:\n def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n\n @staticmethod\n def resolve_choice_fields(dyf):\n # Get schema fields\n schema_fields = dyf.schema().fields\n\n # Collect choice fields\n choice_fields = [(field.name, \"cast:long\") for field in schema_fields if field.dataType.typeName() == \"choice\"]\n print(f\"Choice Fields: {choice_fields}\")\n\n # If there are choice fields, resolve them\n if choice_fields:\n dyf = dyf.resolveChoice(specs=choice_fields)\n\n return dyf\n\n def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n\n @staticmethod\n def remove_periods_from_columns(dyf: 
DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n\n def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n ):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of 
lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_columns(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n 
glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
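Two small pieces of transform() in the listing above can be looked at in isolation: the choice of Hive class names for the Data Catalog table and the Spark-to-Athena type mapping. The sketch below is not the library's code. The function names are hypothetical, and for ORC it pairs OrcInputFormat with OrcOutputFormat, which is an assumption about the intended value, since the listing assigns OrcInputFormat to both the input and the output format.

```python
def glue_storage_formats(output_format):
    # Hive class names for the StorageDescriptor of the Data Catalog table
    if output_format == 'parquet':
        return {
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
        }
    # Assumption: ORC tables pair OrcInputFormat with OrcOutputFormat
    return {
        'InputFormat': 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat',
        'OutputFormat': 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat',
        'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.orc.OrcSerde',
    }


def to_athena_type(spark_type):
    # Same mapping as the nested helper in transform(): only long is renamed
    return {'long': 'bigint'}.get(spark_type, spark_type)


print(glue_storage_formats('orc')['OutputFormat'])
print(to_athena_type('long'), to_athena_type('string'))
```

The listing also reads self.tag_delimiter, which the __init__ shown here does not set, so the class as listed would need that attribute defined before transform() can build its metadata.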
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.__init__","title":"__init__(glue_context, input_uuid, output_uuid)
","text":"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource
Args: glue_context (GlueContext): GlueContext, AWS Glue Specific wrapper around SparkContext. input_uuid (str): The S3 Path to the files to be loaded. output_uuid (str): The UUID of the SageWorks DataSource to be created.
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.remove_periods_from_columns","title":"remove_periods_from_columns(dyf)
staticmethod
","text":"Remove periods from column names in the DynamicFrame Args: dyf (DynamicFrame): The DynamicFrame to convert Returns: DynamicFrame: The converted DynamicFrame
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
@staticmethod\ndef remove_periods_from_columns(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n
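The renaming rule itself is a plain string replacement and can be tried without a Glue environment. In this minimal sketch, rename_without_periods is a hypothetical helper and the column names are invented.

```python
def rename_without_periods(column_names):
    # Map each original name to a period-free name, using str.replace as the listing does
    return {name: name.replace('.', '_') for name in column_names}


print(rename_without_periods(['id', 'user.address.city', 'user.name']))
```

One thing to watch for: two source columns such as a.b and a_b would map to the same output name, and the listing does not guard against that collision.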
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.timestamp_conversions","title":"timestamp_conversions(dyf, time_columns=[])
","text":"Convert columns in the DynamicFrame to the correct data types Args: dyf (DynamicFrame): The DynamicFrame to convert time_columns (list): A list of column names to convert to timestamp Returns: DynamicFrame: The converted DynamicFrame
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types\n spark_df = dyf.toDF()\n for column in time_columns:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n
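The signature uses a mutable list as a default, and transform() passes its timestamp_columns argument straight through, which defaults to None. Iterating over None raises a TypeError, so a guard along these lines may be worth considering. This is a sketch with a hypothetical helper name, not the library's code.

```python
def columns_to_convert(time_columns=None):
    # Treat a missing argument as no columns, without sharing a mutable default list
    return list(time_columns or [])


print(columns_to_convert())
print(columns_to_convert(['created_at']))
```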
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.transform","title":"transform(input_type='json', timestamp_columns=None, output_format='parquet')
","text":"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database Args: input_type (str): The type of input files, either 'csv' or 'json' timestamp_columns (list): A list of column names to convert to timestamp output_format (str): The format of the output files, either 'parquet' or 'orc'
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. 
This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_columns(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n serialization_library = 
\"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/","title":"DataLoaders Light","text":"API Classes
For most users, the API Classes will provide all the general functionality needed to create a full AWS ML Pipeline.
These DataLoader Classes are intended to load smaller datasets into AWS. If you have large data, please see DataLoaders Heavy.
Welcome to the SageWorks DataLoaders Light Classes
These classes provide low-level APIs for loading smaller data into AWS services
CSVToDataSource
","text":" Bases: Transform
CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Common Usage\ncsv_to_data = CSVToDataSource(csv_file_path, data_uuid)\ncsv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\ncsv_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
class CSVToDataSource(Transform):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n csv_to_data = CSVToDataSource(csv_file_path, data_uuid)\n csv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\n csv_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.__init__","title":"__init__(csv_file_path, data_uuid)
","text":"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| csv_file_path | str | The path to the CSV file to be transformed | required |
| data_uuid | str | The UUID of the SageWorks DataSource to be created | required |

Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource","title":"JSONToDataSource
","text":" Bases: Transform
JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Common Usage:
json_to_data = JSONToDataSource(json_file_path, data_uuid)\njson_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\njson_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
class JSONToDataSource(Transform):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n json_to_data = JSONToDataSource(json_file_path, data_uuid)\n json_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\n json_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.__init__","title":"__init__(json_file_path, data_uuid)
","text":"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| json_file_path | str | The path to the JSON file to be transformed | required |
| data_uuid | str | The UUID of the SageWorks DataSource to be created | required |

Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n
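The listing above reads its input with lines=True, which pandas documents as reading the file as one JSON object per line (the JSON Lines layout) rather than as a single JSON array. As a minimal standard-library sketch of that layout, assuming nothing about SageWorks itself (parse_json_lines and the sample records are illustrative names only):

```python
import json


def parse_json_lines(lines):
    # One JSON object per non-empty line (the JSON Lines layout)
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical sample: two serialized records with a blank line between them
sample_lines = [
    json.dumps({'id': 1, 'rings': 9}),
    '',
    json.dumps({'id': 2, 'rings': 7}),
]
records = parse_json_lines(sample_lines)
print(records)
```

A file holding one top-level JSON array would not fit this layout, so it is worth checking the file shape before handing it to JSONToDataSource.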
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight","title":"S3ToDataSourceLight
","text":" Bases: Transform
S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource
Common Usage:
s3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv\")  # or datatype=\"json\"\ns3_to_data.set_output_tags([\"abalone\", \"whatever\"])\ns3_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
class S3ToDataSourceLight(Transform):\n \"\"\"S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n s3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\n s3_to_data.set_output_tags([\"abalone\", \"whatever\"])\n s3_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n\n def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto3_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto3_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto3_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 
40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.__init__","title":"__init__(s3_path, data_uuid, datatype='csv')
","text":"S3ToDataSourceLight Initialization
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| s3_path | str | The S3 Path to the file to be transformed | required |
| data_uuid | str | The UUID of the SageWorks DataSource to be created | required |
| datatype | str | The datatype of the file to be transformed (defaults to \"csv\") | 'csv' |

Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.input_size_mb","title":"input_size_mb()
","text":"Get the size of the input S3 object in MBytes
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto3_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n
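To make the size arithmetic concrete, here is a small sketch of the same conversion and of the 100 MB gate that transform_impl applies to it. The helper names size_in_mb and is_light are hypothetical; only the divide-by-1_000_000, the rounding, and the greater-than-100 comparison come from the listings on this page.

```python
def size_in_mb(size_in_bytes):
    # Decimal megabytes (1 MB = 1,000,000 bytes), rounded to the nearest integer
    return round(size_in_bytes / 1_000_000)


def is_light(size_in_bytes, limit_mb=100):
    # Mirrors the `> 100` check in transform_impl: at or under the limit is light
    return size_in_mb(size_in_bytes) <= limit_mb


print(size_in_mb(1_499_999))   # 1
print(is_light(100_400_000))   # True: rounds to 100 MB
print(is_light(100_600_000))   # False: rounds to 101 MB
```

Because the size is rounded before the comparison, an object slightly over 100,000,000 bytes can still pass the check.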
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto3_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto3_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n
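Note that the listing above treats any datatype other than csv as JSON Lines. The following standard-library sketch shows the same dispatch with an explicit failure for unknown values; read_records is a hypothetical helper that works on in-memory text, not the awswrangler calls used by SageWorks.

```python
import csv
import io
import json


def read_records(text, datatype='csv'):
    # Stricter than the source listing, which sends every non-csv datatype to the JSON reader
    if datatype == 'csv':
        return list(csv.DictReader(io.StringIO(text)))
    if datatype == 'json':
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError('Unsupported datatype: ' + repr(datatype))


# Hypothetical sample inputs
csv_text = '''length,rings
0.455,15
'''
json_text = json.dumps({'length': 0.455, 'rings': 15})

print(read_records(csv_text, 'csv'))
print(read_records(json_text, 'json'))
```

The CSV reader in this sketch returns every value as a string, whereas the DataFrame readers in the listing infer column types.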
"},{"location":"core_classes/transforms/data_to_features/","title":"Data To Features","text":"API Classes
For most users, the API Classes provide all the general functionality needed to create a full AWS ML Pipeline.
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
MolecularDescriptors: Compute a Feature Set based on RDKit Descriptors
An alternative to using this class is to use the compute_molecular_descriptors function directly:
df_features = compute_molecular_descriptors(df)\nto_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"id\")\nto_features.set_output_tags([\"blah\", \"whatever\"])\nto_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight","title":"DataToFeaturesLight
","text":" Bases: Transform
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
Common Usage:
to_features = DataToFeaturesLight(data_uuid, feature_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n# id_column is required (use \"auto\" for auto-generated IDs); event_time_column and query are optional\nto_features.transform(id_column=\"id\", event_time_column=\"date\", query=None)\n
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
class DataToFeaturesLight(Transform):\n \"\"\"DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas\n\n Common Usage:\n ```python\n to_features = DataToFeaturesLight(data_uuid, feature_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.transform(id_column=\"id\"/None, event_time_column=\"date\"/None, query=str/None)\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n\n def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n # Check if there are any columns that are greater than 64 characters\n for col in self.input_df.columns:\n if len(col) > 64:\n raise ValueError(f\"Column name '{col}' > 64 characters. 
AWS FeatureGroup limits to 64 characters.\")\n\n def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n\n def post_transform(self, id_column, event_time_column=None, one_hot_columns=None, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n\n Args:\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid)\n output_features.set_input(\n self.output_df, id_column=id_column, event_time_column=event_time_column, one_hot_columns=one_hot_columns\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"DataToFeaturesLight Initialization
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_uuid | str | The UUID of the SageWorks DataSource to be transformed | required |
| feature_uuid | str | The UUID of the SageWorks FeatureSet to be created | required |

Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.post_transform","title":"post_transform(id_column, event_time_column=None, one_hot_columns=None, **kwargs)
","text":"At this point the output DataFrame should be populated, so publish it as a Feature Set
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| id_column | str | The ID column (must be specified, use \"auto\" for auto-generated IDs). | required |
| event_time_column | str | The name of the event time column (default: None). | None |
| one_hot_columns | list | The list of columns to one-hot encode (default: None). | None |

Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def post_transform(self, id_column, event_time_column=None, one_hot_columns=None, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n\n Args:\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid)\n output_features.set_input(\n self.output_df, id_column=id_column, event_time_column=event_time_column, one_hot_columns=one_hot_columns\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.pre_transform","title":"pre_transform(query=None, **kwargs)
","text":"Pull the input DataSource into our Input Pandas DataFrame Args: query(str): Optional query to filter the input DataFrame
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n # Check if there are any columns that are greater than 64 characters\n for col in self.input_df.columns:\n if len(col) > 64:\n raise ValueError(f\"Column name '{col}' > 64 characters. AWS FeatureGroup limits to 64 characters.\")\n
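The listing above raises on the first column name longer than 64 characters. When preparing a DataFrame ahead of time, it can be handy to collect every offending name in one pass so they can all be renamed together. A minimal sketch, where find_long_columns is a hypothetical helper and the 64-character limit is taken from the error message in the listing:

```python
MAX_FEATURE_NAME_LEN = 64  # limit quoted in the ValueError message above


def find_long_columns(columns, max_len=MAX_FEATURE_NAME_LEN):
    # Return every column name longer than max_len characters
    return [col for col in columns if len(col) > max_len]


# Hypothetical column names: the last one is 65 characters long
columns = ['id', 'smiles', 'x' * 65]
too_long = find_long_columns(columns)
print(len(too_long))
```

With a pandas DataFrame, the same helper could be called as find_long_columns(df.columns).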
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.transform_impl","title":"transform_impl(**kwargs)
","text":"Transform the input DataFrame into a Feature Set
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n
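Since the reference implementation is a pass-through that subclasses are expected to override, the pattern can be sketched with plain Python classes. These stand-ins are hypothetical and deliberately do not inherit from the SageWorks Transform class; they only illustrate how an overridden transform_impl changes the output while the surrounding flow stays the same.

```python
class LightTransformSketch:
    # Stand-in for the pre_transform / transform_impl / post_transform flow
    def __init__(self, rows):
        self.input_rows = rows
        self.output_rows = None

    def transform_impl(self):
        # Reference behavior: pass the input through unchanged
        self.output_rows = self.input_rows

    def transform(self):
        self.transform_impl()
        return self.output_rows


class AddAreaSketch(LightTransformSketch):
    def transform_impl(self):
        # Subclass override: derive a new column from existing ones
        self.output_rows = [dict(row, area=row['length'] * row['width']) for row in self.input_rows]


rows = [{'length': 2, 'width': 3}]
print(LightTransformSketch(rows).transform())
print(AddAreaSketch(rows).transform())
```

MolecularDescriptors on this page follows the same shape: it overrides only transform_impl and inherits the rest from DataToFeaturesLight.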
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors","title":"MolecularDescriptors
","text":" Bases: DataToFeaturesLight
MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource
Common Usage:
to_features = MolecularDescriptors(data_uuid, feature_uuid)\nto_features.set_output_tags([\"aqsol\", \"whatever\"])\nto_features.transform()\n
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
class MolecularDescriptors(DataToFeaturesLight):\n \"\"\"MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource\n\n Common Usage:\n ```python\n to_features = MolecularDescriptors(data_uuid, feature_uuid)\n to_features.set_output_tags([\"aqsol\", \"whatever\"])\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Compute/add all the Molecular Descriptors\n self.output_df = compute_molecular_descriptors(self.input_df)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"MolecularDescriptors Initialization
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_uuid | str | The UUID of the SageWorks DataSource to be transformed | required |
| feature_uuid | str | The UUID of the SageWorks FeatureSet to be created | required |

Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.transform_impl","title":"transform_impl(**kwargs)
","text":"Compute a Feature Set based on RDKit Descriptors
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Compute/add all the Molecular Descriptors\n self.output_df = compute_molecular_descriptors(self.input_df)\n
"},{"location":"core_classes/transforms/features_to_model/","title":"Features To Model","text":"API Classes
For most users, the API Classes provide all the general functionality needed to create a full AWS ML Pipeline.
FeaturesToModel: Train/Create a Model from a Feature Set
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel","title":"FeaturesToModel
","text":" Bases: Transform
FeaturesToModel: Train/Create a Model from a FeatureSet
Common Usage:
from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\nto_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType.CLASSIFIER)\nto_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_model.transform(target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"])\n
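When feature_list is omitted, the transform_impl source listed on this page guesses one: it drops bookkeeping columns and the target, then keeps only columns whose Feature Store type is Integral or Fractional. The following is a simplified sketch of that filtering, assuming a hypothetical dict that maps column names to type strings; guess_features is an illustrative name, not a SageWorks function.

```python
# Columns excluded by the guessing logic in the source listing
BOOKKEEPING = ['id', '__index_level_0__', 'write_time', 'api_invocation_time',
               'is_deleted', 'event_time', 'training']


def guess_features(column_types, target_column):
    # column_types: hypothetical dict of column name -> Feature Store type
    skip = set(BOOKKEEPING) | {target_column}
    return [name for name, ftype in column_types.items()
            if name not in skip and ftype in ('Integral', 'Fractional')]


# Hypothetical column details for an abalone-style feature set
column_types = {
    'id': 'Integral',
    'length': 'Fractional',
    'sex': 'String',
    'event_time': 'Fractional',
    'class_number_of_rings': 'Integral',
}
print(guess_features(column_types, 'class_number_of_rings'))
```

The source logs a warning whenever it has to guess, so passing an explicit feature_list, as in the usage above, remains the safer choice.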
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
class FeaturesToModel(Transform):\n \"\"\"FeaturesToModel: Train/Create a Model from a FeatureSet\n\n Common Usage:\n ```python\n from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n to_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\n to_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_model.transform(target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"])\n ```\n \"\"\"\n\n def __init__(\n self,\n feature_uuid: str,\n model_uuid: str,\n model_type: ModelType,\n model_class=None,\n model_import_str=None,\n custom_script=None,\n ):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str, optional): The class of the model (default None)\n model_import_str (str, optional): The import string for the model (default None)\n custom_script (str, optional): Custom script to use for the model (default None)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.model_import_str = model_import_str\n self.custom_script = custom_script\n self.estimator = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n\n def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n ):\n \"\"\"Generic Features to Model: Note you should 
create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n ModelCore.managed_delete(self.output_uuid)\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. 
write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.columns\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n feature_list = [c for c in feature_list if c not in remove_columns]\n\n # Set the final feature list\n self.model_feature_list = feature_list\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Custom Script\n if self.custom_script:\n script_path = self.custom_script\n self.log.info(f\"Custom script path: {script_path}\")\n # Fixme: We'll need to circle back to this later\n copy_imports_to_script_dir(script_path, [\"sageworks.utils.chem_utils\"])\n\n # We're using one of the built-in model script templates\n else:\n # Set up our parameters for the model script\n template_params = {\n \"model_imports\": self.model_import_str,\n \"model_type\": self.model_type,\n \"model_class\": self.model_class,\n \"target_column\": self.target_column,\n 
\"feature_list\": self.model_feature_list,\n \"model_metrics_s3_path\": f\"{self.model_training_root}/{self.output_uuid}\",\n \"train_all_data\": train_all_data,\n }\n # Generate our model script\n script_path = generate_model_script(template_params)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.table\n self.class_labels = feature_set.query(f'select DISTINCT {self.target_column} FROM \"{table}\"')[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.important(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Take the full script path and extract 
the entry point and source directory\n entry_point = str(Path(script_path).name)\n source_dir = str(Path(script_path).parent)\n\n # Create a Sagemaker Model with our script\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.estimator = SKLearn(\n entry_point=entry_point,\n source_dir=source_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n image_uri=image,\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.now(timezone.utc).strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto3_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, 
self.model_feature_list)\n\n def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.log.important(f\"Registering model {self.output_uuid} with image {image}...\")\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n image_uri=image,\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
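The metric definitions in the source above pair a metric name with a regex that is applied to the training job's log output. Here is a minimal sketch of how such a regex pulls a value out of a log line, using only Python's `re` module. The helper `extract_metrics` and the sample log text are hypothetical illustrations and are not part of SageWorks.

```python
import re

# Regression metric definitions, copied from the source above
metric_definitions = [
    {'Name': 'RMSE', 'Regex': 'RMSE: ([0-9.]+)'},
    {'Name': 'MAE', 'Regex': 'MAE: ([0-9.]+)'},
    {'Name': 'R2', 'Regex': 'R2: ([0-9.]+)'},
    {'Name': 'NumRows', 'Regex': 'NumRows: ([0-9]+)'},
]


def extract_metrics(log_text, definitions):
    # Hypothetical helper: apply each regex and keep the first captured value
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_text)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


# Made-up log output in the format the regexes expect
sample_log = 'RMSE: 2.31\nMAE: 1.62\nR2: 0.54\nNumRows: 834'
metrics = extract_metrics(sample_log, metric_definitions)
print(metrics)
```

Note that the pattern `[0-9.]+` does not match a leading minus sign, so a negative R2 value would not be captured by this regex.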
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.__init__","title":"__init__(feature_uuid, model_uuid, model_type, model_class=None, model_import_str=None, custom_script=None)
","text":"FeaturesToModel Initialization Args: feature_uuid (str): UUID of the FeatureSet to use as input model_uuid (str): UUID of the Model to create as output model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc. model_class (str, optional): The class of the model (default None) model_import_str (str, optional): The import string for the model (default None) custom_script (str, optional): Custom script to use for the model (default None)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def __init__(\n self,\n feature_uuid: str,\n model_uuid: str,\n model_type: ModelType,\n model_class=None,\n model_import_str=None,\n custom_script=None,\n):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str, optional): The class of the model (default None)\n model_import_str (str, optional): The import string for the model (default None)\n custom_script (str, optional): Custom script to use for the model (default None)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.model_import_str = model_import_str\n self.custom_script = custom_script\n self.estimator = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.create_and_register_model","title":"create_and_register_model()
","text":"Create and Register the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.log.important(f\"Registering model {self.output_uuid} with image {image}...\")\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n image_uri=image,\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() on the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.transform_impl","title":"transform_impl(target_column, description=None, feature_list=None, train_all_data=False)
","text":"Generic Features to Model: Note you should create a new class and inherit from this one to include specific logic for your Feature Set/Model Args: target_column (str): Column name of the target variable description (str): Description of the model (optional) feature_list (list[str]): A list of columns for the features (default None, will try to guess) train_all_data (bool): Train on ALL (100%) of the data (default False)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n ModelCore.managed_delete(self.output_uuid)\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. 
write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.columns\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n feature_list = [c for c in feature_list if c not in remove_columns]\n\n # Set the final feature list\n self.model_feature_list = feature_list\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Custom Script\n if self.custom_script:\n script_path = self.custom_script\n self.log.info(\"Custom script path: {script_path}\")\n # Fixme: We'll need to circle back to this later\n copy_imports_to_script_dir(script_path, [\"sageworks.utils.chem_utils\"])\n\n # We're using one of the built-in model script templates\n else:\n # Set up our parameters for the model script\n template_params = {\n \"model_imports\": self.model_import_str,\n \"model_type\": self.model_type,\n \"model_class\": self.model_class,\n \"target_column\": self.target_column,\n 
\"feature_list\": self.model_feature_list,\n \"model_metrics_s3_path\": f\"{self.model_training_root}/{self.output_uuid}\",\n \"train_all_data\": train_all_data,\n }\n # Generate our model script\n script_path = generate_model_script(template_params)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.table\n self.class_labels = feature_set.query(f'select DISTINCT {self.target_column} FROM \"{table}\"')[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.important(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Take the full script path and extract 
the entry point and source directory\n entry_point = str(Path(script_path).name)\n source_dir = str(Path(script_path).parent)\n\n # Create a Sagemaker Model with our script\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.estimator = SKLearn(\n entry_point=entry_point,\n source_dir=source_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n image_uri=image,\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.now(timezone.utc).strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto3_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n
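When no feature list is given, the logic above filters out bookkeeping columns and the target, then keeps only Integral and Fractional columns. The following is a minimal standalone sketch of that filtering. The function `guess_feature_list` and the example column names and types are assumptions made for illustration; in SageWorks the column types come from `feature_set.column_details()`.

```python
# Columns that are never treated as features (taken from the source above)
FILTER_COLUMNS = [
    'id',
    '__index_level_0__',
    'write_time',
    'api_invocation_time',
    'is_deleted',
    'event_time',
    'training',
]


def guess_feature_list(column_details, target_column):
    # column_details maps column name -> Feature Store type (hypothetical input)
    excluded = set(FILTER_COLUMNS) | {target_column}
    candidates = [c for c in column_details if c not in excluded]
    # Only Integral and Fractional columns can be used for modeling
    return [c for c in candidates if column_details[c] in ('Integral', 'Fractional')]


# Made-up column details for an abalone-style FeatureSet
column_details = {
    'id': 'Integral',
    'length': 'Fractional',
    'sex': 'String',
    'whole_weight': 'Fractional',
    'event_time': 'Fractional',
    'class_number_of_rings': 'Integral',
}
features = guess_feature_list(column_details, 'class_number_of_rings')
print(features)
```

As the warning in the source says, guessing is a fallback; passing an explicit `feature_list` avoids surprises such as a numeric identifier column being treated as a feature.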
"},{"location":"core_classes/transforms/features_to_model/#supported-models","title":"Supported Models","text":"Currently SageWorks supports XGBoost (classifier/regressor), and Scikit Learn models. Those models can be created by just specifying different parameters to the FeaturesToModel
class. The main issue with the supported models is they are vanilla versions with default parameters, any customization should be done with Custom Models
from sageworks.api import ModelType\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# XGBoost Regression Model\ninput_uuid = \"abalone_features\"\noutput_uuid = \"abalone-regression\"\nto_model = FeaturesToModel(input_uuid, output_uuid, model_type=ModelType.REGRESSOR)\nto_model.set_output_tags([\"abalone\", \"public\"])\nto_model.transform(target_column=\"class_number_of_rings\", description=\"Abalone Regression\")\n\n# XGBoost Classification Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-classification\"\nto_model = FeaturesToModel(input_uuid, output_uuid, ModelType.CLASSIFIER)\nto_model.set_output_tags([\"wine\", \"public\"])\nto_model.transform(target_column=\"wine_class\", description=\"Wine Classification\")\n\n# Quantile Regression Model (Abalone)\ninput_uuid = \"abalone_features\"\noutput_uuid = \"abalone-quantile-reg\"\nto_model = FeaturesToModel(input_uuid, output_uuid, ModelType.QUANTILE_REGRESSOR)\nto_model.set_output_tags([\"abalone\", \"quantiles\"])\nto_model.transform(target_column=\"class_number_of_rings\", description=\"Abalone Quantile Regression\")\n
"},{"location":"core_classes/transforms/features_to_model/#scikit-learn","title":"Scikit-Learn","text":"from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# Scikit-Learn Kmeans Clustering Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-clusters\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"KMeans\", # Clustering algorithm\n model_import_str=\"from sklearn.cluster import KMeans\", # Import statement for KMeans\n model_type=ModelType.CLUSTERER,\n)\nto_model.set_output_tags([\"wine\", \"clustering\"])\nto_model.transform(target_column=None, description=\"Wine Clustering\", train_all_data=True)\n\n# Scikit-Learn HDBSCAN Clustering Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-clusters-hdbscan\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"HDBSCAN\", # Density-based clustering algorithm\n model_import_str=\"from sklearn.cluster import HDBSCAN\",\n model_type=ModelType.CLUSTERER,\n)\nto_model.set_output_tags([\"wine\", \"density-based clustering\"])\nto_model.transform(target_column=None, description=\"Wine Clustering with HDBSCAN\", train_all_data=True)\n\n# Scikit-Learn 2D Projection Model using UMAP\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-2d-projection\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"UMAP\",\n model_import_str=\"from umap import UMAP\",\n model_type=ModelType.PROJECTION,\n)\nto_model.set_output_tags([\"wine\", \"2d-projection\"])\nto_model.transform(target_column=None, description=\"Wine 2D Projection\", train_all_data=True)\n
"},{"location":"core_classes/transforms/features_to_model/#custom-models","title":"Custom Models","text":"For custom models we recommend the following steps:
Experimental
The SageWorks Custom Models are currently in experimental mode, so have fun but expect issues. Requires sageworks >= 0.8.60. Feel free to submit issues to SageWorks GitHub
from sageworks.api import ModelType\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# Note this directory should also have a requirements.txt in it\nmy_custom_script = \"/full/path/to/my/directory/my_custom_script.py\"\ninput_uuid = \"wine_features\" # FeatureSet you want to use\noutput_uuid = \"my-custom-model\" # change to whatever\ntarget_column = \"wine_class\" # change to whatever\nto_model = FeaturesToModel(input_uuid, output_uuid,\n model_type=ModelType.CLASSIFIER, \n custom_script=my_custom_script)\nto_model.set_output_tags([\"your\", \"tags\"])\nto_model.transform(target_column=target_column, description=\"Custom Model\")\n
"},{"location":"core_classes/transforms/features_to_model/#custom-models-create-an-endpointrun-inference","title":"Custom Models: Create an Endpoint/Run Inference","text":"from sageworks.api import Model, Endpoint\n\nmodel = Model(\"my-custom-model\")\nend = model.to_endpoint() # Note: This takes a while\n\n# Now run inference on my custom model :)\nend.auto_inference(capture=True)\n\n# Run inference with my own dataframe\ndf = fs.pull_dataframe() # Or whatever dataframe\nend.inference(df)\n
"},{"location":"core_classes/transforms/features_to_model/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/transforms/model_to_endpoint/","title":"Model to Endpoint","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
ModelToEndpoint: Deploy an Endpoint for a Model
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint","title":"ModelToEndpoint
","text":" Bases: Transform
ModelToEndpoint: Deploy an Endpoint for a Model
Common Usage:\nto_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\nto_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\nto_endpoint.transform()\n
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
class ModelToEndpoint(Transform):\n \"\"\"ModelToEndpoint: Deploy an Endpoint for a Model\n\n Common Usage:\n ```python\n to_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\n to_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\n to_endpoint.transform()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n # Make sure the endpoint_uuid is a valid name\n Artifact.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.serverless = serverless\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n\n def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n EndpointCore.managed_delete(self.output_uuid)\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Deploy the model\n self._deploy_model(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n\n def _deploy_model(self, model_package_arn: str):\n \"\"\"Internal Method: Deploy the Model\n\n Args:\n model_package_arn(str): The Model Package ARN used to deploy the Endpoint\n \"\"\"\n # Grab the specified Model Package\n model_package = ModelPackage(\n 
role=self.sageworks_role_arn,\n model_package_arn=model_package_arn,\n sagemaker_session=self.sm_session,\n )\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Is this a serverless deployment?\n serverless_config = None\n if self.serverless:\n serverless_config = ServerlessInferenceConfig(\n memory_size_in_mb=2048,\n max_concurrency=5,\n )\n\n # Deploy the Endpoint\n self.log.important(f\"Deploying the Endpoint {self.output_uuid}...\")\n model_package.deploy(\n initial_instance_count=1,\n instance_type=self.instance_type,\n serverless_inference_config=serverless_config,\n endpoint_name=self.output_uuid,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n tags=aws_tags,\n )\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid)\n output_endpoint.onboard_with_args(input_model=self.input_uuid)\n
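The `serverless` flag drives two choices in the source above: the instance type and whether a serverless configuration is passed to the deploy call. Here is a minimal sketch of that mapping as a plain function. `deployment_settings` is a hypothetical helper that returns dictionaries instead of calling the SageMaker SDK, so it runs without AWS access.

```python
def deployment_settings(serverless=True):
    # Hypothetical helper mirroring the choices in __init__ and _deploy_model
    settings = {
        'initial_instance_count': 1,
        'instance_type': 'serverless' if serverless else 'ml.t2.medium',
        'serverless_config': None,
    }
    if serverless:
        # Same values the source passes to ServerlessInferenceConfig
        settings['serverless_config'] = {
            'memory_size_in_mb': 2048,
            'max_concurrency': 5,
        }
    return settings


serverless_settings = deployment_settings(serverless=True)
realtime_settings = deployment_settings(serverless=False)
print(serverless_settings)
print(realtime_settings)
```

In the actual class the non-serverless path is selected with `ModelToEndpoint(model_uuid, endpoint_uuid, serverless=False)`, as shown in the `__init__` signature.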
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.__init__","title":"__init__(model_uuid, endpoint_uuid, serverless=True)
","text":"ModelToEndpoint Initialization Args: model_uuid(str): The UUID of the input Model endpoint_uuid(str): The UUID of the output Endpoint serverless(bool): Deploy the Endpoint in serverless mode (default: True)
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n # Make sure the endpoint_uuid is a valid name\n Artifact.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.serverless = serverless\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the Endpoint
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid)\n output_endpoint.onboard_with_args(input_model=self.input_uuid)\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.transform_impl","title":"transform_impl()
","text":"Deploy an Endpoint for a Model
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n EndpointCore.managed_delete(self.output_uuid)\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Deploy the model\n self._deploy_model(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n
"},{"location":"core_classes/transforms/overview/","title":"Transforms","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
SageWorks currently has a large set of Transforms that go from one Artifact type to another (e.g. DataSource to FeatureSet). The Transforms will often have light and heavy versions depending on the scale of data that needs to be transformed.
"},{"location":"core_classes/transforms/overview/#transform-details","title":"Transform Details","text":"API Classes
The API Classes will often provide helpful methods that give you a DataFrame (data_source.query() for instance), so always check out the API Classes first.
These Transforms will give you the ultimate in customization and flexibility when creating AWS Machine Learning Pipelines. Grab a Pandas DataFrame from a DataSource or FeatureSet, process it in whatever way suits your use case, and simply create another SageWorks DataSource or FeatureSet from the resulting DataFrame.
Lots of Options:
Not for Large Data
Pandas Transforms can't handle large datasets (> 4 gigabytes). For transforms on large data, see our Heavy Transforms
Welcome to the SageWorks Pandas Transform Classes
These classes provide low-level APIs for using Pandas DataFrames
DataToPandas
","text":" Bases: Transform
DataToPandas: Class to transform a Data Source into a Pandas DataFrame
Common Usage:\ndata_to_df = DataToPandas(data_source_uuid)\ndata_to_df.transform(query=<optional SQL query to filter/process data>)\ndata_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = data_to_df.get_output()\n\nNote: query is the best way to use this class, so use it :)\n
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
class DataToPandas(Transform):\n \"\"\"DataToPandas: Class to transform a Data Source into a Pandas DataFrame\n\n Common Usage:\n ```python\n data_to_df = DataToPandas(data_source_uuid)\n data_to_df.transform(query=<optional SQL query to filter/process data>)\n data_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = data_to_df.get_output()\n\n Note: query is the best way to use this class, so use it :)\n ```\n \"\"\"\n\n def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n\n def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. 
sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.__init__","title":"__init__(input_uuid)
","text":"DataToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.transform_impl","title":"transform_impl(query=None, max_rows=100000)
","text":"Convert the DataSource into a Pandas DataFrame. Args: query (str): The query to run against the DataSource (default: None); max_rows (int): The maximum number of rows to return (default: 100000)
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n
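The sampling branch above derives a whole-number percentage from max_rows and num_rows. The sketch below reproduces only that arithmetic in plain Python so the rounding behavior is easy to inspect. The helper name build_sample_query is hypothetical and not part of SageWorks, and no query is actually run.

```python
def build_sample_query(table: str, num_rows: int, max_rows: int = 100000) -> str:
    '''Hypothetical helper (not part of SageWorks) mirroring the sampling arithmetic.'''
    if num_rows > max_rows:
        # Same expression as transform_impl: a whole-number percentage
        percentage = round(max_rows * 100.0 / num_rows)
        return f'SELECT * FROM {table} TABLESAMPLE BERNOULLI({percentage})'
    # Small enough: select everything
    return f'SELECT * FROM {table}'


print(build_sample_query('my_table', 1_000_000))    # 10 percent sample
print(build_sample_query('my_table', 50_000))       # no sampling needed
print(build_sample_query('my_table', 100_000_000))  # ratio of 0.1 rounds down to 0
```

Two things this makes visible: Bernoulli sampling is probabilistic, so the returned row count only approximates max_rows, and with this rounding any ratio below 0.5 percent becomes a percentage of 0.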
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas","title":"FeaturesToPandas
","text":" Bases: Transform
FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame
Common Usage:\nfeature_to_df = FeaturesToPandas(feature_set_uuid)\nfeature_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = feature_to_df.get_output()\n
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
class FeaturesToPandas(Transform):\n \"\"\"FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame\n\n Common Usage:\n ```python\n feature_to_df = FeaturesToPandas(feature_set_uuid)\n feature_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = feature_to_df.get_output()\n ```\n \"\"\"\n\n def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n\n def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. 
sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.__init__","title":"__init__(feature_set_name)
","text":"FeaturesToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
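FeaturesToPandas.get_output() runs the transform on first use when it has not been run yet, whereas DataToPandas.get_output() simply returns output_df. A minimal stdlib-only sketch of that run-on-demand pattern follows; LazyResult and its loader argument are illustrative names, not SageWorks classes.

```python
from typing import Callable, Optional


class LazyResult:
    '''Illustrative stand-in for a transform whose output is computed on demand.'''

    def __init__(self, loader: Callable[[], list]):
        self.loader = loader  # hypothetical callable that produces the output
        self.output: Optional[list] = None
        self.transform_run = False
        self.calls = 0  # counts how often the work was actually done

    def transform(self) -> None:
        # Do the (possibly expensive) work and remember that it ran
        self.calls += 1
        self.output = self.loader()
        self.transform_run = True

    def get_output(self) -> list:
        # Same guard as FeaturesToPandas.get_output(): run only if needed
        if not self.transform_run:
            self.transform()
        return self.output


result = LazyResult(lambda: [1, 2, 3])
print(result.get_output(), result.get_output(), result.calls)
```

The second get_output() call reuses the stored output, so the loader runs only once.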
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.transform_impl","title":"transform_impl(max_rows=100000)
","text":"Convert the FeatureSet into a Pandas DataFrame
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n
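Before building its query, the method above drops the metadata columns that may be added to a Feature Set. The sketch below isolates that filtering step using plain lists; feature_columns is a hypothetical helper name, and the column names in the example call are made up for illustration.

```python
# Metadata column names taken from transform_impl above
FILTER_COLUMNS = ['write_time', 'api_invocation_time', 'is_deleted']


def feature_columns(columns: list) -> str:
    '''Hypothetical helper: comma-separated column list without metadata columns.'''
    kept = [c for c in columns if c not in FILTER_COLUMNS]
    return ', '.join(kept)


print(feature_columns(['id', 'height', 'write_time', 'is_deleted', 'event_time']))
```

Note that event_time is not in the filter list, so it stays in the selected columns.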
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData","title":"PandasToData
","text":" Bases: Transform
PandasToData: Class to publish a Pandas DataFrame as a DataSource
Common Usage:\ndf_to_data = PandasToData(output_uuid)\ndf_to_data.set_output_tags([\"test\", \"small\"])\ndf_to_data.set_input(test_df)\ndf_to_data.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
class PandasToData(Transform):\n \"\"\"PandasToData: Class to publish a Pandas DataFrame as a DataSource\n\n Common Usage:\n ```python\n df_to_data = PandasToData(output_uuid)\n df_to_data.set_output_tags([\"test\", \"small\"])\n df_to_data.set_input(test_df)\n df_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n\n def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n\n def delete_existing(self):\n # Delete the existing DataSource if it exists\n self.log.info(f\"Deleting the {self.output_uuid} DataSource...\")\n AthenaSource.managed_delete(self.output_uuid)\n time.sleep(1)\n\n def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n\n def convert_object_to_datetime(self, df: pd.DataFrame) -> 
pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n\n @staticmethod\n def convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete the existing DataSource if it exists\"\"\"\n self.delete_existing()\n\n def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = 
self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works well for most use cases. 
We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.__init__","title":"__init__(output_uuid, output_format='parquet')
","text":"PandasToData Initialization. Args: output_uuid (str): The UUID of the DataSource to create; output_format (str): The file format to store the S3 object data in (default: \"parquet\")
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_datetime_columns","title":"convert_datetime_columns(df)
staticmethod
","text":"Convert datetime columns to ISO-8601 string
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
@staticmethod\ndef convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n
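The method above maps each datetime value through datetime_to_iso8601, whose implementation is not shown in this section. As a rough illustration of what such a conversion can look like, here is a stdlib-only sketch; the function name to_iso8601, the millisecond precision, the trailing Z, and the treatment of naive values as UTC are all assumptions rather than the SageWorks behavior.

```python
from datetime import datetime, timezone


def to_iso8601(dt: datetime) -> str:
    '''Illustrative formatter; the real datetime_to_iso8601 may differ.'''
    # Assumption: naive datetimes are treated as UTC
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Normalize to UTC, keep millisecond precision, and use Z for the offset
    text = dt.astimezone(timezone.utc).isoformat(timespec='milliseconds')
    return text.replace('+00:00', 'Z')


print(to_iso8601(datetime(2024, 1, 2, 3, 4, 5, 123000)))
```

Emitting one fixed format matters because a consumer may accept only a particular ISO-8601 variant rather than every valid one.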
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_datetime","title":"convert_object_to_datetime(df)
","text":"Try to automatically convert object columns to datetime or string columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_string","title":"convert_object_to_string(df)
","text":"Try to automatically convert object columns to string columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the DataSource
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Delete the existing DataSource if it exists
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete the existing DataSource if it exists\"\"\"\n self.delete_existing()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.set_input","title":"set_input(input_df)
","text":"Set the DataFrame Input for this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.transform_impl","title":"transform_impl(overwrite=True, **kwargs)
","text":"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Parameters: overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket (default: True)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n 
partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works well for most use cases. We recommend using Parquet\n # You can use JSON_EXTRACT on Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n
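One step in the method above is easy to overlook: column names are lowercased before writing, and each renamed column is logged. The sketch below shows that step on a plain list of names instead of a DataFrame; lowercase_columns is a hypothetical helper, not a SageWorks function.

```python
def lowercase_columns(columns: list) -> tuple:
    '''Hypothetical helper: return (lowercased names, names that were changed).'''
    # Names that differ from their lowercase form are the ones that get logged
    changed = [c for c in columns if c != c.lower()]
    return [c.lower() for c in columns], changed


new_names, changed = lowercase_columns(['ID', 'Height', 'weight'])
print(new_names)
print(changed)
```

A side effect worth keeping in mind: two names that differ only by case collapse to the same name after lowercasing, so a check for duplicates before calling set_input may be useful.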
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures","title":"PandasToFeatures
","text":" Bases: Transform
PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet
Common Usage:\nto_features = PandasToFeatures(output_uuid)\nto_features.set_output_tags([\"my\", \"awesome\", \"data\"])\nto_features.set_input(df, id_column=\"my_id\")\nto_features.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
class PandasToFeatures(Transform):\n \"\"\"PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet\n\n Common Usage:\n ```python\n to_features = PandasToFeatures(output_uuid)\n to_features.set_output_tags([\"my\", \"awesome\", \"data\"])\n to_features.set_input(df, id_column=\"my_id\")\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str):\n \"\"\"PandasToFeatures Initialization\n\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.id_column = None\n self.event_time_column = None\n self.one_hot_columns = []\n self.categorical_dtypes = {} # Used for streaming/chunking\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n self.incoming_hold_out_ids = None\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n\n def set_input(self, input_df: pd.DataFrame, id_column, event_time_column=None, one_hot_columns=None):\n \"\"\"Set the Input DataFrame for this Transform\n\n Args:\n input_df (pd.DataFrame): The input DataFrame.\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n self.one_hot_columns = one_hot_columns or []\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n\n def delete_existing(self):\n # Delete the existing 
FeatureSet if it exists\n self.log.info(f\"Deleting the {self.output_uuid} FeatureSet...\")\n FeatureSetCore.managed_delete(self.output_uuid)\n time.sleep(1)\n\n def _ensure_id_column(self):\n \"\"\"Internal: AWS Feature Store requires an Id field\"\"\"\n if self.id_column in [\"auto\", \"index\"]:\n self.log.info(\"Generating an 'auto_id' column from the dataframe index..\")\n self.output_df[\"auto_id\"] = self.output_df.index\n return\n if self.id_column not in self.output_df.columns:\n error_msg = f\"Id column {self.id_column} not found in the DataFrame\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n def _ensure_event_time(self):\n \"\"\"Internal: AWS Feature Store requires an event_time field for all data stored\"\"\"\n if self.event_time_column is None or self.event_time_column not in self.output_df.columns:\n self.log.info(\"Generating an event_time column before FeatureSet Creation...\")\n self.event_time_column = \"event_time\"\n self.output_df[self.event_time_column] = pd.Timestamp(\"now\", tz=\"UTC\")\n\n # The event_time_column is defined, so we need to make sure it's in ISO-8601 string format\n # Note: AWS Feature Store only accepts a particular ISO-8601 format, not ALL ISO-8601 formats\n time_column = self.output_df[self.event_time_column]\n\n # If the event_time_column is of type object or string, convert it to DateTime\n if time_column.dtypes == \"object\" or time_column.dtypes.name == \"string\":\n self.log.info(f\"Converting {self.event_time_column} to DateTime...\")\n time_column = pd.to_datetime(time_column)\n\n # Let's make sure it is the right type for Feature Store\n if pd.api.types.is_datetime64_any_dtype(time_column):\n self.log.info(f\"Converting {self.event_time_column} to ISOFormat Date String before FeatureSet Creation...\")\n\n # Convert the datetime DType to ISO-8601 string\n # TableFormat=ICEBERG does not support alternate formats for event_time field, it only supports String type.\n time_column = 
time_column.map(datetime_to_iso8601)\n self.output_df[self.event_time_column] = time_column.astype(\"string\")\n\n def _convert_objs_to_string(self):\n \"\"\"Internal: AWS Feature Store doesn't know how to store object dtypes, so convert to String\"\"\"\n for col in self.output_df:\n if pd.api.types.is_object_dtype(self.output_df[col].dtype):\n self.output_df[col] = self.output_df[col].astype(pd.StringDtype())\n\n def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n\n def shorten_column_name(self, name, max_length=20):\n if len(name) <= max_length:\n return name\n\n # Start building the new name from the end\n parts = name.split(\"_\")[::-1]\n new_name = \"\"\n for part in parts:\n if len(new_name) + len(part) + 1 <= max_length: # +1 for the underscore\n new_name = f\"{part}_{new_name}\" if new_name else part\n else:\n break\n\n # If new_name is empty, just use the last part of the original name\n if not new_name:\n new_name = parts[0]\n\n self.log.info(f\"Shortening {name} to {new_name}\")\n return new_name\n\n def sanitize_column_name(self, name):\n # Remove all invalid characters\n sanitized = re.sub(\"[^a-zA-Z0-9-_]\", \"_\", name)\n sanitized = re.sub(\"_+\", \"_\", sanitized)\n sanitized = sanitized.strip(\"_\")\n\n # Log the change if the name was altered\n if sanitized != name:\n self.log.info(f\"Sanitizing {name} to {sanitized}\")\n\n return sanitized\n\n def one_hot_encode(self, df, one_hot_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns 
with additional column name management\n\n Args:\n df (pd.DataFrame): The DataFrame to process\n one_hot_columns (list): The list of columns to one-hot encode\n\n Returns:\n pd.DataFrame: The DataFrame with one-hot encoded columns\n \"\"\"\n\n # Grab the current list of columns\n current_columns = list(df.columns)\n\n # Now convert the list of columns into Categorical and then One-Hot Encode\n self.convert_columns_to_categorical(one_hot_columns)\n self.log.important(f\"One-Hot encoding columns: {one_hot_columns}\")\n df = pd.get_dummies(df, columns=one_hot_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n self.log.important(f\"New columns generated: {new_columns}\")\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n\n # Helper Methods\n def convert_columns_to_categorical(self, columns: list):\n \"\"\"Convert column to Categorical type\"\"\"\n for feature in columns:\n if feature not in [self.event_time_column, self.id_column]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 10:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n else:\n self.log.warning(f\"Column {feature} too many unique values {unique_values} skipping...\")\n\n def manual_categorical_converter(self):\n \"\"\"Used for Streaming: Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. 
You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n @staticmethod\n def convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Select all columns that are of datetime dtype and convert them to ISO-8601 strings\n for column in [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n\n def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Remove any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n self.output_df = self.output_df.drop(columns=aws_cols, errors=\"ignore\")\n\n # If one-hot columns are provided then one-hot encode them\n if self.one_hot_columns:\n self.output_df = self.one_hot_encode(self.output_df, self.one_hot_columns)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if 
str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Check for a training column (SageWorks uses dynamic training columns)\n if \"training\" in self.output_df.columns:\n self.log.important(\n \"\"\"Training column detected: Since FeatureSets are read-only, SageWorks creates a training view\n that can be dynamically changed. We'll use this training column to create a training view.\"\"\"\n )\n self.incoming_hold_out_ids = self.output_df[~self.output_df[\"training\"]][self.id_column].tolist()\n self.output_df = self.output_df.drop(columns=[\"training\"])\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n 
table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group\"\"\"\n self.delete_existing()\n self.output_feature_group = self.create_feature_group()\n\n def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(f\"Ingesting rows into Feature Group {self.output_uuid}...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_workers=8, max_processes=2, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows ingested: {self.expected_rows}\")\n\n # We often need to wait a bit for AWS to fully register the new Feature Group\n 
self.log.important(f\"Waiting for AWS to register the new Feature Group {self.output_uuid}...\")\n time.sleep(30)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n # Set Hold Out Ids (if we got them during creation)\n if self.incoming_hold_out_ids:\n self.output_feature_set.set_training_holdouts(self.id_column, self.incoming_hold_out_ids)\n\n def ensure_feature_group_created(self, feature_group):\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n while status == \"Creating\":\n self.log.debug(\"FeatureSet being Created...\")\n time.sleep(5)\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n if status == \"Created\":\n self.log.info(f\"FeatureSet {feature_group.name} successfully created\")\n else:\n self.log.critical(f\"FeatureSet {feature_group.name} creation failed with status: {status}\")\n\n def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n max_retry = 20\n num_retry = 0\n sleep_time = 30\n while rows < expected_rows and num_retry < max_retry:\n num_retry += 1\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n 
self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n msg = f\"Did not reach expected rows ({rows}/{expected_rows})...(probably AWS lag)\"\n self.log.warning(msg)\n self.log.monitor(msg)\n
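The two column-name helpers in the class source above are easy to try on their own. The following is a minimal standalone sketch, assuming only the regular expressions and the loop shown in the source; logging is left out, and the sample column names are made up for illustration.

```python
import re


def sanitize_column_name(name: str) -> str:
    # Replace anything that is not a letter, digit, hyphen or underscore
    sanitized = re.sub('[^a-zA-Z0-9-_]', '_', name)
    # Collapse runs of underscores, then trim them from both ends
    sanitized = re.sub('_+', '_', sanitized)
    return sanitized.strip('_')


def shorten_column_name(name: str, max_length: int = 20) -> str:
    if len(name) <= max_length:
        return name
    # Keep as many trailing underscore-separated parts as will fit
    parts = name.split('_')[::-1]
    new_name = ''
    for part in parts:
        if len(new_name) + len(part) + 1 <= max_length:  # +1 for the underscore
            new_name = f'{part}_{new_name}' if new_name else part
        else:
            break
    # Fall back to the last part of the original name if nothing fit
    return new_name or parts[0]


print(sanitize_column_name('sex (M/F)'))                          # sex_M_F
print(shorten_column_name('very_long_descriptive_feature_name'))  # feature_name
```

One consequence of the fallback: a name longer than max_length that contains no underscores is returned unchanged, because its last part is the whole name.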
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.__init__","title":"__init__(output_uuid)
","text":"PandasToFeatures Initialization
Parameters:
Name | Type | Description | Default
output_uuid | str | The UUID of the FeatureSet to create | required
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def __init__(self, output_uuid: str):\n \"\"\"PandasToFeatures Initialization\n\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.id_column = None\n self.event_time_column = None\n self.one_hot_columns = []\n self.categorical_dtypes = {} # Used for streaming/chunking\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n self.incoming_hold_out_ids = None\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_column_types","title":"convert_column_types(df)
staticmethod
","text":"Convert the types of the DataFrame to the correct types for the Feature Store
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
@staticmethod\ndef convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Select all columns that are of datetime dtype and convert them to ISO-8601 strings\n for column in [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n
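As a rough illustration of the first two conversions, the sketch below applies the same bool and category casts to a small made-up DataFrame. The function name convert_basic_types and the sample columns are hypothetical; the datetime step is omitted because the datetime_to_iso8601 helper is not shown in this section.

```python
import pandas as pd


def convert_basic_types(df: pd.DataFrame) -> pd.DataFrame:
    # Booleans become 32-bit integers
    for column in list(df.select_dtypes(include='bool').columns):
        df[column] = df[column].astype('int32')
    # Categoricals become plain strings
    for column in list(df.select_dtypes(include='category').columns):
        df[column] = df[column].astype('str')
    return df


df = pd.DataFrame({
    'is_adult': [True, False, True],
    'sex': pd.Series(['M', 'F', 'I'], dtype='category'),
    'length': [0.455, 0.350, 0.530],
})
df = convert_basic_types(df)
print(df.dtypes)
```

Like the method in the source, this sketch modifies the DataFrame it is given and returns the same object, so callers that need the original should pass a copy.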
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_columns_to_categorical","title":"convert_columns_to_categorical(columns)
","text":"Convert column to Categorical type
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def convert_columns_to_categorical(self, columns: list):\n \"\"\"Convert column to Categorical type\"\"\"\n for feature in columns:\n if feature not in [self.event_time_column, self.id_column]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 10:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n else:\n self.log.warning(f\"Column {feature} too many unique values {unique_values} skipping...\")\n
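The cardinality test in this method can be isolated as a small predicate. This is a sketch only; the helper name is_one_hot_candidate is hypothetical and the sample data is invented. It also makes visible that the else branch in the source is reached for columns with zero or one unique value, not only for columns with ten or more, even though the warning text says there are too many.

```python
import pandas as pd


def is_one_hot_candidate(series: pd.Series, max_unique: int = 10) -> bool:
    # Same bounds as the source: more than 1 and fewer than 10 distinct values
    return 1 < series.nunique() < max_unique


print(is_one_hot_candidate(pd.Series(['M', 'F', 'I', 'M'])))  # 3 distinct values
print(is_one_hot_candidate(pd.Series(['x', 'x'])))            # 1 distinct value
print(is_one_hot_candidate(pd.Series(range(10))))             # 10 distinct values
```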
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.create_feature_group","title":"create_feature_group()
","text":"Create a Feature Group, load our Feature Definitions, and wait for it to be ready
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.manual_categorical_converter","title":"manual_categorical_converter()
","text":"Used for Streaming: Convert object and string types to Categorical
Note: This method is used for streaming/chunking. You can set the categorical_dtypes attribute to a dictionary of column names and their respective categorical types.
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def manual_categorical_converter(self):\n \"\"\"Used for Streaming: Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n
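A likely reason for fixing the categories up front (this is an assumption, not something the source states) is that an individual chunk may not contain every category value. The sketch below builds the dtype dictionary the same way set_categorical_info does and applies it to a made-up chunk that only contains one of the three values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Column name -> allowed values, as in the Common Usage example
cat_column_info = {'sex': ['M', 'F', 'I']}
categorical_dtypes = {
    col: CategoricalDtype(categories=vals) for col, vals in cat_column_info.items()
}

# A chunk that happens to contain only one of the categories
chunk = pd.DataFrame({'id': [1, 2], 'sex': ['M', 'M']})
for column, cat_d_type in categorical_dtypes.items():
    chunk[column] = chunk[column].astype(cat_d_type)

# The dtype still knows about all three categories
print(list(chunk['sex'].cat.categories))

# One-hot encoding a categorical column yields one column per category
dummies = pd.get_dummies(chunk, columns=['sex'])
print(list(dummies.columns))
```

With a fixed categorical dtype, every chunk produces the same set of columns when encoded, which matters when all chunks must share a single Feature Group schema.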
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.one_hot_encode","title":"one_hot_encode(df, one_hot_columns)
","text":"One Hot Encoding for Categorical Columns with additional column name management
Parameters:
Name | Type | Description | Default
df | DataFrame | The DataFrame to process | required
one_hot_columns | list | The list of columns to one-hot encode | required
Returns:
Type | Description
DataFrame | The DataFrame with one-hot encoded columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def one_hot_encode(self, df, one_hot_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\n\n Args:\n df (pd.DataFrame): The DataFrame to process\n one_hot_columns (list): The list of columns to one-hot encode\n\n Returns:\n pd.DataFrame: The DataFrame with one-hot encoded columns\n \"\"\"\n\n # Grab the current list of columns\n current_columns = list(df.columns)\n\n # Now convert the list of columns into Categorical and then One-Hot Encode\n self.convert_columns_to_categorical(one_hot_columns)\n self.log.important(f\"One-Hot encoding columns: {one_hot_columns}\")\n df = pd.get_dummies(df, columns=one_hot_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n self.log.important(f\"New columns generated: {new_columns}\")\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n
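The core of this method can be reproduced without the class. The sketch below is a simplified stand-in under stated assumptions: the data is invented, the new columns are sorted (the source uses an unordered set difference), and the cast uses a per-column dtype mapping rather than the block assignment in the source.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'sex': ['M', 'F', 'I']})
current_columns = list(df.columns)

# One-hot encode; the original column is replaced by one column per value
encoded = pd.get_dummies(df, columns=['sex'])

# Work out which columns get_dummies added (sorted for a stable order)
new_columns = sorted(set(encoded.columns) - set(current_columns))

# Cast the indicator columns to int32
encoded = encoded.astype({col: 'int32' for col in new_columns})
print(encoded)
```

In the source, each new column name is then passed through process_column_name with its default shorten=False, so the names are sanitized but not shortened.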
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Populating Offline Storage and onboard()
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n # Set Hold Out Ids (if we got them during creation)\n if self.incoming_hold_out_ids:\n self.output_feature_set.set_training_holdouts(self.id_column, self.incoming_hold_out_ids)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group\"\"\"\n self.delete_existing()\n self.output_feature_group = self.create_feature_group()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.prep_dataframe","title":"prep_dataframe()
","text":"Prep the DataFrame for Feature Store Creation
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Remove any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n self.output_df = self.output_df.drop(columns=aws_cols, errors=\"ignore\")\n\n # If one-hot columns are provided then one-hot encode them\n if self.one_hot_columns:\n self.output_df = self.one_hot_encode(self.output_df, self.one_hot_columns)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Check for a training column (SageWorks uses dynamic training columns)\n if \"training\" in self.output_df.columns:\n self.log.important(\n \"\"\"Training column detected: Since FeatureSets are read-only, SageWorks creates a training view\n that can be dynamically changed. We'll use this training column to create a training view.\"\"\"\n )\n self.incoming_hold_out_ids = self.output_df[~self.output_df[\"training\"]][self.id_column].tolist()\n self.output_df = self.output_df.drop(columns=[\"training\"])\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n
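Two of the steps above, lowercasing the column names and turning the training column into a list of hold-out ids, can be seen on a small made-up DataFrame. This is only a sketch: it assumes the id column is named id after lowercasing and that the training column has a boolean dtype, since the ~ operator in the source would not act as a logical negation on integer 0/1 values.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Length': [0.455, 0.350, 0.530, 0.440],
    'training': [True, True, False, True],
})

# Athena needs lowercase column names
df.columns = df.columns.str.lower()

# Rows where training is False become hold-out ids
hold_out_ids = df[~df['training']]['id'].tolist()

# The training column itself is not stored in the Feature Group
df = df.drop(columns=['training'])

print(hold_out_ids)
print(list(df.columns))
```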
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.process_column_name","title":"process_column_name(column, shorten=False)
","text":"Call various methods to make sure the column is ready for Feature Store Args: column (str): The column name to process shorten (bool): Should we shorten the column name? (default: False)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.set_input","title":"set_input(input_df, id_column, event_time_column=None, one_hot_columns=None)
","text":"Set the Input DataFrame for this Transform
Parameters:
Name | Type | Description | Default
input_df | DataFrame | The input DataFrame. | required
id_column | str | The ID column (must be specified, use \"auto\" for auto-generated IDs). | required
event_time_column | str | The name of the event time column (default: None). | None
one_hot_columns | list | The list of columns to one-hot encode (default: None). | None
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def set_input(self, input_df: pd.DataFrame, id_column, event_time_column=None, one_hot_columns=None):\n \"\"\"Set the Input DataFrame for this Transform\n\n Args:\n input_df (pd.DataFrame): The input DataFrame.\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n self.one_hot_columns = one_hot_columns or []\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.transform_impl","title":"transform_impl()
","text":"Transform Implementation: Ingest the data into the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(f\"Ingesting rows into Feature Group {self.output_uuid}...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_workers=8, max_processes=2, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater than or equal to the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows ingested: {self.expected_rows}\")\n\n # We often need to wait a bit for AWS to fully register the new Feature Group\n self.log.important(f\"Waiting for AWS to register the new Feature Group {self.output_uuid}...\")\n time.sleep(30)\n
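The index arithmetic and the row bookkeeping in this method can be checked in plain Python. The helper names below are hypothetical, and the sample indexes are invented; the sketch only restates the two expressions from the source.

```python
def to_relative_indexes(failed_rows: list, df_rows: int) -> list:
    # Same expression as the source: subtract one chunk length from large indexes
    return [idx - df_rows if idx >= df_rows else idx for idx in failed_rows]


def expected_after_chunk(expected_rows: int, chunk_rows: int, failed_rows: list) -> int:
    # Running total of rows that should eventually appear in offline storage
    return expected_rows + chunk_rows - len(failed_rows)


print(to_relative_indexes([2, 7], df_rows=5))   # [2, 2]
print(to_relative_indexes([12], df_rows=5))     # [7]
print(expected_after_chunk(0, 5, [2, 7]))       # 3
```

The second call shows the limitation that the FIXME in the source seems to be pointing at: subtracting a single chunk length leaves 7, which is still out of range for a 5-row DataFrame, so the mapping is only safe when an index is less than twice the chunk length.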
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.wait_for_rows","title":"wait_for_rows(expected_rows)
","text":"Wait for AWS Feature Group to fully populate the Offline Storage
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n max_retry = 20\n num_retry = 0\n sleep_time = 30\n while rows < expected_rows and num_retry < max_retry:\n num_retry += 1\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n msg = f\"Did not reach expected rows ({rows}/{expected_rows})...(probably AWS lag)\"\n self.log.warning(msg)\n self.log.monitor(msg)\n
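The polling pattern here is a bounded retry loop. The sketch below restates it as a standalone function under a few assumptions: the row counter and the sleep function are passed in as callables (both hypothetical parameters) so the loop can be exercised without AWS, and the result is returned as a boolean instead of being logged.

```python
import time
from typing import Callable


def wait_for_rows(
    num_rows: Callable[[], int],
    expected_rows: int,
    max_retry: int = 20,
    sleep_time: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    rows = num_rows()
    num_retry = 0
    # Poll until the count catches up or the retry budget is spent
    while rows < expected_rows and num_retry < max_retry:
        num_retry += 1
        sleep(sleep_time)
        rows = num_rows()
    return rows == expected_rows


# Demo with a fake row counter and a no-op sleep so it runs instantly
fake_counts = iter([0, 40, 100])
reached = wait_for_rows(lambda: next(fake_counts), expected_rows=100, sleep=lambda seconds: None)
print(reached)
```

With the defaults in the source (20 retries of 30 seconds), the loop sleeps for at most 600 seconds, about 10 minutes, before giving up and logging the warning.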
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked","title":"PandasToFeaturesChunked
","text":" Bases: Transform
PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet
Common Usage:
to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\", event_time_column=\"date\")  # id_column and event_time_column may also be None\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\ncat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\nto_features.set_categorical_info(cat_column_info)\nto_features.add_chunk(df)\nto_features.add_chunk(df)\n...\nto_features.finalize()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
class PandasToFeaturesChunked(Transform):\n \"\"\"PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet\n\n Common Usage:\n ```python\n to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n cat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\n to_features.set_categorical_info(cat_column_info)\n to_features.add_chunk(df)\n to_features.add_chunk(df)\n ...\n to_features.finalize()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid)\n\n def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n\n def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? 
If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n\n def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.__init__","title":"__init__(output_uuid, id_column=None, event_time_column=None)
","text":"PandasToFeaturesChunked Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.add_chunk","title":"add_chunk(chunk_df)
","text":"Add a Chunk of Data to the FeatureSet
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n
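Since add_chunk takes one DataFrame at a time, the caller is responsible for producing the chunks. A minimal way to do that is sketched below; iter_chunks is a hypothetical helper, not part of SageWorks, and the data is invented.

```python
import pandas as pd


def iter_chunks(df: pd.DataFrame, chunk_size: int):
    # Yield consecutive row slices of at most chunk_size rows
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({'id': range(5), 'length': [0.1, 0.2, 0.3, 0.4, 0.5]})
chunk_sizes = [len(chunk) for chunk in iter_chunks(df, chunk_size=2)]
print(chunk_sizes)
```

Each yielded chunk could then be handed to to_features.add_chunk(chunk). As the source shows, only the first call triggers pre_transform, so the Feature Group is created once and later chunks are ingested into it.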
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any Post Transform Steps
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group with Chunked Data
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.set_categorical_info","title":"set_categorical_info(cat_column_info)
","text":"Set the Categorical Columns Args: cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.transform_impl","title":"transform_impl()
","text":"Required implementation of the Transform interface
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n
"},{"location":"core_classes/transforms/transform/","title":"Transform","text":"API Classes
The API Classes use Transforms internally. For example, model.to_endpoint() uses the ModelToEndpoint() transform. If you need more control over the Transform, you can use the Core Classes directly.
The SageWorks Transform class is an abstract base class that defines the API implemented by all the child classes (DataLoaders, DataSourceToFeatureSet, ModelToEndpoint, etc).
Transform: Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
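The contract between the base class and its children can be sketched in a few lines. This is a minimal illustration of the pattern only, not the SageWorks implementation: ToyTransform and UpperCaseTransform are hypothetical names, and the AWS and configuration setup done by the real constructor is left out.

```python
from abc import ABC, abstractmethod


class ToyTransform(ABC):
    '''Hypothetical stand-in for the Transform base class (no AWS calls)'''

    def __init__(self):
        self.calls = []  # Record the order in which the hooks run

    def pre_transform(self, **kwargs):
        # Optional hook with a default implementation
        self.calls.append('pre_transform')

    @abstractmethod
    def transform_impl(self, **kwargs):
        '''Child classes must implement the actual work'''

    @abstractmethod
    def post_transform(self, **kwargs):
        '''Child classes must finalize the output'''

    def transform(self, **kwargs):
        # Fixed call order: pre -> impl -> post
        self.pre_transform(**kwargs)
        self.transform_impl(**kwargs)
        self.post_transform(**kwargs)


class UpperCaseTransform(ToyTransform):
    '''Hypothetical child class that upper-cases a string'''

    def __init__(self, text):
        super().__init__()
        self.text = text
        self.result = None

    def transform_impl(self, **kwargs):
        self.calls.append('transform_impl')
        self.result = self.text.upper()

    def post_transform(self, **kwargs):
        self.calls.append('post_transform')


toy = UpperCaseTransform('abalone')
toy.transform()
print(toy.calls, toy.result)
```

Callers only invoke transform(); the base class fixes the order of the three steps, and a child class that omits either abstract method cannot be instantiated.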
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform","title":"Transform
","text":" Bases: ABC
Transform: Abstract Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
Source code in src/sageworks/core/transforms/transform.py
class Transform(ABC):\n \"\"\"Transform: Abstract Base Class for all transforms within SageWorks. Inherited Classes\n must implement the abstract transform_impl() method\"\"\"\n\n def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.aws_session.get_sageworks_execution_role_arn()\n self.boto3_session = self.aws_account_clamp.boto3_session\n self.sm_session = self.aws_account_clamp.sagemaker_session()\n self.sm_client = self.aws_account_clamp.sagemaker_client()\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n @abstractmethod\n def transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n\n def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform 
operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n\n @abstractmethod\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n\n def set_output_tags(self, tags: Union[list, str]):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (Union[list, str]): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n\n def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n\n @staticmethod\n def convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n\n def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n\n @final\n def transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n\n def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n\n def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n\n def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = 
input_uuid\n\n def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.__init__","title":"__init__(input_uuid, output_uuid)
","text":"Transform Initialization
Source code in src/sageworks/core/transforms/transform.py
def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.aws_session.get_sageworks_execution_role_arn()\n self.boto3_session = self.aws_account_clamp.boto3_session\n self.sm_session = self.aws_account_clamp.sagemaker_session()\n self.sm_client = self.aws_account_clamp.sagemaker_client()\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.add_output_meta","title":"add_output_meta(meta)
","text":"Add additional metadata that will be associated with the output artifact Args: meta (dict): A dictionary of metadata
Source code in src/sageworks/core/transforms/transform.py
def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n
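The merge above relies on the dictionary union operator, which needs Python 3.9 or later. A small sketch of its behavior, using made-up metadata keys and values: when both dictionaries share a key, the value from the right-hand side wins.

```python
# Hypothetical metadata, for illustration only
output_meta = {'sageworks_input': 'abalone_data'}
extra_meta = {'owner': 'test_user', 'sageworks_input': 'abalone_data_v2'}

# Union creates a new dict; keys in extra_meta override keys in output_meta
merged = output_meta | extra_meta
print(merged)
```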
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.convert_to_aws_tags","title":"convert_to_aws_tags(metadata)
staticmethod
","text":"Convert a dictionary to the AWS tag format (list of dicts) [ {Key: key_name, Value: value}, {..}, ...]
Source code in src/sageworks/core/transforms/transform.py
@staticmethod\ndef convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n
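As a quick illustration of the output shape, here is the same comprehension as a stand-alone function applied to a small dictionary. The metadata values are made up for the example.

```python
def convert_to_aws_tags(metadata: dict) -> list:
    '''Convert a dictionary into a list of Key/Value dictionaries'''
    return [{'Key': key, 'Value': value} for key, value in metadata.items()]


# Hypothetical metadata values, for illustration only
meta = {'sageworks_tags': 'abalone::regression', 'sageworks_input': 'abalone_data'}
aws_tags = convert_to_aws_tags(meta)
print(aws_tags)
```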
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.get_aws_tags","title":"get_aws_tags()
","text":"Get the metadata/tags and convert them into AWS Tag Format
Source code in src/sageworks/core/transforms/transform.py
def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.input_type","title":"input_type()
","text":"What Input Type does this Transform Consume
Source code in src/sageworks/core/transforms/transform.py
def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.output_type","title":"output_type()
","text":"What Output Type does this Transform Produce
Source code in src/sageworks/core/transforms/transform.py
def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.post_transform","title":"post_transform(**kwargs)
abstractmethod
","text":"Post-Transform ensures that the output Artifact is ready for use
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.pre_transform","title":"pre_transform(**kwargs)
","text":"Perform any Pre-Transform operations
Source code in src/sageworks/core/transforms/transform.py
def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_input_uuid","title":"set_input_uuid(input_uuid)
","text":"Set the Input UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_tags","title":"set_output_tags(tags)
","text":"Set the tags that will be associated with the output object Args: tags (Union[list, str]): The list of tags or a '::' separated string of tags
Source code in src/sageworks/core/transforms/transform.py
def set_output_tags(self, tags: Union[list, str]):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (Union[list, str]): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n
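Both accepted input forms end up as the same delimited string. The sketch below mirrors that logic in a stand-alone function; normalize_tags is a hypothetical name used only for this example, and the tag values are made up.

```python
TAG_DELIMITER = '::'


def normalize_tags(tags):
    '''Return tags as a single delimited string (hypothetical helper)'''
    if isinstance(tags, list):
        return TAG_DELIMITER.join(tags)
    return tags  # Already a string, assumed to be delimited correctly


from_list = normalize_tags(['abalone', 'regression'])
from_string = normalize_tags('abalone::regression')
print(from_list, from_string)
```

Note that a string input is stored unchanged, so the caller is responsible for using the same delimiter.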
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_uuid","title":"set_output_uuid(output_uuid)
","text":"Set the Output UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform","title":"transform(**kwargs)
","text":"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations
Source code in src/sageworks/core/transforms/transform.py
@final\ndef transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform_impl","title":"transform_impl(**kwargs)
abstractmethod
","text":"Abstract Method: Implement the Transformation from Input to Output
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformInput","title":"TransformInput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Inputs
Source code in src/sageworks/core/transforms/transform.py
class TransformInput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Inputs\"\"\"\n\n LOCAL_FILE = auto()\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformOutput","title":"TransformOutput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Outputs
Source code in src/sageworks/core/transforms/transform.py
class TransformOutput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Outputs\"\"\"\n\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n ENDPOINT = auto()\n
"},{"location":"core_classes/views/computation_view/","title":"Computation View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
Note: This class can be invoked automatically from the set_computation_columns() method of a DataSource or FeatureSet. If you need more control, you can use this class directly.
ComputationView Class: Create a View with a subset of columns for computation purposes
"},{"location":"core_classes/views/computation_view/#sageworks.core.views.computation_view.ComputationView","title":"ComputationView
","text":" Bases: ColumnSubsetView
ComputationView Class: Create a View with a subset of columns for computation purposes
Common Usage:\n# Create a default ComputationView\nfs = FeatureSet(\"test_features\")\ncomp_view = ComputationView.create(fs)\ndf = comp_view.pull_dataframe()\n\n# Create a ComputationView with a specific set of columns\ncomp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n
Source code in src/sageworks/core/views/computation_view.py
class ComputationView(ColumnSubsetView):\n \"\"\"ComputationView Class: Create a View with a subset of columns for computation purposes\n\n Common Usage:\n ```python\n # Create a default ComputationView\n fs = FeatureSet(\"test_features\")\n comp_view = ComputationView.create(fs)\n df = comp_view.pull_dataframe()\n\n # Create a ComputationView with a specific set of columns\n comp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/computation_view/#sageworks.core.views.computation_view.ComputationView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a ComputationView instance.
Parameters:
artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object. Required.
source_table (str): The table/view to create the view from. Defaults to None.
column_list (Union[list[str], None]): A list of columns to include. Defaults to None.
column_limit (int): The max number of columns to include. Defaults to 30.
Returns:
Union[View, None]: The created View object (or None if failed to create the view)
Source code in src/sageworks/core/views/computation_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/computation_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/display_view/","title":"Display View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
Note: This class will be used in the future to fine-tune which columns get displayed. For now, just use the set_computation_columns() method of a DataSource or FeatureSet.
DisplayView Class: Create a View with a subset of columns for display purposes
"},{"location":"core_classes/views/display_view/#sageworks.core.views.display_view.DisplayView","title":"DisplayView
","text":" Bases: ColumnSubsetView
DisplayView Class: Create a View with a subset of columns for display purposes
Common Usage:\n# Create a default DisplayView\nfs = FeatureSet(\"test_features\")\ndisplay_view = DisplayView.create(fs)\ndf = display_view.pull_dataframe()\n\n# Create a DisplayView with a specific set of columns\ndisplay_view = DisplayView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = display_view.query(f\"SELECT * FROM {display_view.table} where awesome = 'yes'\")\n
Source code in src/sageworks/core/views/display_view.py
class DisplayView(ColumnSubsetView):\n \"\"\"DisplayView Class: Create a View with a subset of columns for display purposes\n\n Common Usage:\n ```python\n # Create a default DisplayView\n fs = FeatureSet(\"test_features\")\n display_view = DisplayView.create(fs)\n df = display_view.pull_dataframe()\n\n # Create a DisplayView with a specific set of columns\n display_view = DisplayView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = display_view.query(f\"SELECT * FROM {display_view.table} where awesome = 'yes'\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a DisplayView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"display\" view name\n return ColumnSubsetView.create(\"display\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/display_view/#sageworks.core.views.display_view.DisplayView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a DisplayView instance.
Parameters:
artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object. Required.
source_table (str): The table/view to create the view from. Defaults to None.
column_list (Union[list[str], None]): A list of columns to include. Defaults to None.
column_limit (int): The max number of columns to include. Defaults to 30.
Returns:
Union[View, None]: The created View object (or None if failed to create the view)
Source code in src/sageworks/core/views/display_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a DisplayView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"display\" view name\n return ColumnSubsetView.create(\"display\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/display_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/mdq_view/","title":"ModelDataQuality View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
MDQView Class: A View that computes various endpoint data quality metrics
"},{"location":"core_classes/views/mdq_view/#sageworks.core.views.mdq_view.MDQView","title":"MDQView
","text":"MDQView Class: A View that computes various endpoint data quality metrics
Common Usage:\n# Grab a FeatureSet and an Endpoint\nfs = FeatureSet(\"abalone_features\")\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Create a ModelDataQuality View\nmdq_view = MDQView.create(fs, endpoint=endpoint, id_column=\"id\")\nmy_df = mdq_view.pull_dataframe(head=True)\n\n# Query the view\ndf = mdq_view.query(f\"SELECT * FROM {mdq_view.table} where residuals > 0.5\")\n
Source code in src/sageworks/core/views/mdq_view.py
class MDQView:\n \"\"\"MDQView Class: A View that computes various endpoint data quality metrics\n\n Common Usage:\n ```python\n # Grab a FeatureSet and an Endpoint\n fs = FeatureSet(\"abalone_features\")\n endpoint = Endpoint(\"abalone-regression-end\")\n\n # Create a ModelDataQuality View\n mdq_view = MDQView.create(fs, endpoint=endpoint, id_column=\"id\")\n my_df = mdq_view.pull_dataframe(head=True)\n\n # Query the view\n df = mdq_view.query(f\"SELECT * FROM {mdq_view.table} where residuals > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n fs: FeatureSet,\n endpoint: Endpoint,\n id_column: str,\n use_reference_model: bool = False,\n ) -> Union[View, None]:\n \"\"\"Create a Model Data Quality View with metrics\n\n Args:\n fs (FeatureSet): The FeatureSet object\n endpoint (Endpoint): The Endpoint object to use for the target and features\n id_column (str): The name of the id column (must be defined for join logic)\n use_reference_model (bool): Use the reference model for inference (default: False)\n\n Returns:\n Union[View, None]: The created View object (or None if failed)\n \"\"\"\n # Log view creation\n fs.log.important(\"Creating Model Data Quality View...\")\n\n # Get the target and feature columns from the endpoints model input\n model_input = Model(endpoint.get_input())\n target = model_input.target()\n features = model_input.features()\n\n # Pull in data from the source table\n df = fs.data_source.query(f\"SELECT * FROM {fs.data_source.uuid}\")\n\n # Check if the target and features are available in the data source\n missing_columns = [col for col in [target] + features if col not in df.columns]\n if missing_columns:\n fs.log.error(f\"Missing columns in data source: {missing_columns}\")\n return None\n\n # Check if the target is categorical\n categorical_target = not pd.api.types.is_numeric_dtype(df[target])\n\n # Compute row tags with RowTagger\n row_tagger = RowTagger(\n df,\n features=features,\n id_column=id_column,\n 
target_column=target,\n within_dist=0.25,\n min_target_diff=1.0,\n outlier_df=fs.data_source.outliers(),\n categorical_target=categorical_target,\n )\n mdq_df = row_tagger.tag_rows()\n\n # Rename and compute data quality scores based on tags\n mdq_df.rename(columns={\"tags\": \"data_quality_tags\"}, inplace=True)\n\n # We're going to compute a data_quality score based on the tags.\n mdq_df[\"data_quality\"] = mdq_df[\"data_quality_tags\"].apply(cls.calculate_data_quality)\n\n # Compute residuals using ResidualsCalculator\n if use_reference_model:\n residuals_calculator = ResidualsCalculator()\n else:\n residuals_calculator = ResidualsCalculator(endpoint=endpoint)\n residuals_df = residuals_calculator.fit_transform(df[features], df[target])\n\n # Add id_column to the residuals dataframe and merge with mdq_df\n residuals_df[id_column] = df[id_column]\n\n # Drop overlapping columns in mdq_df (except for the id_column) to avoid _x and _y suffixes\n overlap_columns = [col for col in residuals_df.columns if col in mdq_df.columns and col != id_column]\n mdq_df = mdq_df.drop(columns=overlap_columns)\n\n # Merge the DataFrames, with the id_column as the join key\n mdq_df = mdq_df.merge(residuals_df, on=id_column, how=\"left\")\n\n # Delegate view creation to PandasToView\n view_name = \"mdq_ref\" if use_reference_model else \"mdq\"\n return PandasToView.create(view_name, fs, df=mdq_df, id_column=id_column)\n\n @staticmethod\n def calculate_data_quality(tags):\n score = 1.0 # Start with the default score\n if \"coincident\" in tags:\n score -= 1.0\n if \"htg\" in tags:\n score -= 0.5\n if \"outlier\" in tags:\n score -= 0.25\n score = max(0.0, score)\n return score\n
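To make the scoring in calculate_data_quality() concrete, the sketch below repeats the same arithmetic as a stand-alone function and evaluates a few tag combinations. The tags are assumed here to arrive as a list of strings; that input shape is an assumption of this example, not something the source states.

```python
def calculate_data_quality(tags):
    '''Same penalties as the method above, as a stand-alone function'''
    score = 1.0  # Start with the default score
    if 'coincident' in tags:
        score -= 1.0
    if 'htg' in tags:
        score -= 0.5
    if 'outlier' in tags:
        score -= 0.25
    return max(0.0, score)  # Never go below zero


# Assumed input shape: a list of tag strings per row
examples = [[], ['outlier'], ['htg'], ['htg', 'outlier'], ['coincident'], ['coincident', 'htg']]
scores = [calculate_data_quality(tags) for tags in examples]
print(scores)
```

Any row tagged coincident scores 0.0 regardless of its other tags, because the result is clamped at zero.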
"},{"location":"core_classes/views/mdq_view/#sageworks.core.views.mdq_view.MDQView.create","title":"create(fs, endpoint, id_column, use_reference_model=False)
classmethod
","text":"Create a Model Data Quality View with metrics
Parameters:
fs (FeatureSet): The FeatureSet object. Required.
endpoint (Endpoint): The Endpoint object to use for the target and features. Required.
id_column (str): The name of the id column (must be defined for join logic). Required.
use_reference_model (bool): Use the reference model for inference. Defaults to False.
Returns:
Union[View, None]: The created View object (or None if failed)
Source code in src/sageworks/core/views/mdq_view.py
@classmethod\ndef create(\n cls,\n fs: FeatureSet,\n endpoint: Endpoint,\n id_column: str,\n use_reference_model: bool = False,\n) -> Union[View, None]:\n \"\"\"Create a Model Data Quality View with metrics\n\n Args:\n fs (FeatureSet): The FeatureSet object\n endpoint (Endpoint): The Endpoint object to use for the target and features\n id_column (str): The name of the id column (must be defined for join logic)\n use_reference_model (bool): Use the reference model for inference (default: False)\n\n Returns:\n Union[View, None]: The created View object (or None if failed)\n \"\"\"\n # Log view creation\n fs.log.important(\"Creating Model Data Quality View...\")\n\n # Get the target and feature columns from the endpoints model input\n model_input = Model(endpoint.get_input())\n target = model_input.target()\n features = model_input.features()\n\n # Pull in data from the source table\n df = fs.data_source.query(f\"SELECT * FROM {fs.data_source.uuid}\")\n\n # Check if the target and features are available in the data source\n missing_columns = [col for col in [target] + features if col not in df.columns]\n if missing_columns:\n fs.log.error(f\"Missing columns in data source: {missing_columns}\")\n return None\n\n # Check if the target is categorical\n categorical_target = not pd.api.types.is_numeric_dtype(df[target])\n\n # Compute row tags with RowTagger\n row_tagger = RowTagger(\n df,\n features=features,\n id_column=id_column,\n target_column=target,\n within_dist=0.25,\n min_target_diff=1.0,\n outlier_df=fs.data_source.outliers(),\n categorical_target=categorical_target,\n )\n mdq_df = row_tagger.tag_rows()\n\n # Rename and compute data quality scores based on tags\n mdq_df.rename(columns={\"tags\": \"data_quality_tags\"}, inplace=True)\n\n # We're going to compute a data_quality score based on the tags.\n mdq_df[\"data_quality\"] = mdq_df[\"data_quality_tags\"].apply(cls.calculate_data_quality)\n\n # Compute residuals using ResidualsCalculator\n if 
use_reference_model:\n residuals_calculator = ResidualsCalculator()\n else:\n residuals_calculator = ResidualsCalculator(endpoint=endpoint)\n residuals_df = residuals_calculator.fit_transform(df[features], df[target])\n\n # Add id_column to the residuals dataframe and merge with mdq_df\n residuals_df[id_column] = df[id_column]\n\n # Drop overlapping columns in mdq_df (except for the id_column) to avoid _x and _y suffixes\n overlap_columns = [col for col in residuals_df.columns if col in mdq_df.columns and col != id_column]\n mdq_df = mdq_df.drop(columns=overlap_columns)\n\n # Merge the DataFrames, with the id_column as the join key\n mdq_df = mdq_df.merge(residuals_df, on=id_column, how=\"left\")\n\n # Delegate view creation to PandasToView\n view_name = \"mdq_ref\" if use_reference_model else \"mdq\"\n return PandasToView.create(view_name, fs, df=mdq_df, id_column=id_column)\n
"},{"location":"core_classes/views/mdq_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/overview/","title":"Views","text":"View Examples
Examples of using the Views classes to extend the functionality of SageWorks Artifacts are in the Examples section at the bottom of this page.
Views are a powerful way to filter and augment your DataSources and FeatureSets. With Views you can subset columns, subset rows, and even add data to existing SageWorks Artifacts. If you want to compute outliers, run some statistics, or engineer new features, Views are an easy way to modify and add to DataSources and FeatureSets.
If you're looking to read and pull data from a view please see the Views documentation.
"},{"location":"core_classes/views/overview/#view-constructor-classes","title":"View Constructor Classes","text":"These classes provide APIs for creating Views for DataSources and FeatureSets.
All of the SageWorks Examples are in the SageWorks Repository under the examples/ directory. For a full code listing of any example, please visit our SageWorks Examples.
Listing Views
views.pyfrom sageworks.api.data_source import DataSource\n\n# List the views available on a DataSource\ntest_data = DataSource('test_data')\ntest_data.views()\n[\"display\", \"training\", \"computation\"]\n
Getting a Particular View
views.pyfrom sageworks.api.feature_set import FeatureSet\n\nfs = FeatureSet('test_features')\n\n# Grab the columns for the display view\ndisplay_view = fs.view(\"display\")\ndisplay_view.columns\n['id', 'name', 'height', 'weight', 'salary', ...]\n\n# Pull the dataframe for this view\ndf = display_view.pull_dataframe()\n id name height weight salary ...\n0 58 Person 58 71.781227 275.088196 162053.140625 \n
View Queries
All SageWorks Views are stored in AWS Athena, so any query that you can make with Athena is accessible through the View Query API.
view_query.pyfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet View\nfs = FeatureSet(\"abalone_features\")\nt_view = fs.view(\"training\")\n\n# Make some queries using the Athena backend\ndf = t_view.query(f\"select * from {t_view.table} where height > .3\")\nprint(df.head())\n\ndf = t_view.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"core_classes/views/training_view/","title":"Training View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
TrainingView Class: A View with an additional training column that marks holdout ids
"},{"location":"core_classes/views/training_view/#sageworks.core.views.training_view.TrainingView","title":"TrainingView
","text":" Bases: CreateView
TrainingView Class: A View with an additional training column that marks holdout ids
Common Usage# Create a default TrainingView\nfs = FeatureSet(\"test_features\")\ntraining_view = TrainingView.create(fs)\ndf = training_view.pull_dataframe()\n\n# Create a TrainingView with a specific set of columns\ntraining_view = TrainingView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = training_view.query(f\"SELECT * FROM {training_view.table} where training = TRUE\")\n
Source code in src/sageworks/core/views/training_view.py
class TrainingView(CreateView):\n \"\"\"TrainingView Class: A View with an additional training column that marks holdout ids\n\n Common Usage:\n ```python\n # Create a default TrainingView\n fs = FeatureSet(\"test_features\")\n training_view = TrainingView.create(fs)\n df = training_view.pull_dataframe()\n\n # Create a TrainingView with a specific set of columns\n training_view = TrainingView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = training_view.query(f\"SELECT * FROM {training_view.table} where training = TRUE\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n feature_set: FeatureSet,\n source_table: str = None,\n id_column: str = None,\n holdout_ids: Union[list[str], list[int], None] = None,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a TrainingView instance.\n\n Args:\n feature_set (FeatureSet): A FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None.\n id_column (str, optional): The name of the id column. Defaults to None.\n holdout_ids (Union[list[str], list[int], None], optional): A list of holdout ids. 
Defaults to None.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Instantiate the TrainingView with \"training\" as the view name\n instance = cls(\"training\", feature_set, source_table)\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n source_table_columns = get_column_list(instance.data_source, instance.source_table)\n column_list = [col for col in source_table_columns if col not in aws_cols]\n\n # Sanity check on the id column\n if not id_column:\n instance.log.important(\"No id column specified, we'll try the auto_id_column ..\")\n if not instance.auto_id_column:\n instance.log.error(\"No id column specified and no auto_id_column found, aborting ..\")\n return None\n else:\n if instance.auto_id_column not in column_list:\n instance.log.error(\n f\"Auto id column {instance.auto_id_column} not found in column list, aborting ..\"\n )\n return None\n else:\n id_column = instance.auto_id_column\n\n # If we don't have holdout ids, create a default training view\n if not holdout_ids:\n instance._default_training_view(instance.data_source, id_column)\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # Format the list of holdout ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {instance.table} AS\n SELECT {sql_columns}, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN False\n ELSE True\n END AS training\n FROM {instance.source_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n 
instance.data_source.execute_statement(create_view_query)\n\n # Return the View\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # This is an internal method that's used to create a default training view\n def _default_training_view(self, data_source: DataSource, id_column: str):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\n\n Args:\n data_source (DataSource): The SageWorks DataSource object\n id_column (str): The name of the id column\n \"\"\"\n self.log.important(f\"Creating default Training View {self.table}...\")\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n column_list = [col for col in data_source.columns if col not in aws_cols]\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW \"{self.table}\" AS\n SELECT {sql_columns}, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {id_column}), 10) < 8 THEN True -- Assign 80% to training\n ELSE False -- Assign roughly 20% to validation/test\n END AS training\n FROM {self.base_table_name}\n \"\"\"\n\n # Execute the CREATE VIEW query\n data_source.execute_statement(create_view_query)\n
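The default split above relies on MOD(ROW_NUMBER() OVER (ORDER BY id), 10) < 8. The following is a minimal pure-Python sketch of that same arithmetic, assuming ids can be sorted the way Athena would order them; the function name default_training_flags is hypothetical and is not part of SageWorks.

```python
def default_training_flags(ids):
    # Hypothetical helper (not a SageWorks API): mirrors the SQL
    # MOD(ROW_NUMBER() OVER (ORDER BY id), 10) < 8 expression.
    flags = {}
    # ROW_NUMBER() starts at 1, so enumerate from 1 over the sorted ids
    for row_number, row_id in enumerate(sorted(ids), start=1):
        # Remainders 0-7 go to training, remainders 8 and 9 are held out
        flags[row_id] = (row_number % 10) < 8
    return flags


flags = default_training_flags(range(100, 200))
training_count = sum(flags.values())
print(training_count, len(flags) - training_count)  # prints: 80 20
```

Because the assignment depends only on the ordering of the id column, the split is deterministic: re-creating the view over the same rows gives the same training and holdout rows, and in every block of ten ordered rows the 8th and 9th are held out.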
"},{"location":"core_classes/views/training_view/#sageworks.core.views.training_view.TrainingView.create","title":"create(feature_set, source_table=None, id_column=None, holdout_ids=None)
classmethod
","text":"Factory method to create and return a TrainingView instance.
Parameters:
Name Type Description Defaultfeature_set
FeatureSet
A FeatureSet object
requiredsource_table
str
The table/view to create the view from. Defaults to None.
None
id_column
str
The name of the id column. Defaults to None.
None
holdout_ids
Union[list[str], list[int], None]
A list of holdout ids. Defaults to None.
None
Returns:
Type DescriptionUnion[View, None]
Union[View, None]: The created View object (or None if failed to create the view)
Source code in src/sageworks/core/views/training_view.py
@classmethod\ndef create(\n cls,\n feature_set: FeatureSet,\n source_table: str = None,\n id_column: str = None,\n holdout_ids: Union[list[str], list[int], None] = None,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a TrainingView instance.\n\n Args:\n feature_set (FeatureSet): A FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None.\n id_column (str, optional): The name of the id column. Defaults to None.\n holdout_ids (Union[list[str], list[int], None], optional): A list of holdout ids. Defaults to None.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Instantiate the TrainingView with \"training\" as the view name\n instance = cls(\"training\", feature_set, source_table)\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n source_table_columns = get_column_list(instance.data_source, instance.source_table)\n column_list = [col for col in source_table_columns if col not in aws_cols]\n\n # Sanity check on the id column\n if not id_column:\n instance.log.important(\"No id column specified, we'll try the auto_id_column ..\")\n if not instance.auto_id_column:\n instance.log.error(\"No id column specified and no auto_id_column found, aborting ..\")\n return None\n else:\n if instance.auto_id_column not in column_list:\n instance.log.error(\n f\"Auto id column {instance.auto_id_column} not found in column list, aborting ..\"\n )\n return None\n else:\n id_column = instance.auto_id_column\n\n # If we don't have holdout ids, create a default training view\n if not holdout_ids:\n instance._default_training_view(instance.data_source, id_column)\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # Format the list of holdout ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n 
formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {instance.table} AS\n SELECT {sql_columns}, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN False\n ELSE True\n END AS training\n FROM {instance.source_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n instance.data_source.execute_statement(create_view_query)\n\n # Return the View\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n
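To make the holdout formatting step concrete, here is a small standalone sketch of how a list of ids becomes the body of a SQL IN clause and then the CASE expression used in the view. The helper names are hypothetical, and the doubling of embedded single quotes is an added precaution (standard SQL string escaping) that the source above does not perform.

```python
SINGLE_QUOTE = chr(39)  # the SQL string delimiter character


def format_holdout_ids(holdout_ids):
    # Hypothetical helper: string ids are quoted, numeric ids are not
    if all(isinstance(item, str) for item in holdout_ids):
        return ', '.join(
            # Assumption: double any embedded quote so the literal stays valid SQL
            SINGLE_QUOTE + item.replace(SINGLE_QUOTE, SINGLE_QUOTE * 2) + SINGLE_QUOTE
            for item in holdout_ids
        )
    return ', '.join(str(item) for item in holdout_ids)


def training_case_expression(id_column, holdout_ids):
    # Rows whose id is in the holdout list get training = False
    id_list = format_holdout_ids(holdout_ids)
    return f'CASE WHEN {id_column} IN ({id_list}) THEN False ELSE True END AS training'


print(training_case_expression('id', [1, 2, 3]))
# prints: CASE WHEN id IN (1, 2, 3) THEN False ELSE True END AS training
```

Note that a list mixing strings and integers falls through to the unquoted branch, as it does in the source, so callers would likely want to pass ids of a single type.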
"},{"location":"core_classes/views/training_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"data_algorithms/overview/","title":"Data Algorithms","text":"Data Algorithms
WIP: These classes are under active development and are subject to change in both API and functionality over time. They provide a set of data algorithms for various types of data storage. We currently have subdirectories for:
SQL: SQL queries that provide a wide range of functionality:
Welcome to the SageWorks Data Algorithms
Docs TBD
"},{"location":"data_algorithms/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"data_algorithms/dataframes/overview/","title":"Pandas Dataframe Algorithms","text":"Pandas Dataframes
Pandas dataframes are obviously not going to scale as well as our Spark and SQL Algorithms, but for 'moderate' sized data these algorithms provide some nice functionality.
Pandas Dataframe Algorithms
SageWorks has a growing set of algorithms and data processing tools for Pandas Dataframes. In general these algorithms will take a dataframe as input and give you back a dataframe with additional columns.
FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.
DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity","title":"FeatureSpaceProximity
","text":"Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
class FeatureSpaceProximity:\n def __init__(self, df: pd.DataFrame, features: list, id_column: str, target: str = None, neighbors: int = 10):\n \"\"\"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.\n\n Args:\n df: Pandas DataFrame\n features: List of feature column names\n id_column: Name of the ID column\n target: Optional name of the target column to include target-based functionality (default: None)\n neighbors: Number of neighbors to use in the KNN model (default: 10)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.df = df\n self.features = features\n self.id_column = id_column\n self.target = target\n self.knn_neighbors = neighbors\n\n # Standardize the feature values and build the KNN model\n self.log.info(\"Building KNN model for FeatureSpaceProximity...\")\n self.scaler = StandardScaler().fit(df[features])\n scaled_features = self.scaler.transform(df[features])\n self.knn_model = NearestNeighbors(n_neighbors=neighbors, algorithm=\"auto\").fit(scaled_features)\n\n # Compute Z-Scores or Consistency Scores for the target values\n if self.target and is_numeric_dtype(self.df[self.target]):\n self.log.info(\"Computing Z-Scores for target values...\")\n self.target_z_scores()\n else:\n self.log.info(\"Computing target consistency scores...\")\n self.target_consistency()\n\n # Now compute the outlier scores\n self.log.info(\"Computing outlier scores...\")\n self.outliers()\n\n @classmethod\n def from_model(cls, model) -> \"FeatureSpaceProximity\":\n \"\"\"Create a FeatureSpaceProximity instance from a SageWorks model object.\n\n Args:\n model (Model): A SageWorks model object.\n\n Returns:\n FeatureSpaceProximity: A new instance of the FeatureSpaceProximity class.\n \"\"\"\n from sageworks.api import FeatureSet\n\n # Extract necessary attributes from the SageWorks model\n fs = FeatureSet(model.get_input())\n features = model.features()\n target = model.target()\n\n # Retrieve the training DataFrame from 
the feature set\n df = fs.view(\"training\").pull_dataframe()\n\n # Create and return a new instance of FeatureSpaceProximity\n return cls(df=df, features=features, id_column=fs.id_column, target=target)\n\n def neighbors(self, query_id: Union[str, int], radius: float = None, include_self: bool = True) -> pd.DataFrame:\n \"\"\"Return neighbors of the given query ID, either by fixed neighbors or within a radius.\n\n Args:\n query_id (Union[str, int]): The ID of the query point.\n radius (float): Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self (bool): Whether to include the query ID itself in the neighbor results.\n\n Returns:\n pd.DataFrame: Filtered DataFrame that includes the query ID, its neighbors, and optionally target values.\n \"\"\"\n if query_id not in self.df[self.id_column].values:\n self.log.warning(f\"Query ID '{query_id}' not found in the DataFrame. Returning an empty DataFrame.\")\n return pd.DataFrame()\n\n # Get a single-row DataFrame for the query ID\n query_df = self.df[self.df[self.id_column] == query_id]\n\n # Use the neighbors_bulk method with the appropriate radius\n neighbors_info_df = self.neighbors_bulk(query_df, radius=radius, include_self=include_self)\n\n # Extract the neighbor IDs and distances from the results\n neighbor_ids = neighbors_info_df[\"neighbor_ids\"].iloc[0]\n neighbor_distances = neighbors_info_df[\"neighbor_distances\"].iloc[0]\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids, neighbor_distances), key=lambda x: x[1])\n sorted_ids, sorted_distances = zip(*sorted_neighbors)\n\n # Filter the internal DataFrame to include only the sorted neighbors\n neighbors_df = self.df[self.df[self.id_column].isin(sorted_ids)]\n neighbors_df = neighbors_df.set_index(self.id_column).reindex(sorted_ids).reset_index()\n neighbors_df[\"knn_distance\"] = sorted_distances\n return neighbors_df\n\n def neighbors_bulk(self, query_df: pd.DataFrame, 
radius: float = None, include_self: bool = False) -> pd.DataFrame:\n \"\"\"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.\n\n Args:\n query_df: Pandas DataFrame with the same features as the training data.\n radius: Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self: Boolean indicating whether to include the query ID in the neighbor results.\n\n Returns:\n pd.DataFrame: DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances.\n \"\"\"\n # Scale the query data using the same scaler as the training data\n query_scaled = self.scaler.transform(query_df[self.features])\n\n # Retrieve neighbors based on radius or standard neighbors\n if radius is not None:\n distances, indices = self.knn_model.radius_neighbors(query_scaled, radius=radius)\n else:\n distances, indices = self.knn_model.kneighbors(query_scaled)\n\n # Collect neighbor information (IDs, target values, and distances)\n query_ids = query_df[self.id_column].values\n neighbor_ids = [[self.df.iloc[idx][self.id_column] for idx in index_list] for index_list in indices]\n neighbor_targets = (\n [\n [self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0] for neighbor in index_list]\n for index_list in neighbor_ids\n ]\n if self.target\n else None\n )\n neighbor_distances = [list(dist_list) for dist_list in distances]\n\n # Automatically remove the query ID itself from the neighbor results if include_self is False\n for i, query_id in enumerate(query_ids):\n if query_id in neighbor_ids[i] and not include_self:\n idx_to_remove = neighbor_ids[i].index(query_id)\n neighbor_ids[i].pop(idx_to_remove)\n neighbor_distances[i].pop(idx_to_remove)\n if neighbor_targets:\n neighbor_targets[i].pop(idx_to_remove)\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids[i], neighbor_distances[i]), key=lambda x: x[1])\n neighbor_ids[i], 
neighbor_distances[i] = list(zip(*sorted_neighbors)) if sorted_neighbors else ([], [])\n if neighbor_targets:\n neighbor_targets[i] = [\n self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0]\n for neighbor in neighbor_ids[i]\n ]\n\n # Create and return a results DataFrame with the updated neighbor information\n result_df = pd.DataFrame(\n {\n \"query_id\": query_ids,\n \"neighbor_ids\": neighbor_ids,\n \"neighbor_distances\": neighbor_distances,\n }\n )\n\n if neighbor_targets:\n result_df[\"neighbor_targets\"] = neighbor_targets\n\n return result_df\n\n def outliers(self) -> None:\n \"\"\"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.\"\"\"\n if \"target_z\" in self.df.columns:\n # Normalize Z-Scores to a 0-1 range\n self.df[\"outlier\"] = (self.df[\"target_z\"].abs() / (self.df[\"target_z\"].abs().max() + 1e-6)).clip(0, 1)\n\n elif \"target_consistency\" in self.df.columns:\n # Calculate outlier score as 1 - consistency\n self.df[\"outlier\"] = 1 - self.df[\"target_consistency\"]\n\n else:\n self.log.warning(\"No 'target_z' or 'target_consistency' column found to compute outlier scores.\")\n\n def target_z_scores(self) -> None:\n \"\"\"Compute Z-Scores for NUMERIC target values.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for Z-Score computation.\")\n return\n\n # Get the neighbors and distances for each internal observation\n distances, indices = self.knn_model.kneighbors()\n\n # Retrieve all neighbor target values in a single operation\n neighbor_targets = self.df[self.target].values[indices] # Shape will be (n_samples, n_neighbors)\n\n # Compute the mean and std along the neighbors axis (axis=1)\n neighbor_means = neighbor_targets.mean(axis=1)\n neighbor_stds = neighbor_targets.std(axis=1, ddof=0)\n\n # Vectorized Z-score calculation\n current_targets = self.df[self.target].values\n z_scores = np.where(neighbor_stds == 0, 0.0, (current_targets - neighbor_means) / 
neighbor_stds)\n\n # Assign the computed Z-Scores back to the DataFrame\n self.df[\"target_z\"] = z_scores\n\n def target_consistency(self) -> None:\n \"\"\"Compute a Neighborhood Consistency Score for CATEGORICAL targets.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for neighborhood consistency computation.\")\n return\n\n # Get the neighbors and distances for each internal observation (already excludes the query)\n distances, indices = self.knn_model.kneighbors()\n\n # Calculate the Neighborhood Consistency Score for each observation\n consistency_scores = []\n for idx, idx_list in enumerate(indices):\n query_target = self.df.iloc[idx][self.target] # Get current observation's target value\n\n # Get the neighbors' target values\n neighbor_targets = self.df.iloc[idx_list][self.target]\n\n # Calculate the proportion of neighbors that have the same category as the query observation\n consistency_score = (neighbor_targets == query_target).mean()\n consistency_scores.append(consistency_score)\n\n # Add the 'target_consistency' column to the internal dataframe\n self.df[\"target_consistency\"] = consistency_scores\n\n def get_neighbor_indices_and_distances(self):\n \"\"\"Retrieve neighbor indices and distances for all points in the dataset.\"\"\"\n distances, indices = self.knn_model.kneighbors()\n return indices, distances\n\n def target_summary(self, query_id: Union[str, int]) -> pd.DataFrame:\n \"\"\"WIP: Provide a summary of target values in the neighborhood of the given query ID\"\"\"\n neighbors_df = self.neighbors(query_id, include_self=False)\n if self.target and not neighbors_df.empty:\n summary_stats = neighbors_df[self.target].describe()\n return pd.DataFrame(summary_stats).transpose()\n else:\n self.log.warning(f\"No target values found for neighbors of Query ID '{query_id}'.\")\n return pd.DataFrame()\n
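The two target scores computed by the class can be illustrated without scikit-learn. The sketch below assumes the neighbor indices have already been found (here they are hard-coded) and uses only the standard library; the function names are hypothetical and this is a simplified illustration, not the SageWorks implementation.

```python
from statistics import mean, pstdev


def neighbor_z_scores(targets, neighbor_indices):
    # Z-score of each numeric target relative to its neighbors' targets
    scores = []
    for value, idx_list in zip(targets, neighbor_indices):
        neighbor_values = [targets[i] for i in idx_list]
        spread = pstdev(neighbor_values)  # population std, matching ddof=0
        # A zero spread would divide by zero, so report 0.0 as the source does
        scores.append(0.0 if spread == 0 else (value - mean(neighbor_values)) / spread)
    return scores


def neighbor_consistency(labels, neighbor_indices):
    # Fraction of neighbors that share each row's categorical label
    return [
        mean([1.0 if labels[i] == label else 0.0 for i in idx_list])
        for label, idx_list in zip(labels, neighbor_indices)
    ]


neighbor_indices = [[1, 2], [0, 2], [0, 1], [1, 2]]
print(neighbor_z_scores([1.0, 2.0, 3.0, 10.0], neighbor_indices))
# prints: [-3.0, 0.0, 3.0, 15.0]
print(neighbor_consistency(['a', 'a', 'b', 'a'], neighbor_indices))
# prints: [0.5, 0.5, 0.0, 0.5]
```

The last row (target 10.0) sits far from its neighbors' values of 2.0 and 3.0, which is why it receives the largest z-score and would end up with the highest normalized outlier score.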
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.__init__","title":"__init__(df, features, id_column, target=None, neighbors=10)
","text":"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.
Parameters:
Name Type Description Defaultdf
DataFrame
Pandas DataFrame
requiredfeatures
list
List of feature column names
requiredid_column
str
Name of the ID column
requiredtarget
str
Optional name of the target column to include target-based functionality (default: None)
None
neighbors
int
Number of neighbors to use in the KNN model (default: 10)
10
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def __init__(self, df: pd.DataFrame, features: list, id_column: str, target: str = None, neighbors: int = 10):\n \"\"\"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.\n\n Args:\n df: Pandas DataFrame\n features: List of feature column names\n id_column: Name of the ID column\n target: Optional name of the target column to include target-based functionality (default: None)\n neighbors: Number of neighbors to use in the KNN model (default: 10)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.df = df\n self.features = features\n self.id_column = id_column\n self.target = target\n self.knn_neighbors = neighbors\n\n # Standardize the feature values and build the KNN model\n self.log.info(\"Building KNN model for FeatureSpaceProximity...\")\n self.scaler = StandardScaler().fit(df[features])\n scaled_features = self.scaler.transform(df[features])\n self.knn_model = NearestNeighbors(n_neighbors=neighbors, algorithm=\"auto\").fit(scaled_features)\n\n # Compute Z-Scores or Consistency Scores for the target values\n if self.target and is_numeric_dtype(self.df[self.target]):\n self.log.info(\"Computing Z-Scores for target values...\")\n self.target_z_scores()\n else:\n self.log.info(\"Computing target consistency scores...\")\n self.target_consistency()\n\n # Now compute the outlier scores\n self.log.info(\"Computing outlier scores...\")\n self.outliers()\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.from_model","title":"from_model(model)
classmethod
","text":"Create a FeatureSpaceProximity instance from a SageWorks model object.
Parameters:
Name Type Description Defaultmodel
Model
A SageWorks model object.
requiredReturns:
Name Type DescriptionFeatureSpaceProximity
FeatureSpaceProximity
A new instance of the FeatureSpaceProximity class.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
@classmethod\ndef from_model(cls, model) -> \"FeatureSpaceProximity\":\n \"\"\"Create a FeatureSpaceProximity instance from a SageWorks model object.\n\n Args:\n model (Model): A SageWorks model object.\n\n Returns:\n FeatureSpaceProximity: A new instance of the FeatureSpaceProximity class.\n \"\"\"\n from sageworks.api import FeatureSet\n\n # Extract necessary attributes from the SageWorks model\n fs = FeatureSet(model.get_input())\n features = model.features()\n target = model.target()\n\n # Retrieve the training DataFrame from the feature set\n df = fs.view(\"training\").pull_dataframe()\n\n # Create and return a new instance of FeatureSpaceProximity\n return cls(df=df, features=features, id_column=fs.id_column, target=target)\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.get_neighbor_indices_and_distances","title":"get_neighbor_indices_and_distances()
","text":"Retrieve neighbor indices and distances for all points in the dataset.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def get_neighbor_indices_and_distances(self):\n \"\"\"Retrieve neighbor indices and distances for all points in the dataset.\"\"\"\n distances, indices = self.knn_model.kneighbors()\n return indices, distances\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.neighbors","title":"neighbors(query_id, radius=None, include_self=True)
","text":"Return neighbors of the given query ID, either by fixed neighbors or within a radius.
Parameters:
Name Type Description Defaultquery_id
Union[str, int]
The ID of the query point.
requiredradius
float
Optional radius within which neighbors are to be searched, else use fixed neighbors.
None
include_self
bool
Whether to include the query ID itself in the neighbor results.
True
Returns:
Type DescriptionDataFrame
pd.DataFrame: Filtered DataFrame that includes the query ID, its neighbors, and optionally target values.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def neighbors(self, query_id: Union[str, int], radius: float = None, include_self: bool = True) -> pd.DataFrame:\n \"\"\"Return neighbors of the given query ID, either by fixed neighbors or within a radius.\n\n Args:\n query_id (Union[str, int]): The ID of the query point.\n radius (float): Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self (bool): Whether to include the query ID itself in the neighbor results.\n\n Returns:\n pd.DataFrame: Filtered DataFrame that includes the query ID, its neighbors, and optionally target values.\n \"\"\"\n if query_id not in self.df[self.id_column].values:\n self.log.warning(f\"Query ID '{query_id}' not found in the DataFrame. Returning an empty DataFrame.\")\n return pd.DataFrame()\n\n # Get a single-row DataFrame for the query ID\n query_df = self.df[self.df[self.id_column] == query_id]\n\n # Use the neighbors_bulk method with the appropriate radius\n neighbors_info_df = self.neighbors_bulk(query_df, radius=radius, include_self=include_self)\n\n # Extract the neighbor IDs and distances from the results\n neighbor_ids = neighbors_info_df[\"neighbor_ids\"].iloc[0]\n neighbor_distances = neighbors_info_df[\"neighbor_distances\"].iloc[0]\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids, neighbor_distances), key=lambda x: x[1])\n sorted_ids, sorted_distances = zip(*sorted_neighbors)\n\n # Filter the internal DataFrame to include only the sorted neighbors\n neighbors_df = self.df[self.df[self.id_column].isin(sorted_ids)]\n neighbors_df = neighbors_df.set_index(self.id_column).reindex(sorted_ids).reset_index()\n neighbors_df[\"knn_distance\"] = sorted_distances\n return neighbors_df\n
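The sort-by-distance step in neighbors() unpacks zip(*sorted_neighbors), which raises a ValueError when the neighbor list is empty (for example, a radius search that finds nothing once the query itself is excluded). A minimal sketch of the same step with an explicit guard is shown below; sort_neighbors is a hypothetical helper name, not part of the class.

```python
def sort_neighbors(neighbor_ids, neighbor_distances):
    # Pair each id with its distance and sort ascending by distance
    pairs = sorted(zip(neighbor_ids, neighbor_distances), key=lambda pair: pair[1])
    if not pairs:
        # Nothing to unpack: return empty lists instead of raising
        return [], []
    sorted_ids, sorted_distances = zip(*pairs)
    return list(sorted_ids), list(sorted_distances)


print(sort_neighbors(['a', 'b', 'c'], [0.3, 0.1, 0.2]))
# prints: (['b', 'c', 'a'], [0.1, 0.2, 0.3])
print(sort_neighbors([], []))
# prints: ([], [])
```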
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.neighbors_bulk","title":"neighbors_bulk(query_df, radius=None, include_self=False)
","text":"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.
Parameters:
Name | Type | Description | Default
query_df | DataFrame | Pandas DataFrame with the same features as the training data. | required
radius | float | Optional radius within which neighbors are to be searched, else use fixed neighbors. | None
include_self | bool | Boolean indicating whether to include the query ID in the neighbor results. | False
Returns:
Type | Description
DataFrame | pd.DataFrame: DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def neighbors_bulk(self, query_df: pd.DataFrame, radius: float = None, include_self: bool = False) -> pd.DataFrame:\n \"\"\"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.\n\n Args:\n query_df: Pandas DataFrame with the same features as the training data.\n radius: Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self: Boolean indicating whether to include the query ID in the neighbor results.\n\n Returns:\n pd.DataFrame: DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances.\n \"\"\"\n # Scale the query data using the same scaler as the training data\n query_scaled = self.scaler.transform(query_df[self.features])\n\n # Retrieve neighbors based on radius or standard neighbors\n if radius is not None:\n distances, indices = self.knn_model.radius_neighbors(query_scaled, radius=radius)\n else:\n distances, indices = self.knn_model.kneighbors(query_scaled)\n\n # Collect neighbor information (IDs, target values, and distances)\n query_ids = query_df[self.id_column].values\n neighbor_ids = [[self.df.iloc[idx][self.id_column] for idx in index_list] for index_list in indices]\n neighbor_targets = (\n [\n [self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0] for neighbor in index_list]\n for index_list in neighbor_ids\n ]\n if self.target\n else None\n )\n neighbor_distances = [list(dist_list) for dist_list in distances]\n\n # Automatically remove the query ID itself from the neighbor results if include_self is False\n for i, query_id in enumerate(query_ids):\n if query_id in neighbor_ids[i] and not include_self:\n idx_to_remove = neighbor_ids[i].index(query_id)\n neighbor_ids[i].pop(idx_to_remove)\n neighbor_distances[i].pop(idx_to_remove)\n if neighbor_targets:\n neighbor_targets[i].pop(idx_to_remove)\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids[i], neighbor_distances[i]), 
key=lambda x: x[1])\n neighbor_ids[i], neighbor_distances[i] = list(zip(*sorted_neighbors)) if sorted_neighbors else ([], [])\n if neighbor_targets:\n neighbor_targets[i] = [\n self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0]\n for neighbor in neighbor_ids[i]\n ]\n\n # Create and return a results DataFrame with the updated neighbor information\n result_df = pd.DataFrame(\n {\n \"query_id\": query_ids,\n \"neighbor_ids\": neighbor_ids,\n \"neighbor_distances\": neighbor_distances,\n }\n )\n\n if neighbor_targets:\n result_df[\"neighbor_targets\"] = neighbor_targets\n\n return result_df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.outliers","title":"outliers()
","text":"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def outliers(self) -> None:\n \"\"\"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.\"\"\"\n if \"target_z\" in self.df.columns:\n # Normalize Z-Scores to a 0-1 range\n self.df[\"outlier\"] = (self.df[\"target_z\"].abs() / (self.df[\"target_z\"].abs().max() + 1e-6)).clip(0, 1)\n\n elif \"target_consistency\" in self.df.columns:\n # Calculate outlier score as 1 - consistency\n self.df[\"outlier\"] = 1 - self.df[\"target_consistency\"]\n\n else:\n self.log.warning(\"No 'target_z' or 'target_consistency' column found to compute outlier scores.\")\n
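A minimal sketch of the 'target_z' branch above, in plain Python rather than pandas. The function name `outlier_scores` and the empty-input guard are assumptions made for illustration and are not part of SageWorks.

```python
def outlier_scores(z_scores, eps=1e-6):
    # Hypothetical helper: scale absolute z-scores by the largest one (plus eps)
    if not z_scores:
        return []  # assumption: an empty input yields an empty result
    abs_z = [abs(z) for z in z_scores]
    max_abs = max(abs_z) + eps  # eps avoids division by zero when all z-scores are 0
    # Clip to the 0-1 range, mirroring the .clip(0, 1) call in the source
    return [min(max(a / max_abs, 0.0), 1.0) for a in abs_z]


scores = outlier_scores([0.5, -2.0, 1.0])
```

Because of the small `eps` term, the largest score lands just below 1.0 rather than exactly on it.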
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_consistency","title":"target_consistency()
","text":"Compute a Neighborhood Consistency Score for CATEGORICAL targets.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_consistency(self) -> None:\n \"\"\"Compute a Neighborhood Consistency Score for CATEGORICAL targets.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for neighborhood consistency computation.\")\n return\n\n # Get the neighbors and distances for each internal observation (already excludes the query)\n distances, indices = self.knn_model.kneighbors()\n\n # Calculate the Neighborhood Consistency Score for each observation\n consistency_scores = []\n for idx, idx_list in enumerate(indices):\n query_target = self.df.iloc[idx][self.target] # Get current observation's target value\n\n # Get the neighbors' target values\n neighbor_targets = self.df.iloc[idx_list][self.target]\n\n # Calculate the proportion of neighbors that have the same category as the query observation\n consistency_score = (neighbor_targets == query_target).mean()\n consistency_scores.append(consistency_score)\n\n # Add the 'target_consistency' column to the internal dataframe\n self.df[\"target_consistency\"] = consistency_scores\n
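The consistency score is the fraction of an observation's neighbors that share its category. The sketch below shows that calculation on plain lists. The name `consistency_scores`, the toy data, and the 0.0 returned for an empty neighbor list are assumptions for illustration only.

```python
def consistency_scores(targets, neighbor_indices):
    # Hypothetical helper: targets[i] is the category of observation i,
    # neighbor_indices[i] lists the positions of its neighbors (self excluded)
    scores = []
    for idx, neighbors in enumerate(neighbor_indices):
        same = sum(1 for n in neighbors if targets[n] == targets[idx])
        # Assumption: an observation with no neighbors gets a score of 0.0
        scores.append(same / len(neighbors) if neighbors else 0.0)
    return scores


targets = ['a', 'a', 'b', 'a']
neighbor_indices = [[1, 3], [0, 2], [0, 1], [0, 1]]
scores = consistency_scores(targets, neighbor_indices)
```

Here the third observation is the only 'b' among 'a' neighbors, so it receives the lowest score and, via outliers(), the highest outlier value.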
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_summary","title":"target_summary(query_id)
","text":"WIP: Provide a summary of target values in the neighborhood of the given query ID
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_summary(self, query_id: Union[str, int]) -> pd.DataFrame:\n \"\"\"WIP: Provide a summary of target values in the neighborhood of the given query ID\"\"\"\n neighbors_df = self.neighbors(query_id, include_self=False)\n if self.target and not neighbors_df.empty:\n summary_stats = neighbors_df[self.target].describe()\n return pd.DataFrame(summary_stats).transpose()\n else:\n self.log.warning(f\"No target values found for neighbors of Query ID '{query_id}'.\")\n return pd.DataFrame()\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_z_scores","title":"target_z_scores()
","text":"Compute Z-Scores for NUMERIC target values.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_z_scores(self) -> None:\n \"\"\"Compute Z-Scores for NUMERIC target values.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for Z-Score computation.\")\n return\n\n # Get the neighbors and distances for each internal observation\n distances, indices = self.knn_model.kneighbors()\n\n # Retrieve all neighbor target values in a single operation\n neighbor_targets = self.df[self.target].values[indices] # Shape will be (n_samples, n_neighbors)\n\n # Compute the mean and std along the neighbors axis (axis=1)\n neighbor_means = neighbor_targets.mean(axis=1)\n neighbor_stds = neighbor_targets.std(axis=1, ddof=0)\n\n # Vectorized Z-score calculation\n current_targets = self.df[self.target].values\n z_scores = np.where(neighbor_stds == 0, 0.0, (current_targets - neighbor_means) / neighbor_stds)\n\n # Assign the computed Z-Scores back to the DataFrame\n self.df[\"target_z\"] = z_scores\n
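The z-score above compares each observation's target with the mean and population standard deviation (ddof=0) of its neighbors' targets. A small stdlib-only sketch of the same arithmetic follows. The name `neighbor_z_scores` and the toy data are assumptions, not SageWorks API.

```python
from statistics import fmean, pstdev


def neighbor_z_scores(targets, neighbor_indices):
    # Hypothetical helper: z-score of each target relative to its neighbors' targets
    z_scores = []
    for idx, neighbors in enumerate(neighbor_indices):
        values = [targets[n] for n in neighbors]
        mean = fmean(values)
        std = pstdev(values)  # population standard deviation, i.e. ddof=0
        # A zero spread maps to a z-score of 0.0, as in the source
        z_scores.append(0.0 if std == 0 else (targets[idx] - mean) / std)
    return z_scores


targets = [10.0, 12.0, 14.0, 50.0]
neighbor_indices = [[1, 2], [0, 2], [0, 1], [1, 2]]
z = neighbor_z_scores(targets, neighbor_indices)
```

The last observation (50.0) sits far from neighbors whose targets are 12.0 and 14.0, so its z-score is by far the largest.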
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator","title":"ResidualsCalculator
","text":" Bases: BaseEstimator
, TransformerMixin
A custom transformer for calculating residuals using cross-validation or an endpoint.
This transformer performs K-Fold cross-validation (if no endpoint is provided), or it uses the endpoint to generate predictions and compute residuals. It adds 'prediction', 'residuals', 'residuals_abs', 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns to the input DataFrame.
Attributes:
Name | Type | Description
model_class | Union[RegressorMixin, XGBRegressor] | The machine learning model class used for predictions.
n_splits | int | Number of splits for cross-validation.
random_state | int | Random state for reproducibility.
endpoint | Optional | The SageWorks endpoint object for running inference, if provided.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
class ResidualsCalculator(BaseEstimator, TransformerMixin):\n \"\"\"\n A custom transformer for calculating residuals using cross-validation or an endpoint.\n\n This transformer performs K-Fold cross-validation (if no endpoint is provided), or it uses the endpoint\n to generate predictions and compute residuals. It adds 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns to the input DataFrame.\n\n Attributes:\n model_class (Union[RegressorMixin, XGBRegressor]): The machine learning model class used for predictions.\n n_splits (int): Number of splits for cross-validation.\n random_state (int): Random state for reproducibility.\n endpoint (Optional): The SageWorks endpoint object for running inference, if provided.\n \"\"\"\n\n def __init__(\n self,\n endpoint: Optional[object] = None,\n reference_model_class: Union[RegressorMixin, XGBRegressor] = XGBRegressor,\n ):\n \"\"\"\n Initializes the ResidualsCalculator with the specified parameters.\n\n Args:\n endpoint (Optional): A SageWorks endpoint object to run inference, if available.\n reference_model_class (Union[RegressorMixin, XGBRegressor]): The reference model class for predictions.\n \"\"\"\n self.n_splits = 5\n self.random_state = 42\n self.reference_model_class = reference_model_class # Store the class, instantiate the model later\n self.reference_model = None # Lazy model initialization\n self.endpoint = endpoint # Use this endpoint for inference if provided\n self.X = None\n self.y = None\n\n def fit(self, X: pd.DataFrame, y: pd.Series) -> BaseEstimator:\n \"\"\"\n Fits the model. 
If no endpoint is provided, fitting involves storing the input data\n and initializing a reference model.\n\n Args:\n X (pd.DataFrame): The input features.\n y (pd.Series): The target variable.\n\n Returns:\n self: Returns an instance of self.\n \"\"\"\n self.X = X\n self.y = y\n\n if self.endpoint is None:\n # Only initialize the reference model if no endpoint is provided\n self.reference_model = self.reference_model_class()\n return self\n\n def transform(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: The transformed DataFrame with additional columns.\n \"\"\"\n check_is_fitted(self, [\"X\", \"y\"]) # Ensure fit has been called\n\n if self.endpoint:\n # If an endpoint is provided, run inference on the full data\n result_df = self._run_inference_via_endpoint(X)\n else:\n # If no endpoint, perform cross-validation and full model fitting\n result_df = self._run_cross_validation(X)\n\n return result_df\n\n def _run_cross_validation(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Handles the cross-validation process when no endpoint is provided.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: DataFrame with predictions and residuals from cross-validation and full model fit.\n \"\"\"\n kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)\n\n # Initialize pandas Series to store predictions and residuals, aligned by index\n predictions = pd.Series(index=self.y.index, dtype=np.float64)\n residuals = pd.Series(index=self.y.index, dtype=np.float64)\n residuals_abs = pd.Series(index=self.y.index, dtype=np.float64)\n\n # Perform cross-validation and collect predictions and residuals\n for train_index, test_index in kf.split(self.X):\n X_train, X_test = self.X.iloc[train_index], 
self.X.iloc[test_index]\n y_train, y_test = self.y.iloc[train_index], self.y.iloc[test_index]\n\n # Fit the model on the training data\n self.reference_model.fit(X_train, y_train)\n\n # Predict on the test data\n y_pred = self.reference_model.predict(X_test)\n\n # Compute residuals and absolute residuals\n residuals_fold = y_test - y_pred\n residuals_abs_fold = np.abs(residuals_fold)\n\n # Place the predictions and residuals in the correct positions based on index\n predictions.iloc[test_index] = y_pred\n residuals.iloc[test_index] = residuals_fold\n residuals_abs.iloc[test_index] = residuals_abs_fold\n\n # Train on all data and compute residuals for 100% training\n self.reference_model.fit(self.X, self.y)\n y_pred_100 = self.reference_model.predict(self.X)\n residuals_100 = self.y - y_pred_100\n residuals_100_abs = np.abs(residuals_100)\n\n # Create a copy of the provided DataFrame and add the new columns\n result_df = X.copy()\n result_df[\"prediction\"] = predictions\n result_df[\"residuals\"] = residuals\n result_df[\"residuals_abs\"] = residuals_abs\n result_df[\"prediction_100\"] = y_pred_100\n result_df[\"residuals_100\"] = residuals_100\n result_df[\"residuals_100_abs\"] = residuals_100_abs\n result_df[self.y.name] = self.y # Add the target column back\n\n return result_df\n\n def _run_inference_via_endpoint(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Handles the inference process when an endpoint is provided.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: DataFrame with predictions and residuals from the endpoint.\n \"\"\"\n # Run inference on all data using the endpoint (include the target column)\n X = X.copy()\n X.loc[:, self.y.name] = self.y\n results_df = self.endpoint.inference(X)\n predictions = results_df[\"prediction\"]\n\n # Compute residuals and residuals_abs based on the endpoint's predictions\n residuals = self.y - predictions\n residuals_abs = np.abs(residuals)\n\n # To maintain consistency, populate 
both 'prediction' and 'prediction_100' with the same values\n result_df = X.copy()\n result_df[\"prediction\"] = predictions\n result_df[\"residuals\"] = residuals\n result_df[\"residuals_abs\"] = residuals_abs\n result_df[\"prediction_100\"] = predictions\n result_df[\"residuals_100\"] = residuals\n result_df[\"residuals_100_abs\"] = residuals_abs\n\n return result_df\n
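The cross-validation path above produces out-of-fold predictions: each row is predicted by a model that never saw it during training. The stdlib-only sketch below illustrates the idea with a stand-in model that predicts the training mean. The function name `out_of_fold_residuals`, the fold assignment, and the mean predictor are all assumptions for illustration; the class itself uses scikit-learn's KFold and the configured reference model.

```python
import random
from statistics import fmean


def out_of_fold_residuals(y, n_splits=5, seed=42):
    # Hypothetical helper: shuffle row positions, then deal them into n_splits folds
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]

    predictions = [None] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [y[i] for i in indices if i not in held_out]
        pred = fmean(train)  # stand-in model: predict the mean of the training rows
        for i in fold:
            predictions[i] = pred  # store the prediction at the row's own position

    residuals = [yi - pi for yi, pi in zip(y, predictions)]
    return predictions, residuals


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
predictions, residuals = out_of_fold_residuals(y)
residuals_abs = [abs(r) for r in residuals]
```

Out-of-fold residuals are usually larger than residuals from a model trained on all rows, which is why the class reports both the cross-validated columns and the '_100' columns.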
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.__init__","title":"__init__(endpoint=None, reference_model_class=XGBRegressor)
","text":"Initializes the ResidualsCalculator with the specified parameters.
Parameters:
Name | Type | Description | Default
endpoint | Optional | A SageWorks endpoint object to run inference, if available. | None
reference_model_class | Union[RegressorMixin, XGBRegressor] | The reference model class for predictions. | XGBRegressor
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def __init__(\n self,\n endpoint: Optional[object] = None,\n reference_model_class: Union[RegressorMixin, XGBRegressor] = XGBRegressor,\n):\n \"\"\"\n Initializes the ResidualsCalculator with the specified parameters.\n\n Args:\n endpoint (Optional): A SageWorks endpoint object to run inference, if available.\n reference_model_class (Union[RegressorMixin, XGBRegressor]): The reference model class for predictions.\n \"\"\"\n self.n_splits = 5\n self.random_state = 42\n self.reference_model_class = reference_model_class # Store the class, instantiate the model later\n self.reference_model = None # Lazy model initialization\n self.endpoint = endpoint # Use this endpoint for inference if provided\n self.X = None\n self.y = None\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.fit","title":"fit(X, y)
","text":"Fits the model. If no endpoint is provided, fitting involves storing the input data and initializing a reference model.
Parameters:
Name | Type | Description | Default
X | DataFrame | The input features. | required
y | Series | The target variable. | required
Returns:
Name | Type | Description
self | BaseEstimator | Returns an instance of self.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def fit(self, X: pd.DataFrame, y: pd.Series) -> BaseEstimator:\n \"\"\"\n Fits the model. If no endpoint is provided, fitting involves storing the input data\n and initializing a reference model.\n\n Args:\n X (pd.DataFrame): The input features.\n y (pd.Series): The target variable.\n\n Returns:\n self: Returns an instance of self.\n \"\"\"\n self.X = X\n self.y = y\n\n if self.endpoint is None:\n # Only initialize the reference model if no endpoint is provided\n self.reference_model = self.reference_model_class()\n return self\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.transform","title":"transform(X)
","text":"Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs', 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.
Parameters:
Name | Type | Description | Default
X | DataFrame | The input features. | required
Returns:
Type | Description
DataFrame | pd.DataFrame: The transformed DataFrame with additional columns.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def transform(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: The transformed DataFrame with additional columns.\n \"\"\"\n check_is_fitted(self, [\"X\", \"y\"]) # Ensure fit has been called\n\n if self.endpoint:\n # If an endpoint is provided, run inference on the full data\n result_df = self._run_inference_via_endpoint(X)\n else:\n # If no endpoint, perform cross-validation and full model fitting\n result_df = self._run_cross_validation(X)\n\n return result_df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction","title":"DimensionalityReduction
","text":"Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
class DimensionalityReduction:\n def __init__(self):\n \"\"\"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.projection_model = None\n self.features = None\n\n def fit_transform(self, df: pd.DataFrame, features: list = None, projection: str = \"TSNE\") -> pd.DataFrame:\n \"\"\"Fit and Transform the DataFrame\n Args:\n df: Pandas DataFrame\n features: List of feature column names (default: None)\n projection: The projection model to use (TSNE, MDS or PCA, default: TSNE)\n Returns:\n Pandas DataFrame with new columns x and y\n \"\"\"\n\n # If no features are given, identify all numeric columns\n if features is None:\n features = [x for x in df.select_dtypes(include=\"number\").columns.tolist() if not x.endswith(\"id\")]\n # Also drop group_count if it exists\n features = [x for x in features if x != \"group_count\"]\n self.log.info(\"No features given, auto identifying numeric columns...\")\n self.log.info(f\"{features}\")\n self.features = features\n\n # Sanity checks\n if not all(column in df.columns for column in self.features):\n self.log.critical(\"Some features are missing in the DataFrame\")\n return df\n if len(self.features) < 2:\n self.log.critical(\"At least two features are required\")\n return df\n if df.empty:\n self.log.critical(\"DataFrame is empty\")\n return df\n\n # Most projection models will fail if there are any NaNs in the data\n # So we'll fill NaNs with the mean value for that column\n # (assign the result back; chained inplace fillna is deprecated in recent pandas)\n for col in df[self.features].columns:\n df[col] = df[col].fillna(df[col].mean())\n\n # Normalize the features\n scaler = StandardScaler()\n normalized_data = scaler.fit_transform(df[self.features])\n df[self.features] = normalized_data\n\n # Project the multidimensional features onto an x,y plane\n self.log.info(\"Projecting features onto an x,y plane...\")\n\n # Perform the projection\n if projection == \"TSNE\":\n # Perplexity is a hyperparameter that controls the number 
of neighbors used to compute the manifold\n # The number of neighbors should be less than the number of samples\n perplexity = min(40, len(df) - 1)\n self.log.info(f\"Perplexity: {perplexity}\")\n self.projection_model = TSNE(perplexity=perplexity)\n elif projection == \"MDS\":\n self.projection_model = MDS(n_components=2, random_state=0)\n elif projection == \"PCA\":\n self.projection_model = PCA(n_components=2)\n\n # Fit the projection model\n # Hack PCA + TSNE to work together\n projection = self.projection_model.fit_transform(df[self.features])\n\n # Put the projection results back into the given DataFrame\n df[\"x\"] = projection[:, 0] # Projection X Column\n df[\"y\"] = projection[:, 1] # Projection Y Column\n\n # Jitter the data to resolve coincident points\n # df = self.resolve_coincident_points(df)\n\n # Return the DataFrame with the new columns\n return df\n\n @staticmethod\n def resolve_coincident_points(df: pd.DataFrame):\n \"\"\"Resolve coincident points in a DataFrame\n Args:\n df(pd.DataFrame): The DataFrame to resolve coincident points in\n Returns:\n pd.DataFrame: The DataFrame with resolved coincident points\n \"\"\"\n # Adding Jitter to the projection\n x_scale = (df[\"x\"].max() - df[\"x\"].min()) * 0.1\n y_scale = (df[\"y\"].max() - df[\"y\"].min()) * 0.1\n df[\"x\"] += np.random.normal(-x_scale, +x_scale, len(df))\n df[\"y\"] += np.random.normal(-y_scale, +y_scale, len(df))\n return df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.__init__","title":"__init__()
","text":"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
def __init__(self):\n \"\"\"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.projection_model = None\n self.features = None\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.fit_transform","title":"fit_transform(df, features=None, projection='TSNE')
","text":"Fit and Transform the DataFrame Args: df: Pandas DataFrame features: List of feature column names (default: None) projection: The projection model to use (TSNE, MDS or PCA, default: PCA) Returns: Pandas DataFrame with new columns x and y
Source code insrc/sageworks/algorithms/dataframe/dimensionality_reduction.py
def fit_transform(self, df: pd.DataFrame, features: list = None, projection: str = \"TSNE\") -> pd.DataFrame:\n \"\"\"Fit and Transform the DataFrame\n Args:\n df: Pandas DataFrame\n features: List of feature column names (default: None)\n projection: The projection model to use (TSNE, MDS or PCA, default: TSNE)\n Returns:\n Pandas DataFrame with new columns x and y\n \"\"\"\n\n # If no features are given, identify all numeric columns\n if features is None:\n features = [x for x in df.select_dtypes(include=\"number\").columns.tolist() if not x.endswith(\"id\")]\n # Also drop group_count if it exists\n features = [x for x in features if x != \"group_count\"]\n self.log.info(\"No features given, auto identifying numeric columns...\")\n self.log.info(f\"{features}\")\n self.features = features\n\n # Sanity checks\n if not all(column in df.columns for column in self.features):\n self.log.critical(\"Some features are missing in the DataFrame\")\n return df\n if len(self.features) < 2:\n self.log.critical(\"At least two features are required\")\n return df\n if df.empty:\n self.log.critical(\"DataFrame is empty\")\n return df\n\n # Most projection models will fail if there are any NaNs in the data\n # So we'll fill NaNs with the mean value for that column\n # (assign the result back; chained inplace fillna is deprecated in recent pandas)\n for col in df[self.features].columns:\n df[col] = df[col].fillna(df[col].mean())\n\n # Normalize the features\n scaler = StandardScaler()\n normalized_data = scaler.fit_transform(df[self.features])\n df[self.features] = normalized_data\n\n # Project the multidimensional features onto an x,y plane\n self.log.info(\"Projecting features onto an x,y plane...\")\n\n # Perform the projection\n if projection == \"TSNE\":\n # Perplexity is a hyperparameter that controls the number of neighbors used to compute the manifold\n # The number of neighbors should be less than the number of samples\n perplexity = min(40, len(df) - 1)\n self.log.info(f\"Perplexity: {perplexity}\")\n self.projection_model = 
TSNE(perplexity=perplexity)\n elif projection == \"MDS\":\n self.projection_model = MDS(n_components=2, random_state=0)\n elif projection == \"PCA\":\n self.projection_model = PCA(n_components=2)\n\n # Fit the projection model\n # Hack PCA + TSNE to work together\n projection = self.projection_model.fit_transform(df[self.features])\n\n # Put the projection results back into the given DataFrame\n df[\"x\"] = projection[:, 0] # Projection X Column\n df[\"y\"] = projection[:, 1] # Projection Y Column\n\n # Jitter the data to resolve coincident points\n # df = self.resolve_coincident_points(df)\n\n # Return the DataFrame with the new columns\n return df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.resolve_coincident_points","title":"resolve_coincident_points(df)
staticmethod
","text":"Resolve coincident points in a DataFrame Args: df(pd.DataFrame): The DataFrame to resolve coincident points in Returns: pd.DataFrame: The DataFrame with resolved coincident points
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
@staticmethod\ndef resolve_coincident_points(df: pd.DataFrame):\n \"\"\"Resolve coincident points in a DataFrame\n Args:\n df(pd.DataFrame): The DataFrame to resolve coincident points in\n Returns:\n pd.DataFrame: The DataFrame with resolved coincident points\n \"\"\"\n # Adding Jitter to the projection\n x_scale = (df[\"x\"].max() - df[\"x\"].min()) * 0.1\n y_scale = (df[\"y\"].max() - df[\"y\"].min()) * 0.1\n df[\"x\"] += np.random.normal(-x_scale, +x_scale, len(df))\n df[\"y\"] += np.random.normal(-y_scale, +y_scale, len(df))\n return df\n
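A note on the jitter call above: NumPy's `normal(loc, scale, size)` takes the mean first and the standard deviation second, so passing `-x_scale` as the first argument shifts every point by roughly that amount on average. If zero-centered jitter is what is wanted, a sketch along these lines may be closer to the intent. The helper name `jitter`, the `fraction` and `seed` parameters, and the use of the stdlib `random` module are assumptions for illustration, not the library's implementation.

```python
import random


def jitter(values, fraction=0.1, seed=None):
    # Hypothetical helper: add zero-mean Gaussian noise scaled to the data range
    rng = random.Random(seed)
    scale = (max(values) - min(values)) * fraction
    # mean 0.0 keeps the cloud centered; scale is the standard deviation
    return [v + rng.gauss(0.0, scale) for v in values]


xs = [0.0, 0.0, 1.0, 1.0, 2.0]
jittered = jitter(xs, seed=0)
```

With a fixed seed the result is reproducible, and a column whose values are all identical is returned unchanged because the scale is zero.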
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.test","title":"test()
","text":"Test for the Dimensionality Reduction Class
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
def test():\n \"\"\"Test for the Dimensionality Reduction Class\"\"\"\n # Set some pandas options\n pd.set_option(\"display.max_columns\", None)\n pd.set_option(\"display.width\", 1000)\n\n # Make some fake data\n data = {\n \"ID\": [\n \"id_0\",\n \"id_0\",\n \"id_2\",\n \"id_3\",\n \"id_4\",\n \"id_5\",\n \"id_6\",\n \"id_7\",\n \"id_8\",\n \"id_9\",\n ],\n \"feat1\": [1.0, 1.0, 1.1, 3.0, 4.0, 1.0, 1.0, 1.1, 3.0, 4.0],\n \"feat2\": [1.0, 1.0, 1.1, 3.0, 4.0, 1.0, 1.0, 1.1, 3.0, 4.0],\n \"feat3\": [0.1, 0.1, 0.2, 1.6, 2.5, 0.1, 0.1, 0.2, 1.6, 2.5],\n \"price\": [31, 60, 62, 40, 20, 31, 61, 60, 40, 20],\n }\n data_df = pd.DataFrame(data)\n features = [\"feat1\", \"feat2\", \"feat3\"]\n\n # Create the class and run the dimensionality reduction\n projection = DimensionalityReduction()\n new_df = projection.fit_transform(data_df, features=features, projection=\"TSNE\")\n\n # Check that the x and y columns were added\n assert \"x\" in new_df.columns\n assert \"y\" in new_df.columns\n\n # Output the DataFrame\n print(new_df)\n
"},{"location":"data_algorithms/dataframes/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"data_algorithms/graphs/overview/","title":"Graph Algorithms","text":"Graph Algorithms
WIP: These classes are under active development, and both their API and functionality are subject to change.
Graph Algorithms
Docs TBD
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph","title":"ProximityGraph
","text":"Build a proximity graph of the nearest neighbors based on feature space.
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
class ProximityGraph:\n \"\"\"\n Build a proximity graph of the nearest neighbors based on feature space.\n \"\"\"\n\n def __init__(self, n_neighbors: int = 5):\n \"\"\"Initialize the ProximityGraph with the specified parameters.\n\n Args:\n n_neighbors (int): Number of neighbors to consider (default: 5)\n \"\"\"\n self.n_neighbors = n_neighbors\n self.nx_graph = nx.Graph()\n\n def build_graph(\n self,\n df: pd.DataFrame,\n features: list,\n id_column: str,\n target: str,\n store_features=True,\n ) -> nx.Graph:\n \"\"\"\n Processes the input DataFrame and builds a proximity graph.\n\n Args:\n df (pd.DataFrame): The input DataFrame containing feature columns.\n features (list): List of feature column names to be used for building the proximity graph.\n id_column (str): Name of the ID column in the DataFrame.\n target (str): Name of the target column in the DataFrame.\n store_features (bool): Whether to store the features as node attributes (default: True).\n\n Returns:\n nx.Graph: The proximity graph as a NetworkX graph.\n \"\"\"\n # Drop NaNs from the DataFrame using the provided utility\n df = drop_nans(df)\n\n # Initialize FeatureSpaceProximity with the input DataFrame and the specified features\n knn_spider = FeatureSpaceProximity(\n df,\n features=features,\n id_column=id_column,\n target=target,\n neighbors=self.n_neighbors,\n )\n\n # Use FeatureSpaceProximity to get all neighbor indices and distances\n indices, distances = knn_spider.get_neighbor_indices_and_distances()\n\n # Compute max distance for scaling (to [0, 1])\n max_distance = distances.max()\n\n # Initialize an empty graph\n self.nx_graph = nx.Graph()\n\n # Use the ID column for node IDs instead of relying on the DataFrame index\n node_ids = df[id_column].values\n\n # Add nodes with their features as attributes using the ID column\n for node_id in node_ids:\n if store_features:\n self.nx_graph.add_node(\n node_id, **df[df[id_column] == node_id].iloc[0].to_dict()\n ) # Use .iloc[0] for correct node 
attributes\n else:\n self.nx_graph.add_node(node_id)\n\n # Add edges with weights based on inverse distance\n for i, neighbors in enumerate(indices):\n one_edge_added = False\n for j, neighbor_idx in enumerate(neighbors):\n if i != neighbor_idx:\n # Compute the weight of the edge (inverse of distance)\n weight = 1.0 - (distances[i][j] / max_distance) # Scale to [0, 1]\n\n # Map back to the ID column instead of the DataFrame index\n src_node = node_ids[i]\n dst_node = node_ids[neighbor_idx]\n\n # Add the edge to the graph (if the weight is greater than 0.1)\n if weight > 0.1 or not one_edge_added:\n self.nx_graph.add_edge(src_node, dst_node, weight=weight)\n one_edge_added = True\n\n # Return the NetworkX graph\n return self.nx_graph\n\n def get_neighborhood(self, node_id: Union[str, int], radius: int = 1) -> nx.Graph:\n \"\"\"\n Get a subgraph containing nodes within a given number of hops around a specific node.\n\n Args:\n node_id: The ID of the node to center the neighborhood around.\n radius: The number of hops to consider around the node (default: 1).\n\n Returns:\n nx.Graph: A subgraph containing the specified neighborhood.\n \"\"\"\n # Use NetworkX's ego_graph to extract the neighborhood within the given radius\n if node_id in self.nx_graph:\n return nx.ego_graph(self.nx_graph, node_id, radius=radius)\n else:\n raise ValueError(f\"Node ID '{node_id}' not found in the graph.\")\n
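The edge rule in build_graph can be read as: convert each distance to a weight of 1 - distance / max_distance, keep edges whose weight exceeds 0.1, and always keep the first candidate edge of a node so that no node is left isolated. The sketch below reproduces that rule on plain lists, without NetworkX. The name `proximity_edges`, the `min_weight` parameter, and the toy data are assumptions for illustration.

```python
def proximity_edges(indices, distances, min_weight=0.1):
    # Hypothetical helper: indices[i][j] is the j-th neighbor of node i,
    # distances[i][j] is the distance to that neighbor
    max_distance = max(max(row) for row in distances)  # assumed to be greater than 0
    edges = []
    for i, neighbors in enumerate(indices):
        one_edge_added = False
        for j, neighbor_idx in enumerate(neighbors):
            if i == neighbor_idx:
                continue  # skip self-loops
            weight = 1.0 - distances[i][j] / max_distance  # scaled to [0, 1]
            # Keep strong edges, plus the first edge of every node regardless of weight
            if weight > min_weight or not one_edge_added:
                edges.append((i, neighbor_idx, weight))
                one_edge_added = True
    return edges


indices = [[1, 2], [0, 2], [1, 0]]
distances = [[1.0, 4.0], [1.0, 3.8], [3.8, 4.0]]
edges = proximity_edges(indices, distances)
```

In this toy data node 2 is far from everything, yet it still receives one weak edge to its closest neighbor because of the first-edge rule.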
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.__init__","title":"__init__(n_neighbors=5)
","text":"Initialize the ProximityGraph with the specified parameters.
Parameters:
Name Type Description Defaultn_neighbors
int
Number of neighbors to consider (default: 5)
5
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
def __init__(self, n_neighbors: int = 5):\n \"\"\"Initialize the ProximityGraph with the specified parameters.\n\n Args:\n n_neighbors (int): Number of neighbors to consider (default: 5)\n \"\"\"\n self.n_neighbors = n_neighbors\n self.nx_graph = nx.Graph()\n
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.build_graph","title":"build_graph(df, features, id_column, target, store_features=True)
","text":"Processes the input DataFrame and builds a proximity graph.
Parameters:
Name Type Description Defaultdf
DataFrame
The input DataFrame containing feature columns.
requiredfeatures
list
List of feature column names to be used for building the proximity graph.
requiredid_column
str
Name of the ID column in the DataFrame.
requiredtarget
str
Name of the target column in the DataFrame.
requiredstore_features
bool
Whether to store the features as node attributes (default: True).
True
Returns:
Type DescriptionGraph
nx.Graph: The proximity graph as a NetworkX graph.
Source code insrc/sageworks/algorithms/graph/light/proximity_graph.py
def build_graph(\n self,\n df: pd.DataFrame,\n features: list,\n id_column: str,\n target: str,\n store_features=True,\n) -> nx.Graph:\n \"\"\"\n Processes the input DataFrame and builds a proximity graph.\n\n Args:\n df (pd.DataFrame): The input DataFrame containing feature columns.\n features (list): List of feature column names to be used for building the proximity graph.\n id_column (str): Name of the ID column in the DataFrame.\n target (str): Name of the target column in the DataFrame.\n store_features (bool): Whether to store the features as node attributes (default: True).\n\n Returns:\n nx.Graph: The proximity graph as a NetworkX graph.\n \"\"\"\n # Drop NaNs from the DataFrame using the provided utility\n df = drop_nans(df)\n\n # Initialize FeatureSpaceProximity with the input DataFrame and the specified features\n knn_spider = FeatureSpaceProximity(\n df,\n features=features,\n id_column=id_column,\n target=target,\n neighbors=self.n_neighbors,\n )\n\n # Use FeatureSpaceProximity to get all neighbor indices and distances\n indices, distances = knn_spider.get_neighbor_indices_and_distances()\n\n # Compute max distance for scaling (to [0, 1])\n max_distance = distances.max()\n\n # Initialize an empty graph\n self.nx_graph = nx.Graph()\n\n # Use the ID column for node IDs instead of relying on the DataFrame index\n node_ids = df[id_column].values\n\n # Add nodes with their features as attributes using the ID column\n for node_id in node_ids:\n if store_features:\n self.nx_graph.add_node(\n node_id, **df[df[id_column] == node_id].iloc[0].to_dict()\n ) # Use .iloc[0] for correct node attributes\n else:\n self.nx_graph.add_node(node_id)\n\n # Add edges with weights based on inverse distance\n for i, neighbors in enumerate(indices):\n one_edge_added = False\n for j, neighbor_idx in enumerate(neighbors):\n if i != neighbor_idx:\n # Compute the weight of the edge (inverse of distance)\n weight = 1.0 - (distances[i][j] / max_distance) # Scale to [0, 1]\n\n # Map 
back to the ID column instead of the DataFrame index\n src_node = node_ids[i]\n dst_node = node_ids[neighbor_idx]\n\n # Add the edge to the graph (if the weight is greater than 0.1)\n if weight > 0.1 or not one_edge_added:\n self.nx_graph.add_edge(src_node, dst_node, weight=weight)\n one_edge_added = True\n\n # Return the NetworkX graph\n return self.nx_graph\n
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.get_neighborhood","title":"get_neighborhood(node_id, radius=1)
","text":"Get a subgraph containing nodes within a given number of hops around a specific node.
Parameters:
Name Type Description Defaultnode_id
Union[str, int]
The ID of the node to center the neighborhood around.
requiredradius
int
The number of hops to consider around the node (default: 1).
1
Returns:
Type DescriptionGraph
nx.Graph: A subgraph containing the specified neighborhood.
Source code insrc/sageworks/algorithms/graph/light/proximity_graph.py
def get_neighborhood(self, node_id: Union[str, int], radius: int = 1) -> nx.Graph:\n \"\"\"\n Get a subgraph containing nodes within a given number of hops around a specific node.\n\n Args:\n node_id: The ID of the node to center the neighborhood around.\n radius: The number of hops to consider around the node (default: 1).\n\n Returns:\n nx.Graph: A subgraph containing the specified neighborhood.\n \"\"\"\n # Use NetworkX's ego_graph to extract the neighborhood within the given radius\n if node_id in self.nx_graph:\n return nx.ego_graph(self.nx_graph, node_id, radius=radius)\n else:\n raise ValueError(f\"Node ID '{node_id}' not found in the graph.\")\n
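The edge rule used by build_graph can be illustrated without NetworkX or FeatureSpaceProximity. What follows is a minimal sketch in plain Python, assuming the neighbor indices and distances are already available as nested lists. The helper name scaled_edges, the min_weight parameter, and the sample data are hypothetical and are not part of SageWorks.

```python
def scaled_edges(node_ids, indices, distances, min_weight=0.1):
    # Largest distance over all neighbor pairs, used to scale weights into [0, 1]
    # (assumes at least one non-zero distance, otherwise the division below fails)
    max_distance = max(max(row) for row in distances)
    edges = []
    for i, neighbors in enumerate(indices):
        one_edge_added = False
        for j, neighbor_idx in enumerate(neighbors):
            if i == neighbor_idx:
                continue  # skip self-matches
            # Closer neighbors get weights nearer to 1.0
            weight = 1.0 - (distances[i][j] / max_distance)
            # Keep strong edges, plus the first edge seen for each source node
            if weight > min_weight or not one_edge_added:
                edges.append((node_ids[i], node_ids[neighbor_idx], weight))
                one_edge_added = True
    return edges


# Hypothetical sample: three nodes, each listing itself first and then two neighbors
node_ids = ['a', 'b', 'c']
indices = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]
distances = [[0.0, 1.0, 10.0], [0.0, 1.0, 9.5], [0.0, 9.5, 10.0]]
edges = scaled_edges(node_ids, indices, distances)
```

In this sample, node 'c' is far from everything, so its only edge is the weak one kept by the one_edge_added fallback.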
"},{"location":"data_algorithms/graphs/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/spark/overview/","title":"Graph Algorithms","text":"Graph Algorithms
WIP: These classes are currently actively being developed and are subject to change in both API and functionality over time.
Graph Algorithms
Docs TBD
ComputationView Class: Create a View with a subset of columns for computation purposes
"},{"location":"data_algorithms/spark/overview/#sageworks.core.views.computation_view.ComputationView","title":"ComputationView
","text":" Bases: ColumnSubsetView
ComputationView Class: Create a View with a subset of columns for computation purposes
Common Usage:\n# Create a default ComputationView\nfs = FeatureSet(\"test_features\")\ncomp_view = ComputationView.create(fs)\ndf = comp_view.pull_dataframe()\n\n# Create a ComputationView with a specific set of columns\ncomp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n
Source code in src/sageworks/core/views/computation_view.py
class ComputationView(ColumnSubsetView):\n \"\"\"ComputationView Class: Create a View with a subset of columns for computation purposes\n\n Common Usage:\n ```python\n # Create a default ComputationView\n fs = FeatureSet(\"test_features\")\n comp_view = ComputationView.create(fs)\n df = comp_view.pull_dataframe()\n\n # Create a ComputationView with a specific set of columns\n comp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"data_algorithms/spark/overview/#sageworks.core.views.computation_view.ComputationView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a ComputationView instance.
Parameters:
Name Type Description Defaultartifact
Union[DataSource, FeatureSet]
The DataSource or FeatureSet object
requiredsource_table
str
The table/view to create the view from. Defaults to None
None
column_list
Union[list[str], None]
A list of columns to include. Defaults to None.
None
column_limit
int
The max number of columns to include. Defaults to 30.
30
Returns:
Type DescriptionUnion[View, None]
Union[View, None]: The created View object (or None if failed to create the view)
Source code insrc/sageworks/core/views/computation_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"data_algorithms/spark/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/sql/overview/","title":"SQL Algorithms","text":"SQL Algorithms
One of the main benefits of SQL Algorithms is that the 'heavy lifting' is all done on the SQL Database, so if you have large datasets this is the place for you.
SQL: SQL queries that provide a wide range of functionality:
SQL based Outliers: Compute outliers for all the columns in a DataSource using SQL
SQL based Descriptive Stats: Compute Descriptive Stats for all the numeric columns in a DataSource using SQL
SQL based Correlations: Compute Correlations for all the numeric columns in a DataSource using SQL
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers","title":"Outliers
","text":"Outliers: Class to compute outliers for all the columns in a DataSource using SQL
Source code insrc/sageworks/algorithms/sql/outliers.py
class Outliers:\n \"\"\"Outliers: Class to compute outliers for all the columns in a DataSource using SQL\"\"\"\n\n def __init__(self):\n \"\"\"SQLOutliers Initialization\"\"\"\n self.outlier_group = \"unknown\"\n\n def compute_outliers(\n self, data_source: DataSourceAbstract, scale: float = 1.5, use_stddev: bool = False\n ) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5)\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers for this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Note: If use_stddev is True, then the scale parameter needs to be adjusted\n if use_stddev and scale == 1.5: # If the default scale is used, adjust it\n scale = 2.5\n\n # Compute the numeric outliers\n outlier_df = self._numeric_outliers(data_source, scale, use_stddev)\n\n # If there are no outliers, return a DataFrame with the computation columns but no rows\n if outlier_df is None:\n columns = data_source.view(\"computation\").columns\n return pd.DataFrame(columns=columns + [\"outlier_group\"])\n\n # Get the top N outliers for each outlier group\n outlier_df = self.get_top_n_outliers(outlier_df)\n\n # Make sure the dataframe isn't too big, if it's too big sample it down\n if len(outlier_df) > 300:\n log.important(f\"Outliers DataFrame is too large {len(outlier_df)}, sampling down to 300 rows\")\n outlier_df = outlier_df.sample(300)\n\n # Sort by outlier_group and reset the index\n outlier_df = outlier_df.sort_values(\"outlier_group\").reset_index(drop=True)\n\n # Shorten any long string values\n outlier_df = shorten_values(outlier_df)\n return outlier_df\n\n 
def _numeric_outliers(self, data_source: DataSourceAbstract, scale: float, use_stddev=False) -> pd.DataFrame:\n \"\"\"Internal method to compute outliers for all numeric columns\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for the IQR outlier calculation\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of all the outliers combined\n \"\"\"\n\n # Grab the column stats and descriptive stats for this DataSource\n column_stats = data_source.column_stats()\n descriptive_stats = data_source.descriptive_stats()\n\n # If there are no numeric columns, return None\n if not descriptive_stats:\n log.warning(\"No numeric columns found in the current computation view of the DataSource\")\n log.warning(\"If the data source was created from a DataFrame, ensure that the DataFrame was properly typed\")\n log.warning(\"Recommendation: Properly type the DataFrame and recreate the SageWorks artifact\")\n return None\n\n # Get the column names and types from the DataSource\n column_details = data_source.view(\"computation\").column_details()\n\n # For every column in the data_source that is numeric get the outliers\n # This loop computes the columns, lower bounds, and upper bounds for the SQL query\n log.info(\"Computing Outliers for numeric columns...\")\n numeric = [\"tinyint\", \"smallint\", \"int\", \"bigint\", \"float\", \"double\", \"decimal\"]\n columns = []\n lower_bounds = []\n upper_bounds = []\n for column, data_type in column_details.items():\n if data_type in numeric:\n # Skip columns that just have one value (or are all nans)\n if column_stats[column][\"unique\"] <= 1:\n log.info(f\"Skipping unary column {column} with value {descriptive_stats[column]['min']}\")\n continue\n\n # Skip columns that are 'binary' columns\n if column_stats[column][\"unique\"] == 2:\n log.info(f\"Skipping binary column {column}\")\n 
continue\n\n # Do they want to use the stddev instead of IQR?\n if use_stddev:\n mean = descriptive_stats[column][\"mean\"]\n stddev = descriptive_stats[column][\"stddev\"]\n lower_bound = mean - (stddev * scale)\n upper_bound = mean + (stddev * scale)\n\n # Compute the IQR for this column\n else:\n iqr = descriptive_stats[column][\"q3\"] - descriptive_stats[column][\"q1\"]\n lower_bound = descriptive_stats[column][\"q1\"] - (iqr * scale)\n upper_bound = descriptive_stats[column][\"q3\"] + (iqr * scale)\n\n # Add the column, lower bound, and upper bound to the lists\n columns.append(column)\n lower_bounds.append(lower_bound)\n upper_bounds.append(upper_bound)\n\n # Compute the SQL query\n query = self._multi_column_outlier_query(data_source, columns, lower_bounds, upper_bounds)\n outlier_df = data_source.query(query)\n\n # Label the outlier groups\n outlier_df = self._label_outlier_groups(outlier_df, columns, lower_bounds, upper_bounds)\n return outlier_df\n\n @staticmethod\n def _multi_column_outlier_query(\n data_source: DataSourceAbstract, columns: list, lower_bounds: list, upper_bounds: list\n ) -> str:\n \"\"\"Internal method to compute outliers for multiple columns\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n columns(list): The columns to compute outliers on\n lower_bounds(list): The lower bounds for outliers\n upper_bounds(list): The upper bounds for outliers\n Returns:\n str: A SQL query to compute outliers for multiple columns\n \"\"\"\n # Grab the DataSource computation table name\n table = data_source.view(\"computation\").table\n\n # Get the column names and types from the DataSource\n column_details = data_source.view(\"computation\").column_details()\n sql_columns = \", \".join([f'\"{col}\"' for col in column_details.keys()])\n\n query = f'SELECT {sql_columns} FROM \"{table}\" WHERE '\n for col, lb, ub in zip(columns, lower_bounds, upper_bounds):\n query += f\"({col} < {lb} OR {col} > {ub}) OR \"\n query = 
query[:-4]\n\n # Add a limit just in case\n query += \" LIMIT 5000\"\n return query\n\n @staticmethod\n def _label_outlier_groups(\n outlier_df: pd.DataFrame, columns: list, lower_bounds: list, upper_bounds: list\n ) -> pd.DataFrame:\n \"\"\"Internal method to label outliers by group.\n Args:\n outlier_df(pd.DataFrame): The DataFrame of outliers\n columns(list): The columns for which to compute outliers\n lower_bounds(list): The lower bounds for each column\n upper_bounds(list): The upper bounds for each column\n Returns:\n pd.DataFrame: A DataFrame with an added 'outlier_group' column, indicating the type of outlier.\n \"\"\"\n\n column_outlier_dfs = []\n for col, lb, ub in zip(columns, lower_bounds, upper_bounds):\n mask_low = outlier_df[col] < lb\n mask_high = outlier_df[col] > ub\n\n low_df = outlier_df[mask_low].copy()\n low_df[\"outlier_group\"] = f\"{col}_low\"\n\n high_df = outlier_df[mask_high].copy()\n high_df[\"outlier_group\"] = f\"{col}_high\"\n\n column_outlier_dfs.extend([low_df, high_df])\n\n # If there are no outliers, return the original DataFrame with an empty 'outlier_group' column\n if not column_outlier_dfs:\n log.critical(\"No outliers found in the data source.. 
probably something is wrong\")\n return outlier_df.assign(outlier_group=\"\")\n\n # Concatenate the DataFrames and return\n return pd.concat(column_outlier_dfs, ignore_index=True)\n\n @staticmethod\n def get_top_n_outliers(outlier_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:\n \"\"\"Function to retrieve the top N highest and lowest outliers for each outlier group.\n\n Args:\n outlier_df (pd.DataFrame): The DataFrame of outliers with 'outlier_group' column\n n (int): Number of top outliers to retrieve for each group, defaults to 10\n\n Returns:\n pd.DataFrame: A DataFrame containing the top N outliers for each outlier group\n \"\"\"\n\n def get_extreme_values(group: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Helper function to get the top N extreme values from a group.\"\"\"\n col, extreme_type = group.name.rsplit(\"_\", 1)\n if extreme_type == \"low\":\n return group.nsmallest(n, col)\n else:\n return group.nlargest(n, col)\n\n # Group by 'outlier_group' and apply the helper function, explicitly selecting columns\n top_outliers = outlier_df.groupby(\"outlier_group\", group_keys=False).apply(\n get_extreme_values, include_groups=True\n )\n return top_outliers.reset_index(drop=True)\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.__init__","title":"__init__()
","text":"SQLOutliers Initialization
Source code insrc/sageworks/algorithms/sql/outliers.py
def __init__(self):\n \"\"\"SQLOutliers Initialization\"\"\"\n self.outlier_group = \"unknown\"\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.compute_outliers","title":"compute_outliers(data_source, scale=1.5, use_stddev=False)
","text":"Compute outliers for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing outliers on scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5) use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False) Returns: pd.DataFrame: A DataFrame of outliers for this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma) The scale parameter can be adjusted to change the IQR multiplier
Source code insrc/sageworks/algorithms/sql/outliers.py
def compute_outliers(\n self, data_source: DataSourceAbstract, scale: float = 1.5, use_stddev: bool = False\n) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5)\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers for this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Note: If use_stddev is True, then the scale parameter needs to be adjusted\n if use_stddev and scale == 1.5: # If the default scale is used, adjust it\n scale = 2.5\n\n # Compute the numeric outliers\n outlier_df = self._numeric_outliers(data_source, scale, use_stddev)\n\n # If there are no outliers, return a DataFrame with the computation columns but no rows\n if outlier_df is None:\n columns = data_source.view(\"computation\").columns\n return pd.DataFrame(columns=columns + [\"outlier_group\"])\n\n # Get the top N outliers for each outlier group\n outlier_df = self.get_top_n_outliers(outlier_df)\n\n # Make sure the dataframe isn't too big, if it's too big sample it down\n if len(outlier_df) > 300:\n log.important(f\"Outliers DataFrame is too large {len(outlier_df)}, sampling down to 300 rows\")\n outlier_df = outlier_df.sample(300)\n\n # Sort by outlier_group and reset the index\n outlier_df = outlier_df.sort_values(\"outlier_group\").reset_index(drop=True)\n\n # Shorten any long string values\n outlier_df = shorten_values(outlier_df)\n return outlier_df\n
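The bounds used to flag outliers can be shown in isolation. This is a minimal sketch in plain Python, assuming the per-column descriptive stats are already available as a dictionary. The helper name outlier_bounds and the sample numbers are hypothetical and are not part of the SageWorks API.

```python
def outlier_bounds(stats, scale=1.5, use_stddev=False):
    # The default of 1.5 is an IQR multiplier, so switch to 2.5 sigma for stddev mode
    if use_stddev and scale == 1.5:
        scale = 2.5
    if use_stddev:
        # Bounds are mean +/- (scale * stddev)
        lower = stats['mean'] - stats['stddev'] * scale
        upper = stats['mean'] + stats['stddev'] * scale
    else:
        # Bounds are q1 - (scale * IQR) and q3 + (scale * IQR)
        iqr = stats['q3'] - stats['q1']
        lower = stats['q1'] - iqr * scale
        upper = stats['q3'] + iqr * scale
    return lower, upper


# Hypothetical stats for a single numeric column
stats = {'q1': 10.0, 'q3': 20.0, 'mean': 15.0, 'stddev': 4.0}
iqr_bounds = outlier_bounds(stats)
sigma_bounds = outlier_bounds(stats, use_stddev=True)
```

Values below the lower bound or above the upper bound are the ones that would be reported as low or high outliers for that column.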
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.get_top_n_outliers","title":"get_top_n_outliers(outlier_df, n=10)
staticmethod
","text":"Function to retrieve the top N highest and lowest outliers for each outlier group.
Parameters:
Name Type Description Defaultoutlier_df
DataFrame
The DataFrame of outliers with 'outlier_group' column
requiredn
int
Number of top outliers to retrieve for each group, defaults to 10
10
Returns:
Type DescriptionDataFrame
pd.DataFrame: A DataFrame containing the top N outliers for each outlier group
Source code insrc/sageworks/algorithms/sql/outliers.py
@staticmethod\ndef get_top_n_outliers(outlier_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:\n \"\"\"Function to retrieve the top N highest and lowest outliers for each outlier group.\n\n Args:\n outlier_df (pd.DataFrame): The DataFrame of outliers with 'outlier_group' column\n n (int): Number of top outliers to retrieve for each group, defaults to 10\n\n Returns:\n pd.DataFrame: A DataFrame containing the top N outliers for each outlier group\n \"\"\"\n\n def get_extreme_values(group: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Helper function to get the top N extreme values from a group.\"\"\"\n col, extreme_type = group.name.rsplit(\"_\", 1)\n if extreme_type == \"low\":\n return group.nsmallest(n, col)\n else:\n return group.nlargest(n, col)\n\n # Group by 'outlier_group' and apply the helper function, explicitly selecting columns\n top_outliers = outlier_df.groupby(\"outlier_group\", group_keys=False).apply(\n get_extreme_values, include_groups=True\n )\n return top_outliers.reset_index(drop=True)\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.descriptive_stats.descriptive_stats","title":"descriptive_stats(data_source)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing descriptive stats on Returns: dict(dict): A dictionary of descriptive stats for each column in this format {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4, 'mean': 2.5, 'stddev': 1.5}, 'col2': ...}
Source code insrc/sageworks/algorithms/sql/descriptive_stats.py
def descriptive_stats(data_source: DataSourceAbstract) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing descriptive stats on\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in this format\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4, 'mean': 2.5, 'stddev': 1.5},\n 'col2': ...}\n \"\"\"\n # Grab the DataSource computation view table name\n table = data_source.view(\"computation\").table\n\n # Figure out which columns are numeric\n num_type = [\"double\", \"float\", \"int\", \"bigint\", \"smallint\", \"tinyint\"]\n details = data_source.view(\"computation\").column_details()\n numeric = [column for column, data_type in details.items() if data_type in num_type]\n\n # Sanity Check for numeric columns\n if len(numeric) == 0:\n log.warning(\"No numeric columns found in the current computation view of the DataSource\")\n log.warning(\"If the data source was created from a DataFrame, ensure that the DataFrame was properly typed\")\n log.warning(\"Recommendation: Properly type the DataFrame and recreate the SageWorks artifact\")\n return {}\n\n # Build the query\n query = descriptive_stats_query(numeric, table)\n\n # Run the query\n log.debug(query)\n result_df = data_source.query(query)\n\n # Process the results\n # Note: The result_df is a DataFrame with a single row and a column for each stat metric\n stats_dict = result_df.to_dict(orient=\"index\")[0]\n\n # Convert the dictionary to a nested dictionary\n # Note: The keys are in the format <column>___<stat> (three underscores)\n nested_descriptive_stats = defaultdict(dict)\n for key, value in stats_dict.items():\n col1, col2 = key.split(\"___\")\n nested_descriptive_stats[col1][col2] = value\n\n # Return the nested dictionary\n return dict(nested_descriptive_stats)\n
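The step that turns the single-row query result into a nested dictionary can be tried on its own. This is a minimal sketch in plain Python, assuming the row has already been converted to a flat dictionary whose keys join the column name and the stat name with three underscores. The helper name nest_stats and the sample row are hypothetical.

```python
from collections import defaultdict


def nest_stats(flat_row, sep='___'):
    # Split each flat key into (column, stat) and group the values by column
    nested = defaultdict(dict)
    for key, value in flat_row.items():
        column, stat = key.split(sep)
        nested[column][stat] = value
    return dict(nested)


# Hypothetical flat row, shaped like the output of the descriptive stats query
flat_row = {'age___min': 1, 'age___max': 90, 'height___min': 50, 'height___max': 210}
nested = nest_stats(flat_row)
```

One consequence of this layout is that a column name which itself contains three consecutive underscores would not split into exactly two parts, so the sketch assumes column names avoid that sequence.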
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.descriptive_stats.descriptive_stats_query","title":"descriptive_stats_query(columns, table_name)
","text":"Build a query to compute the descriptive stats for all columns in a table Args: columns(list(str)): The columns to compute descriptive stats on table_name(str): The table to compute descriptive stats on Returns: str: The SQL query to compute descriptive stats
Source code insrc/sageworks/algorithms/sql/descriptive_stats.py
def descriptive_stats_query(columns: list[str], table_name: str) -> str:\n \"\"\"Build a query to compute the descriptive stats for all columns in a table\n Args:\n columns(list(str)): The columns to compute descriptive stats on\n table_name(str): The table to compute descriptive stats on\n Returns:\n str: The SQL query to compute descriptive stats\n \"\"\"\n query = f'SELECT <<column_descriptive_stats>> FROM \"{table_name}\"'\n column_descriptive_stats = \"\"\n for c in columns:\n column_descriptive_stats += (\n f'min(\"{c}\") AS \"{c}___min\", '\n f'approx_percentile(\"{c}\", 0.25) AS \"{c}___q1\", '\n f'approx_percentile(\"{c}\", 0.5) AS \"{c}___median\", '\n f'approx_percentile(\"{c}\", 0.75) AS \"{c}___q3\", '\n f'max(\"{c}\") AS \"{c}___max\", '\n f'avg(\"{c}\") AS \"{c}___mean\", '\n f'stddev(\"{c}\") AS \"{c}___stddev\", '\n )\n query = query.replace(\"<<column_descriptive_stats>>\", column_descriptive_stats[:-2])\n\n # Return the query\n return query\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.correlations.correlation_query","title":"correlation_query(columns, table_name)
","text":"Build a query to compute the correlations between columns in a table
Parameters:
Name Type Description Defaultcolumns
list(str)
The columns to compute correlations on
requiredtable_name
str
The table to compute correlations on
required
Returns:
Name Type Descriptionstr
str
The SQL query to compute correlations
Pearson correlation coefficient ranges from -1 to 1: +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
Source code insrc/sageworks/algorithms/sql/correlations.py
def correlation_query(columns: list[str], table_name: str) -> str:\n \"\"\"Build a query to compute the correlations between columns in a table\n\n Args:\n columns (list(str)): The columns to compute correlations on\n table_name (str): The table to compute correlations on\n\n Returns:\n str: The SQL query to compute correlations\n\n Notes: Pearson correlation coefficient ranges from -1 to 1:\n +1 indicates a perfect positive linear relationship.\n -1 indicates a perfect negative linear relationship.\n 0 indicates no linear relationship.\n \"\"\"\n query = f'SELECT <<cross_correlations>> FROM \"{table_name}\"'\n cross_correlations = \"\"\n for c in columns:\n for d in columns:\n if c != d:\n cross_correlations += f'corr(\"{c}\", \"{d}\") AS \"{c}__{d}\", '\n query = query.replace(\"<<cross_correlations>>\", cross_correlations[:-2])\n\n # Return the query\n return query\n
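To make the -1, 0, and +1 cases concrete, here is a minimal sketch of the Pearson coefficient in plain Python. It is only an illustration of the formula; SageWorks delegates the computation to the SQL corr() aggregate, and the helper name pearson and the sample lists are hypothetical.

```python
import math


def pearson(xs, ys):
    # Pearson r = covariance(x, y) / (stddev(x) * stddev(y))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither input is constant, otherwise the denominator is zero
    return cov / math.sqrt(var_x * var_y)


up = pearson([1, 2, 3, 4], [2, 4, 6, 8])      # perfectly increasing together
down = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # one rises as the other falls
flat = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # no linear relationship
```

The last pair is symmetric around the midpoint of the first list, which is why it has no linear relationship even though the two lists are clearly related.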
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.correlations.correlations","title":"correlations(data_source)
","text":"Compute Correlations for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing correlations on Returns: dict(dict): A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code insrc/sageworks/algorithms/sql/correlations.py
def correlations(data_source: DataSourceAbstract) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing correlations on\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n data_source.log.info(\"Computing Correlations for numeric columns...\")\n\n # Figure out which columns are numeric\n num_type = [\"double\", \"float\", \"int\", \"bigint\", \"smallint\", \"tinyint\"]\n details = data_source.view(\"computation\").column_details()\n\n # Get the numeric columns\n numeric = [column for column, data_type in details.items() if data_type in num_type]\n\n # If we have at least two numeric columns, compute the correlations\n if len(numeric) < 2:\n return {}\n\n # Grab the DataSource computation table name\n table = data_source.view(\"computation\").table\n\n # Build the query\n query = correlation_query(numeric, table)\n\n # Run the query\n log.debug(query)\n result_df = data_source.query(query)\n\n # Drop any columns that have NaNs\n result_df = result_df.dropna(axis=1)\n\n # Process the results\n # Note: The result_df is a DataFrame with a single row and a column for each pairwise correlation\n correlation_dict = result_df.to_dict(orient=\"index\")[0]\n\n # Convert the dictionary to a nested dictionary\n # Note: The keys are in the format col1__col2\n nested_corr = defaultdict(dict)\n for key, value in correlation_dict.items():\n col1, col2 = key.split(\"__\")\n nested_corr[col1][col2] = value\n\n # Sort the nested dictionaries\n sorted_dict = {}\n for key, sub_dict in nested_corr.items():\n sorted_dict[key] = {k: v for k, v in sorted(sub_dict.items(), key=lambda item: item[1], reverse=True)}\n return sorted_dict\n
"},{"location":"data_algorithms/sql/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"enterprise/","title":"SageWorks Enterprise","text":"The SageWorks API and User Interfaces cover a broad set of AWS Machine Learning services and provide easy-to-use abstractions and visualizations of your AWS ML data. We offer a wide range of options to best fit your company's needs.
Accelerate ML Pipeline development with an Enterprise License! Free Enterprise: Lite Enterprise: Standard Enterprise: Pro Python API \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 SageWorks REPL \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 AWS Onboarding \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Dashboard Plugins \u2796 \ud83d\udfe2 \ud83d\udfe2 \ud83d\udfe2 Custom Pages \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 Themes \u2796 \u2796 \ud83d\udfe2 \ud83d\udfe2 ML Pipelines \u2796 \u2796 \u2796 \ud83d\udfe2 Project Branding \u2796 \u2796 \u2796 \ud83d\udfe2 Prioritized Feature Requests \u2796 \u2796 \u2796 \ud83d\udfe2 Pricing \u2796 $1500* $3000* $4000* *USD per month, includes AWS setup, support, and training: Everything needed to accelerate your AWS ML Development team. Interested in Data Science/Engineering consulting? We have top notch Consultants with a depth and breadth of AWS ML/DS/Engineering expertise.
"},{"location":"enterprise/#try-sageworks","title":"Try SageWorks","text":"We encourage new users to try out the free version, first. We offer support in our Discord channel and our Documentation has instructions for how to get started with SageWorks. So try it out and when you're ready to accelerate your AWS ML Adventure with an Enterprise licence contact us at SageWorks Sales
"},{"location":"enterprise/#data-engineeringscience-consulting","title":"Data Engineering/Science Consulting","text":"Alongside our SageWorks Enterprise offerings, we provide comprehensive consulting services and domain expertise through our Partnerships. We specialize in AWS Machine Learning Systems and our extended team of Data Scientists and Engineers, have Masters and Ph.D. degrees in Computer Science, Chemistry, and Pharmacology. We also have a parntership with Nomic Networks to support our Network Security Clients.
Using AWS and SageWorks, our experts are equipped to deliver tailored solutions that are focused on your project needs and deliverables. For more information please touch base and we'll set up a free initial consultation: SageWorks Consulting
"},{"location":"enterprise/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales
"},{"location":"enterprise/private_saas/","title":"Benefits of a Private SaaS Architecture","text":""},{"location":"enterprise/private_saas/#self-hosted-vs-private-saas-vs-public-saas","title":"Self Hosted vs Private SaaS vs Public SaaS?","text":"At the top level your team/project is making a decision about how they are going to build, expand, support, and maintain a machine learning pipeline.
Conceptual ML Pipeline
Data \u2b95 Features \u2b95 Models \u2b95 Deployment (end-user application)\n
Concrete/Real World Example
S3 \u2b95 Glue Job \u2b95 Data Catalog \u2b95 FeatureGroups \u2b95 Models \u2b95 Endpoints \u2b95 App\n
When building out a framework to support ML Pipelines there are three main options: Self Hosted, Private SaaS, and Public SaaS.
The other choice, that we're not going to cover here, is whether you use AWS, Azure, GCP, or something else. SageWorks is architected and powered by a broad and rich set of AWS ML Pipeline services. We believe that AWS provides the best set of functionality and APIs for flexible, real world ML architectures.
"},{"location":"enterprise/private_saas/#resources","title":"Resources","text":"See our full presentation on the SageWorks Private SaaS Architecture
"},{"location":"enterprise/project_branding/","title":"Project Branding","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Project Branding allows you to change page headers, titles, and logos to match your project. All user interfaces will reflect your project name and company logos.
"},{"location":"enterprise/project_branding/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"enterprise/themes/","title":"SageWorks Themes","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Themes allows you to customize the User Interfaces to suit your preferences, including completely customized color palettes and fonts. We offer a set of default 'dark' and 'light' themes, but we'll also customize the theme to match your company's color palette and logos.
"},{"location":"enterprise/themes/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"getting_started/","title":"Getting Started","text":"For the initial setup of SageWorks we'll be using the SageWorks REPL. When you start sageworks
it will recognize that it needs to complete the initial configuration and will guide you through that process.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"getting_started/#initial-setupconfig","title":"Initial Setup/Config","text":"Notes: SageWorks uses your existing AWS account/profile/SSO. So if you don't already have an AWS Profile or SSO Setup you'll need to do that first AWS Setup
Okay so after you've completed your AWS Setup you can now install SageWorks.
> pip install sageworks\n> sageworks <-- This starts the REPL\n\nWelcome to SageWorks!\nLooks like this is your first time using SageWorks...\nLet's get you set up...\nAWS_PROFILE: my_aws_profile\nSAGEWORKS_BUCKET: my-company-sageworks\n[optional] REDIS_HOST(localhost): my-redis.cache.amazon (or leave blank)\n[optional] REDIS_PORT(6379):\n[optional] REDIS_PASSWORD():\n[optional] SAGEWORKS_API_KEY(open_source): my_api_key (or leave blank)\n
That's It: You're now all set. This configuration only needs to be done ONCE :)"},{"location":"getting_started/#data-scientistsengineers","title":"Data Scientists/Engineers","text":"For companies that are setting up SageWorks on an internal AWS Account: Company AWS Setup
"},{"location":"getting_started/#additional-resources","title":"Additional Resources","text":"AWS Glue Simplified
AWS Glue Jobs are a great way to automate ETL and data processing. SageWorks takes all the hassle out of creating and debugging Glue Jobs. Follow this guide and empower your Glue Jobs with SageWorks!
SageWorks makes creating, testing, and debugging of AWS Glue Jobs easy. The exact same SageWorks API Classes are used in your Glue Jobs. Also, since SageWorks manages the roles for both API and Glue Jobs, you'll be able to test new Glue Jobs locally and minimize surprises when deploying your Glue Job.
"},{"location":"glue/#glue-job-setup","title":"Glue Job Setup","text":"Setting up a AWS Glue Job that uses SageWorks is straight forward. SageWorks can be 'installed' on AWS Glue via the --additional-python-modules
parameter and then you can use the SageWorks API just like normal.
Here are the settings and a screenshot to guide you. There are several ways to set up and run Glue Jobs, with either the SageWorks-ExecutionRole or using the SageWorksAPIPolicy. Please feel free to contact SageWorks support if you need any help with setting up Glue Jobs.
Glue IAM Role Details
If your Glue Jobs already use an existing IAM Role then you can add the SageWorksAPIPolicy
to that Role to enable the Glue Job to perform SageWorks API Tasks.
Anyone familiar with a typical Glue Job should be pleasantly surprised by how simple the example below is. Also, SageWorks allows you to test Glue Jobs locally using the same code that you use for scripts and Notebooks (see Glue Testing).
Glue Job Arguments
AWS Glue Jobs take arguments in the form of Job Parameters (see screenshot above). There's a SageWorks utility function get_resolved_options
that turns these Job Parameters into a nice dictionary for ease of use.
import sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import get_resolved_options\n\n# Convert Glue Job Args to a Dictionary\nglue_args = get_resolved_options(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"sageworks-bucket\"])\n\n# Create a new Data Source from an S3 Path\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\nmy_data = DataSource(source_path, name=\"abalone_glue_test\")\n
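For a sense of what "turning Job Parameters into a dictionary" means, here is a minimal stand-alone sketch. It is not the SageWorks get_resolved_options implementation: parse_job_args is a hypothetical helper, and it assumes that parameters arrive on the command line as `--key value` pairs.

```python
def parse_job_args(argv: list) -> dict:
    """Collect '--key value' pairs from an argv-style list into a dictionary.

    Hypothetical helper for illustration only; flags without a value are skipped.
    """
    args = {}
    i = 1  # argv[0] is the script name
    while i < len(argv):
        token = argv[i]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        if token.startswith("--") and has_value:
            args[token[2:]] = argv[i + 1]  # strip the leading '--'
            i += 2
        else:
            i += 1
    return args


# Example argv, shaped like the local test invocation of a Glue Job
example_argv = ["my_glue_job.py", "--sageworks-bucket", "my-company-sageworks"]
print(parse_job_args(example_argv))
```

Keeping the dashes inside the key names (for example "sageworks-bucket") matches how the examples in this guide index into glue_args.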
"},{"location":"glue/#glue-example-2","title":"Glue Example 2","text":"This example takes two 'Job Parameters'
The example will convert all CSV files in an S3 bucket/prefix and load them up as DataSources in SageWorks.
# examples/glue_load_s3_bucket.py\nimport sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import get_resolved_options, list_s3_files\n\n# Convert Glue Job Args to a Dictionary\nglue_args = get_resolved_options(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"sageworks-bucket\"])\n\n# List all the CSV files in the given S3 Path\ninput_s3_path = glue_args[\"input-s3-path\"]\nfor input_file in list_s3_files(input_s3_path):\n\n # Note: If we don't specify a name, one will be 'auto-generated'\n my_data = DataSource(input_file, name=None)\n
"},{"location":"glue/#exception-log-forwarding","title":"Exception Log Forwarding","text":"When a Glue Job crashes (has an exception), the AWS console will show you the last line of the exception, this is mostly useless. If you use SageWorks log forwarding the exception/stack will be forwarded to CloudWatch.
from sageworks.utils.sageworks_logging import exception_log_forward\n\nwith exception_log_forward():\n <my glue code>\n ...\n <exception happens>\n <more of my code>\n
The exception_log_forward
sets up a context manager that will trap exceptions and forward the exception/stack to CloudWatch for diagnosis. "},{"location":"glue/#glue-job-local-testing","title":"Glue Job Local Testing","text":"Glue Power without the Pain. SageWorks manages the AWS Execution Role, so local API and Glue Jobs will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Glue Jobs a breeze.
export SAGEWORKS_CONFIG=<your config> # Only if not already set up\npython my_glue_job.py --sageworks-bucket <your bucket>\n
"},{"location":"glue/#additional-resources","title":"Additional Resources","text":"SageWorks Lambda Layers
AWS Lambda Jobs are a great way to spin up data processing jobs. Follow this guide and empower AWS Lambda with SageWorks!
SageWorks makes creating, testing, and debugging of AWS Lambda Functions easy. The exact same SageWorks API Classes are used in your AWS Lambda Functions. Also, since SageWorks manages the access policies, you'll be able to test new Lambda Jobs locally and minimize surprises when deploying.
Work In Progress
The SageWorks Lambda Layers are a great way to use SageWorks but they are still in 'beta' mode so please let us know if you have any issues.
"},{"location":"lambda_layer/#lambda-job-setup","title":"Lambda Job Setup","text":"Setting up a AWS Lambda Job that uses SageWorks is straight forward. SageWorks can be 'installed' using a Lambda Layer and then you can use the Sageworks API just like normal.
Here are the ARNs for the current SageWorks Lambda Layers. Please note that they are specified with the region and Python version in the name, so if your lambda is us-east-1, python 3.12, pick the ARN with those values in it.
"},{"location":"lambda_layer/#python-312-if-you-need-another-versionregion-let-us-know","title":"Python 3.12 (if you need another version/region let us know)","text":"us-east-1
us-west-2
Note: If you're using lambdas in a different region or with a different Python version, just let us know and we'll publish some additional layers.
At the bottom of the Lambda page there's an 'Add Layer' button. You can click that button and specify the layer using the ARN above. Also in the 'General Configuration' set these parameters:
Set the SAGEWORKS_BUCKET ENV: SageWorks will need to know what bucket to work out of, so go into the Configuration...Environment Variables... and add one for the SageWorks bucket that you are using for your AWS Account (dev, prod, etc).
Lambda Role Details
If your Lambda Function already uses an existing IAM Role then you can add the SageWorks policies to that Role to enable the Lambda Job to perform SageWorks API Tasks. See SageWorks Access Controls
"},{"location":"lambda_layer/#sageworks-lambda-example","title":"SageWorks Lambda Example","text":"Here's a simple example of using SageWorks in your Lambda Function.
SageWorks Layer is Compressed
The SageWorks Lambda Layer is compressed (to fit all the awesome). This means that the load_lambda_layer()
method must be called before using any other SageWorks imports, see the example below. If you do not do this you'll probably get a No module named 'numpy'
error or something like that.
import json\nfrom pprint import pprint\nfrom sageworks.utils.lambda_utils import load_lambda_layer\n\n# Load/Decompress the SageWorks Lambda Layer\nload_lambda_layer()\n\n# After 'load_lambda_layer()' we can use other SageWorks imports\nfrom sageworks.api import Meta\nfrom sageworks.api import Model \n\ndef lambda_handler(event, context):\n\n # Create our Meta Class and get a list of our Models\n meta = Meta()\n models = meta.models()\n\n print(f\"Number of Models: {len(models)}\")\n print(models)\n\n # Onboard a model\n model = Model(\"abalone-regression\")\n pprint(model.details())\n\n # Return success\n return {\n 'statusCode': 200,\n 'body': { \"incoming_event\": event}\n }\n
"},{"location":"lambda_layer/#exception-log-forwarding","title":"Exception Log Forwarding","text":"When a Lambda Job crashes (has an exception), the AWS console will show you the last line of the exception, this is mostly useless. If you use SageWorks log forwarding the exception/stack will be forwarded to CloudWatch.
from sageworks.utils.sageworks_logging import exception_log_forward\n\nwith exception_log_forward():\n <my lambda code>\n ...\n <exception happens>\n <more of my code>\n
The exception_log_forward
sets up a context manager that will trap exceptions and forward the exception/stack to CloudWatch for diagnosis. "},{"location":"lambda_layer/#lambda-function-local-testing","title":"Lambda Function Local Testing","text":"Lambda Power without the Pain. SageWorks manages the AWS Execution Role/Policies, so local API and Lambda Functions will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Lambda Functions a breeze.
python my_lambda_function.py --sageworks-bucket <your bucket>\n
"},{"location":"lambda_layer/#additional-resources","title":"Additional Resources","text":"Using SageWorks for ML Pipelines: SageWorks API Classes
Consulting Available: SuperCowPowers LLC
Artifact and Column Naming?
You might have noticed that SageWorks has some unintuitive constraints when naming Artifacts and restrictions on column names. All of these restrictions come from AWS. SageWorks uses Glue, Athena, Feature Store, Models and Endpoints, and each of these services has its own constraints; SageWorks simply 'reflects' those constraints.
"},{"location":"misc/faq/#naming-underscores-dashes-and-lower-case","title":"Naming: Underscores, Dashes, and Lower Case","text":"Data Sources and Feature Sets must adhere to AWS restrictions on table names and columns names (here is a snippet from the AWS documentation)
Database, table, and column names
When you create schema in AWS Glue to query in Athena, consider the following:
A database name cannot be longer than 255 characters. A table name cannot be longer than 255 characters. A column name cannot be longer than 255 characters.
The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more info see: Glue Best Practices
"},{"location":"misc/faq/#datasourcefeatureset-use-_-and-modelendpoint-use-","title":"DataSource/FeatureSet use '_' and Model/Endpoint use '-'","text":"You may notice that DataSource and FeatureSet uuid/name examples have underscores but the model and endpoints have dashes. Yes, it\u2019s super annoying to have one convention for DataSources and FeatureSets and another for Models and Endpoints but this is an AWS restriction and not something that SageWorks can control.
DataSources and FeatureSets: Underscores. You cannot use a dash because both classes use Athena for Storage and Athena table names cannot have a dash.
Models and Endpoints: Dashes. You cannot use underscores because AWS imposes a restriction on the naming.
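The two conventions can be checked before an artifact is created. The sketch below is only an illustration of the rules quoted above (lowercase letters, numbers, and underscores, at most 255 characters, for table and column names; dashes instead of underscores for Models and Endpoints). Both function names are hypothetical and are not part of the SageWorks API.

```python
import re

# Rule quoted from the AWS documentation: lowercase letters, numbers, underscore; max 255 chars
TABLE_NAME_PATTERN = re.compile(r"[a-z0-9_]{1,255}")


def is_valid_table_name(name: str) -> bool:
    """Return True if the name fits the Glue/Athena rule quoted above (hypothetical helper)."""
    return TABLE_NAME_PATTERN.fullmatch(name) is not None


def to_model_name(name: str) -> str:
    """Swap underscores for dashes, the convention used for Models and Endpoints."""
    return name.replace("_", "-")


print(is_valid_table_name("abalone_features"))   # valid DataSource/FeatureSet style name
print(is_valid_table_name("Abalone-Features"))   # upper case and a dash: not valid
print(to_model_name("abalone_regression"))
```

Deriving the Model or Endpoint name from the FeatureSet name this way keeps related artifacts recognizable despite the differing separators.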
"},{"location":"misc/faq/#additional-information-on-the-lower-case-issue","title":"Additional information on the lower case issue","text":"We\u2019ve tried to create a glue table with Mixed Case column names and haven\u2019t had any luck. We\u2019ve bypassed wrangler and used the boto3 low level calls directly. In all cases when it shows up in the Glue Table the columns have always been converted to lower case. We've also tried uses the Athena DDL directly, that also doesn't work. Here's the relevant AWS documentation and the two scripts that reproduce the issue.
AWS Docs
Scripts to Reproduce
SageWorks is a medium granularity framework that manages and aggregates AWS\u00ae Services into classes and concepts. When you use SageWorks you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and managing a complex set of AWS Services. All the power and none of the pain so that your team can Do Science Faster!
"},{"location":"misc/general_info/#sageworks-documentation","title":"SageWorks Documentation","text":"See our Python API and AWS documentation here: SageWorks Documentation
"},{"location":"misc/general_info/#full-sageworks-overview","title":"Full SageWorks OverView","text":"SageWorks Architected FrameWork
"},{"location":"misc/general_info/#why-sageworks","title":"Why SageWorks?","text":"Visibility into the AWS Services that underpin the SageWorks Classes. We can see that SageWorks automatically tags and tracks the inputs of all artifacts providing 'data provenance' for all steps in the AWS modeling pipeline.
Image TBD
Clearly illustrated: SageWorks provides intuitive and transparent visibility into the full pipeline of your AWS SageMaker Deployments.
"},{"location":"misc/general_info/#getting-started","title":"Getting Started","text":"The SageWorks Classes are organized to work in concert with AWS Services. For more details on the current classes and class hierarchies see SageWorks Classes and Concepts.
"},{"location":"misc/general_info/#contributions","title":"Contributions","text":"If you'd like to contribute to the SageWorks project, you're more than welcome. All contributions will fall under the existing project license. If you are interested in contributing or have questions please feel free to contact us at sageworks@supercowpowers.com.
"},{"location":"misc/general_info/#sageworks-alpha-testers-wanted","title":"SageWorks Alpha Testers Wanted","text":"Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.
"},{"location":"misc/sageworks_classes_concepts/","title":"SageWorks Classes and Concepts","text":"A flexible, rapid, and customizable AWS\u00ae ML Sandbox. Here's some of the classes and concepts we use in the SageWorks system:
Endpoint
Transforms
Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
"},{"location":"misc/scp_consulting/#typical-engagements","title":"Typical Engagements","text":"SageWorks clients typically want a tailored web_interface that helps to drive business decisions and provides value for their organization.
Rapid Prototyping is typically done via these steps.
Quick Construction of Web Interface (tailored)
Goto Step 1
When the client is happy/excited about the prototype we then bolt down the system, test the heavy paths, review AWS access and security, and ensure 'least privilege' roles and policies.
Contact us for a free initial consultation on how we can accelerate the use of AWS ML at your company: sageworks@supercowpowers.com.
"},{"location":"plugins/","title":"OverView","text":"SageWorks Plugins
The SageWorks toolkit provides a flexible plugin architecture to expand, enhance, or even replace the Dashboard. Make custom UI components, views, and entire pages with the plugin classes described here.
The SageWorks Plugin system allows clients to customize how their AWS Machine Learning Pipeline is displayed, analyzed, and visualized. Our easy to use Python API enables developers to make new Dash/Plotly components, data views, and entirely new web pages focused on business use cases.
"},{"location":"plugins/#concept-docs","title":"Concept Docs","text":"Many classes in SageWorks need additional high-level material that covers class design and illustrates class usage. Here's the Concept Docs for Plugins:
Each plugin class inherits from the SageWorks PluginInterface class and needs to set two attributes and implement two methods. These requirements are set so that each Plugin will conform to the SageWorks infrastructure; if the required attributes and methods aren\u2019t included in the class definition, errors will be raised during tests and at runtime.
Note: For full code see Model Plugin Example
class ModelPlugin(PluginInterface):\n \"\"\"MyModelPlugin Component\"\"\"\n\n \"\"\"Initialize this Plugin Component \"\"\"\n auto_load_page = PluginPage.MODEL\n plugin_input_type = PluginInputType.MODEL\n\n def create_component(self, component_id: str) -> dcc.Graph:\n \"\"\"Create the container for this component\n Args:\n component_id (str): The ID of the web component\n Returns:\n dcc.Graph: The EndpointTurbo Component\n \"\"\"\n self.component_id = component_id\n self.container = dcc.Graph(id=component_id, ...)\n\n # Fill in plugin properties\n self.properties = [(self.component_id, \"figure\")]\n\n # Return the container\n return self.container\n\n def update_properties(self, model: Model, **kwargs) -> list:\n \"\"\"Update the properties for the plugin.\n\n Args:\n model (Model): An instantiated Model object\n **kwargs: Additional keyword arguments\n\n Returns:\n list: A list of the updated property values\n \"\"\"\n\n # Create a pie chart with the endpoint name as the title\n pie_figure = go.Figure(data=..., ...)\n\n # Return the updated property values for the plugin\n return [pie_figure]\n
"},{"location":"plugins/#required-attributes","title":"Required Attributes","text":"The class variable plugin_page determines what type of plugin the MyPlugin class is. This variable is inspected during plugin loading at runtime in order to load the plugin to the correct artifact page in the Sageworks dashboard. The PluginPage class can be DATA_SOURCE, FEATURE_SET, MODEL, or ENDPOINT.
"},{"location":"plugins/#s3-bucket-plugins-work-in-progress","title":"S3 Bucket Plugins (Work in Progress)","text":"Note: This functionality is coming soon
Offers the most flexibility and fast prototyping. Simply set your config/env for blah to an S3 Path and SageWorks will load the plugins from S3 directly.
Helpful Tip
You can copy files from your local system up to S3 with this handy AWS CLI call
aws s3 cp . s3://my-sageworks/sageworks_plugins \\\n --recursive --exclude \"*\" --include \"*.py\"\n
"},{"location":"plugins/#additional-resources","title":"Additional Resources","text":"Need help with plugins? Want to develop a customized application tailored to your business needs?
The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"presentations/#sageworks-presentations_1","title":"SageWorks Presentations","text":"The SageWorks API documentation SageWorks API covers our in-depth Python API and contains code examples. The code examples are provided in the Github repo examples/
directory. For a full code listing of any example please visit our SageWorks Examples
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates
"},{"location":"release_notes/0_7_8/","title":"Release 0.7.8","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on [Discord](https://discord.gg/WHAJuz8sw8)
Since we've recently introduced a View() class for DataSources and FeatureSets we needed to rename a few classes/modules.
"},{"location":"release_notes/0_7_8/#featuresets","title":"FeatureSets","text":"For setting holdout ids we've changed/combined to just one method set_training_holdouts()
, so if you're using create_training_view()
or set_holdout_ids()
you can now just use the unified method set_training_holdouts()
.
There's also a change to getting the training view table method.
old: fs.get_training_view_table(create=False)\nnew: fs.get_training_view_table(), does not need the create=False\n
"},{"location":"release_notes/0_7_8/#models","title":"Models","text":"inference_predictions() --> get_inference_predictions()\n
"},{"location":"release_notes/0_7_8/#webplugins","title":"Web/Plugins","text":"We've changed the Web/UI View class to 'WebView'. So anywhere where you used to have view just replace with web_view
from sageworks.views.artifacts_view import ArtifactsView\n
is now from sageworks.web_views.artifacts_web_view import ArtifactsWebView\n
"},{"location":"release_notes/0_7_8/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"release_notes/0_8_11/","title":"Release 0.8.11","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover all the changes from 0.8.8
to 0.8.11
The AWSAccountClamp had too many responsibilities so that class has been split up into two classes and a set of utilities:
Most of these API changes apply to both DataSources and FeatureSets. We're using a FeatureSet (fs) in the examples below, but the same changes also apply to DataSources.
Column Names/Table Names
fs.column_names() -> fs.columns\nfs.get_table_name() -> fs.table_name\n
Display/Training/Computation Views
In general, methods for FS/DS are now part of the View API. Here's a change list:
fs.get_display_view() -> fs.view(\"display\")\nfs.get_training_view() -> fs.view(\"training\")\nfs.get_display_columns() -> fs.view(\"display\").columns\nfs.get_computation_columns() -> fs.view(\"computation\").columns\nfs.get_training_view_table() -> fs.view(\"training\").table_name\nfs.get_training_data() -> fs.view(\"training\").pull_dataframe()\n
Some FS/DS methods have also been removed
num_display_columns() -> gone\nnum_computation_columns() -> gone\n
Views: Methods that we're Keeping
We're keeping the methods below since they handle some underlying mechanics and serve as nice convenience methods.
ds/fs.set_display_columns()\nds/fs.set_computation_columns()\n
AWSAccountClamp
AWSAccountClamp().boto_session() --> AWSAccountClamp().boto3_session\n
All Classes
If the class previously had a boto_session
attribute, it has been renamed to boto3_session
For sageworks==0.8.8
you needed to be careful about when/where you set your config/ENV vars. With >=0.8.9
you can now use the typical setup like this:
from sageworks.utils.config_manager import ConfigManager\n\n# Set the SageWorks Config\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", args_dict[\"sageworks-bucket\"])\ncm.set_config(\"REDIS_HOST\", args_dict[\"redis-host\"])\n
"},{"location":"release_notes/0_8_11/#robust-modelnotreadyexception-handling","title":"Robust ModelNotReadyException Handling","text":"AWS will 'deep freeze' Serverless Endpoints and if that endpoint hasn't been used for a while it can sometimes take a long time to come up and be ready for inference. SageWorks now properly manages this AWS error condition, it will report the issue, wait 60 seconds, and try again 5 times before raising the exception.
(endpoint_core.py:502) ERROR Endpoint model not ready\n(endpoint_core.py:503) ERROR Waiting and Retrying...\n...\nAfter a while, inference will run successfully :)\n
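The wait-and-retry behavior described above can be sketched in a few lines. This is not the SageWorks code: ModelNotReady is a stand-in exception, call_with_retries is a hypothetical helper, and the sketch assumes that "5 times" means five attempts in total. The sleep function is injectable so the logic can be exercised without actually waiting.

```python
import time


class ModelNotReady(Exception):
    """Stand-in for the 'endpoint model not ready' error condition (hypothetical)."""


def call_with_retries(func, retries: int = 5, wait_seconds: int = 60, sleep=time.sleep):
    """Call func(); on ModelNotReady wait and try again, re-raising after the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except ModelNotReady:
            if attempt == retries:
                raise  # out of attempts: surface the original exception
            print(f"Endpoint model not ready (attempt {attempt}/{retries}), waiting and retrying...")
            sleep(wait_seconds)
```

A caller would wrap the inference call, for example `call_with_retries(lambda: run_inference(df))`, where run_inference is whatever function invokes the endpoint.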
"},{"location":"release_notes/0_8_11/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"release_notes/0_8_20/","title":"Release 0.8.20","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.11
to 0.8.20
The cloud_watch
AWS log aggregator is now officially awesome. It provides a fairly sophisticated way of doing both broad scanning and deep dives on individual streams. Please see our Cloud Watch documentation.
The View classes have finished their refactoring. The 'read' class View()
can be constructed either directly or with the ds/fs.view(\"display\")
methods. See Views for more details. There is also a set of classes for constructing views; please see View Overview.
Table Name attribute
The table_name
attribute/property has been replaced with just table
ds.table_name -> ds.table\nfs.table_name -> fs.table\nview.table_name -> view.table\n
Endpoint Confusion Matrix
The endpoint
class had a method called confusion_matrix()
which has been renamed to the more descriptive generate_confusion_matrix()
. Note: The model method, of the same name, has NOT changed.
end.confusion_matrix() -> end.generate_confusion_matrix()\nmodel.confusion_matrix() == no change\n
Fixed: There was a corner case where if you had the following sequence:
set_training_holdouts()
The corner case was a race condition where the FeatureSet would not 'know' that a training view was already there and would create a default training view.
"},{"location":"release_notes/0_8_20/#improvements","title":"Improvements","text":"The log messages that you receive on a plugin validation failure should now be more distinguishable and more informative. They will look like this and in some cases even tell you the line to look at.
ERROR Plugin 'MyPlugin' failed validation:\nERROR File: ../sageworks_plugins/web_components/my_plugin.py\nERROR Class: MyPlugin\nERROR Details: my_plugin.py (line 35): Incorrect return type for update_properties (expected list, got Figure)\n
"},{"location":"release_notes/0_8_20/#internal-api-changes","title":"Internal API Changes","text":"In theory these API changes should not affect end users of the SageWorks API, but they are documented here for completeness.
The internal method used by Artifact subclasses has changed names from ensure_valid_name
to is_name_valid
. We've also introduced an optional argument to turn lowercase enforcement on or off; this will be used later when we support uppercase for models, endpoints, and graphs.
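As a rough illustration of a name check with switchable lowercase enforcement, here is a minimal sketch. The argument name lower_case and the allowed character set (letters, digits, dashes) are assumptions made for this example and may not match the internal SageWorks method.

```python
import re


def is_name_valid(name, lower_case=True):
    '''Return True if `name` uses only letters, digits, and dashes.

    Illustrative rules only, not the SageWorks implementation.
    '''
    if lower_case and name != name.lower():
        return False  # Uppercase characters rejected while enforcement is on
    return re.fullmatch('[A-Za-z0-9-]+', name) is not None
```

Calling is_name_valid('My-Model') returns False, while is_name_valid('My-Model', lower_case=False) returns True.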
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_22/","title":"Release 0.8.22","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.20
to 0.8.22
Mostly bug fixes and minor API changes.
"},{"location":"release_notes/0_8_22/#api-changes","title":"API Changes","text":"Removing target_column
arg when creating FeatureSets
When creating a FeatureSet via DataSource or Pandas Dataframe there was an optional argument for the target_column
. After some discussion we decided to remove this argument. In general FeatureSets
are often used to create multiple models with different targets, so it doesn't make sense to specify a target
at the FeatureSet level.
Changed for both DataSource.to_features()
and the PandasToFeatures()
classes.
Fixed: The SHAP computation was occasionally complaining about the additivity check so we flipped that flag to False
shap_vals = explainer.shap_values(X_pred, check_additivity=False)\n
"},{"location":"release_notes/0_8_22/#improvements","title":"Improvements","text":"The optional requirements for [UI]
now include matplotlib since it will probably be useful in the future.
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_23/","title":"Release 0.8.23","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.22
to 0.8.23
Mostly bug fixes and minor API changes.
"},{"location":"release_notes/0_8_23/#api-changes","title":"API Changes","text":"Removing auto_one_hot
arg from PandasToFeatures
and DataSource.to_features()
When creating a PandasToFeatures
object or using DataSource.to_features()
there was an optional argument auto_one_hot
. This would try to automatically convert object/string columns to be one-hot encoded. In general this was only useful for 'toy' datasets but for more complex data we need to specify exactly which columns we want converted.
Adding optional one_hot_columns
arg to PandasToFeatures.set_input()
and DataSource.to_features()
When calling either of these FeatureSet creation methods you can now add an optional arg one_hot_columns
as a list of columns that you would like to be one-hot encoded.
Our pandas dependency was outdated and causing an issue with an include_groups
arg when outlier groups were computed. We've changed the requirements:
pandas>=2.1.2\nto\npandas>=2.2.1\n
We also have a ticket for the logic change so that we avoid the deprecation warning."},{"location":"release_notes/0_8_23/#improvements","title":"Improvements","text":"The time to ingest
new rows into a FeatureSet can take a LONG time. Calling the FeatureGroup AWS API and waiting on the results is what takes all the time.
We hope to make a series of optimizations around this process; the first one is simply increasing the number of workers/processes for the ingestion manager class.
feature_group.ingest(..., max_processes=8)\n(has been changed to)\nfeature_group.ingest(..., max_processes=16, num_workers=4)\n
"},{"location":"release_notes/0_8_23/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_27/","title":"Release 0.8.27","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.23
to 0.8.27
KNNSpider() --> FeatureSpaceProximity()
If you were previously using the KNNSpider
that class has been replaced with FeatureSpaceProximity
. The API is also a bit different; please see the documentation on the FeatureSpaceProximity Class.
The model scripts used in deployed AWS Endpoints now match column names case-insensitively. This should make the endpoints more flexible, since End-User Applications can hit them with less pre-processing of their column names.
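To illustrate what case-insensitive column handling can look like, here is a minimal sketch in plain Python. The helper name match_columns is hypothetical and the actual endpoint model scripts may work differently.

```python
def match_columns(expected, incoming):
    '''Map each expected column to the incoming name that matches it, ignoring case.'''
    by_lower = {name.lower(): name for name in incoming}
    missing = [col for col in expected if col.lower() not in by_lower]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return {col: by_lower[col.lower()] for col in expected}
```

The returned mapping can then be used to rename the incoming columns to the names the model was trained with.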
CloudWatch default buffers have been increased to 60 seconds, as we appear to have been hitting some AWS limits when running 10 concurrent Glue jobs.
"},{"location":"release_notes/0_8_27/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_29/","title":"Release 0.8.29","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.27
to 0.8.29
Locking AWS Model Training Image: AWS will randomly update the images associated with training and model registration. In particular the SKLearn Estimator has been updated into a non-working state for our use cases. So for both training and registration we're now explicitly specifying the image that we want to use.
self.estimator = SKLearn(\n ...\n framework_version=\"1.2-1\",\n image_uri=image, # New\n )\n
"},{"location":"release_notes/0_8_29/#api-changes","title":"API Changes","text":"delete() --> class.delete(uuid)
We've changed the API for deleting artifacts in AWS (DataSource, FeatureSet, etc). This is part of our efforts to minimize race-conditions when objects are deleted.
my_model = Model(\"xyz\") # Creating object\nmy_model.delete() # just to delete\n\n<Now just one line>\nModel.delete(\"xyz\") # Delete\n
Bulk Delete: Added a Bulk Delete utility
from sageworks.utils.bulk_utils import bulk_delete\n\ndelete_list = [(\"DataSource\", \"abc\"), (\"FeatureSet\", \"abc_features\")]\nbulk_delete(delete_list)\n
"},{"location":"release_notes/0_8_29/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_33/","title":"Release 0.8.33","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.29
to 0.8.33
Replaced WatchTower Code: We had lots of issues with WatchTower on Glue/Lambda, and its use of forks/threads was overkill for our logging needs, so we simply replaced the code with boto3 put_log_events()
calls and some simple token handling and buffering.
None
"},{"location":"release_notes/0_8_33/#improvementsfixes","title":"Improvements/Fixes","text":"DataSource from DataFrame: When creating a DataSource from a Pandas Dataframe, the internal transform()
was not deleting the existing DataSource (if it existed).
ROCAUC on subset of classes: When running inference on input data that only had a subset of the classification labels (e.g. rows only had \"low\" and \"medium\" when the model was trained on \"low\", \"medium\", \"high\"), the input to ROCAUC needed to be adjusted so that ROCAUC doesn't crash. When this case happens we return proper defaults based on the scikit-learn docs.
"},{"location":"release_notes/0_8_33/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_35/","title":"Release 0.8.35","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.33
to 0.8.35
SageWorks REPL: The REPL now has a workaround for the current iPython embedded shell namespace scoping issue. See: iPython Embedded Shell Scoping Issue. So this pretty much means the REPL is 110% more awesome now!
"},{"location":"release_notes/0_8_35/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_35/#improvementsfixes","title":"Improvements/Fixes","text":"AWS Service Broker: The AWS service broker was overly dramatic when it pulled meta data for something that had just been deleted (or partially deleted): it threw CRITICAL log messages. We've refined the AWS error handling so that it's more granular about the error_codes; Validation and ResourceNotFound exceptions are now reduced to WARNINGS.
ROCAUC modifications: Version 0.8.33
put in quite a few changes, for 0.8.35
we've also added logic to both validate and ensure proper order of the probability columns with the class labels.
Code Diff v0.8.33 --> v0.8.35
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a cow with no legs? ........Ground beef.
"},{"location":"release_notes/0_8_35/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_36/","title":"Release 0.8.36","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.35
to 0.8.36
Fast Inference: The current inference method for endpoints provides error handling, metrics calculations and capture mechanics. There are use cases where the inference needs to happen as fast as possible without all the additional features. So we've added a fast_inference()
method that streamlines the calls to the endpoint.
end = Endpoint(\"my_endpoint\")\nend.inference(df) # Metrics, Capture, Error Handling\nWall time: 5.07 s\n\nend.fast_inference(df) # No frills, but Fast!\nWall time: 308 ms\n
"},{"location":"release_notes/0_8_36/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_36/#improvementsfixes","title":"Improvements/Fixes","text":"Version Update Check: Added functionality that checks the current SageWorks version against the latest release and logs a message when an update is available.
ROCAUC modifications: Functionality now includes 'per label' rocauc calculation along with label order and alignment from previous versions.
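For the Version Update Check, the comparison step can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, how SageWorks looks up the latest release is not shown, and only plain dotted numeric versions are handled.

```python
def parse_version(version):
    '''Turn a version like '0.8.36' into a tuple of ints for comparison.'''
    return tuple(int(part) for part in version.split('.'))


def update_available(current, latest):
    '''Return True when `latest` is newer than `current`.'''
    return parse_version(latest) > parse_version(current)
```

Comparing integer tuples rather than raw strings matters here: as strings, '0.8.6' sorts after '0.8.58', even though 0.8.58 is the newer release.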
"},{"location":"release_notes/0_8_36/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.35 --> v0.8.36
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What\u2019s a cow\u2019s best subject in school? ......Cow-culus.
"},{"location":"release_notes/0_8_36/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_39/","title":"Release 0.8.39","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.36
to 0.8.39
Just a small set of error handling and bug fixes.
"},{"location":"release_notes/0_8_39/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_39/#improvementsfixes","title":"Improvements/Fixes","text":"Scatter Plot: Fixed a corner case where the hover columns included AWS generated fields.
Athena Queries: Put in additional error handling and retries when looking for and querying Athena/Glue Catalogs. These changes affect both DataSources and FeatureSets (which have DataSources internally for offline storage).
FeatureSet Creation: Put in additional error handling and retries when pulling AWS meta data for FeatureSets (and internal DataSources).
"},{"location":"release_notes/0_8_39/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.36 --> v0.8.39
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_39/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_42/","title":"Release 0.8.42","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.39
to 0.8.42
Artifact deletion got a substantial overhaul. The 4 main classes received internal code changes for how they get deleted. Specifically, deletion is now handled via a class method that allows an artifact to be deleted without instantiating an object. The API for deletion is actually more flexible now; please see API Changes below.
"},{"location":"release_notes/0_8_42/#api-changes","title":"API Changes","text":"Artifact Deletion
The API for Artifact deletion is more flexible. If you already have an instantiated object, you can simply call delete()
on it. If you're deleting objects in bulk/batch mode, you can call the class method managed_delete()
, see code example below.
fs = FeatureSet(\"my_fs\")\nfs.delete() # Used for notebooks, scripts, etc.. \nOR\nFeatureSet.managed_delete(\"my_fs\") # Bulk/batch/internal use\n\n<Same API for DataSources, Models, and Endpoints>\n
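One way such a dual delete API can be structured is sketched below. The Artifact class and its in-memory registry are hypothetical stand-ins so the example runs without AWS; the real classes perform AWS calls instead.

```python
class Artifact:
    '''Hypothetical stand-in for a SageWorks artifact class.'''

    _registry = {'my_fs'}  # Pretend these names exist in AWS

    def __init__(self, name):
        self.name = name

    def delete(self):
        # Instance-level delete simply forwards to the class method
        self.managed_delete(self.name)

    @classmethod
    def managed_delete(cls, name):
        # No object construction needed: work directly from the name
        cls._registry.discard(name)
```

Because delete() forwards to managed_delete(), both entry points share one code path.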
Note: Internally these use the same functionality, the dual API is simply for ease-of-use."},{"location":"release_notes/0_8_42/#improvementsfixes","title":"Improvements/Fixes","text":"Race Conditions
In theory, the change to a class-based delete will reduce race conditions where an object would try to create itself (just to be deleted) while the AWS Service Broker was encountering partially created (or partially deleted) objects and would barf error messages.
Slightly Better Throttling Logic
The AWS Throttles have been 'tuned' a bit to back off a bit faster and also not retry the list_tags request when the ARN isn't found.
"},{"location":"release_notes/0_8_42/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.39 --> v0.8.42
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_42/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_46/","title":"Release 0.8.46","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.42
to 0.8.46
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We're starting to put in deprecation warnings as we streamline classes and APIs. If you're using a class or method that's going to be deprecated you'll see a log message like this:
my_class = SomeOldClass()\nWARNING SomeOldClass is deprecated and will be removed in version 0.9.\n
In general these warning messages will be annoying, but they will help us smoothly transition and streamline our Classes and APIs.
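A deprecation warning of this kind can be produced with a small class decorator. The sketch below is an assumption about how one might do it, not the SageWorks code: the decorator name deprecated and the logger name are hypothetical.

```python
import functools
import logging

log = logging.getLogger('sageworks')  # Logger name is an assumption


def deprecated(version):
    '''Class decorator that logs a warning whenever the class is instantiated.'''

    def decorator(cls):
        original_init = cls.__init__

        @functools.wraps(original_init)
        def new_init(self, *args, **kwargs):
            log.warning(f'{cls.__name__} is deprecated and will be removed in version {version}.')
            original_init(self, *args, **kwargs)

        cls.__init__ = new_init
        return cls

    return decorator


@deprecated(version='0.9')
class SomeOldClass:
    def __init__(self):
        self.ready = True
```

Instantiating SomeOldClass() still works as before; the only change is the extra WARNING log line.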
"},{"location":"release_notes/0_8_46/#deprecations","title":"Deprecations","text":"Meta()
The new Meta()
class will provide an API that aligns with the AWS list
and describe
API. We'll have functionality for listing objects (models, feature sets, etc) and then functionality around the details for a named artifact.
meta = Meta()\nmodels_list = meta.models() # List API\nend_list = meta.endpoints() # List API\n\nfs_dict = meta.feature_set(\"my_fs\") # Describe API\nmodel_dict = meta.model(\"my_model\") # Describe API\n
For more details see: Meta Class
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
Artifact Classes
The artifact classes (DataSource, FeatureSet, Model, Endpoint) have had some old arguments removed.
DataSource(force_refresh=True) -> Gone (remove it)\nFeatureSet(force_refresh=True) -> Gone (remove it)\nModel(force_refresh=True) -> Gone (remove it)\nModel(legacy=True) -> Gone (remove it)\n
"},{"location":"release_notes/0_8_46/#improvements","title":"Improvements","text":"Scalability
The changes to caching and the Meta() class should allow better horizontal scaling. We'll flex out the stress tests for upcoming releases before 0.9.0
.
Table Names starting with Numbers
Some of the Athena queries didn't properly escape the table names, so if you created a DataSource/FeatureSet with a name that started with a number the query would fail. Fixed now. :)
"},{"location":"release_notes/0_8_46/#internal-changes","title":"Internal Changes","text":"Meta()
Meta()
doesn't do any caching now. If you want to use Caching as part of your meta data retrieval use the CachedMeta()
class.
Artifacts
We've gotten rid of most (soon all) caching for individual Artifacts. If you're constructing an artifact object, you probably want detailed information that's 'up to date', and waiting a bit is probably fine. Note: We'll still make these instantiations as fast as we can; removing the caching logic will at least simplify the implementations.
"},{"location":"release_notes/0_8_46/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.42 --> v0.8.46
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_46/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_50/","title":"Release 0.8.50","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.46
to 0.8.50
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We're going to lock in id_columns when FeatureSets are created. AWS FeatureGroup requires an id column, so this is the best place to do it; see API Changes below.
"},{"location":"release_notes/0_8_50/#featureset-robust-handling-of-training-column","title":"FeatureSet: Robust handling of training column","text":"In the past we haven't supported giving a training column as input data. FeatureSets are read-only, so locking in the training rows is 'suboptimal'. In general you might want to use the FeatureSet for several models with different training/hold_out sets. Now if a FeatureSet detects a training column it will give the following message:
Training column detected: Since FeatureSets are read only, SageWorks \ncreates training views that can be dynamically changed. We'll use \nthis training column to create a training view.\n
"},{"location":"release_notes/0_8_50/#endpoint-auto_inference","title":"Endpoint: auto_inference()","text":"We're changing the internal logic for the auto_inference()
method to include the id_column in its output.
FeatureSet
When creating a FeatureSet the id_column
is now a required argument.
ds = DataSource(\"test_data\")\nfs = ds.to_features(\"test_features\", id_column=\"my_id\") <-- Required\n
to_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"my_id\") <-- Required\nto_features.set_output_tags([\"blah\", \"whatever\"])\nto_features.transform()\n
If your data doesn't have an id column you can specify \"auto\"
to_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"auto\") <-- Auto Id (index)\n
For more details see: FeatureSet Class
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
"},{"location":"release_notes/0_8_50/#improvements","title":"Improvements","text":"DFStore
Robust handling of slashes, so now it will 'just work' with various upserts and gets:
```\n# These all give you the /ml/shap_values dataframe\ndf_store.get(\"/ml/shap_values\")\ndf_store.get(\"ml/shap_values\")\ndf_store.get(\"//ml/shap_values\")\n```\n
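The slash handling can be pictured as a small normalization step applied before each lookup. This is a minimal sketch with a hypothetical helper name; the actual DFStore logic may differ.

```python
def normalize_location(location):
    '''Collapse repeated slashes and ensure exactly one leading slash.'''
    parts = [part for part in location.split('/') if part]
    return '/' + '/'.join(parts)
```

All three spellings in the example above normalize to the same location, so they resolve to the same stored dataframe.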
"},{"location":"release_notes/0_8_50/#internal-changes","title":"Internal Changes","text":"There's a whole new directory structure that helps isolate Cloud Platform specific funcitonality.
- sageworks/src\n - core/cloud_platform\n - aws\n - azure\n - gcp\n
DFStore
now uses AWSDFStore
as its concrete implementation class. CachedMeta
and AWSAccountClamp
have had a revamp of their singleton logic. So as part of our v0.9.0 Roadmap we're continuing to revamp caching. We're experimenting with the CachedMeta Class inside the Artifact classes. Caching continues to be challenging for the framework: it's an absolute must for Web Interface/UI performance, and then it needs to get out of the way for batch jobs and the concurrent building of ML pipelines.
"},{"location":"release_notes/0_8_50/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.46 --> v0.8.50
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_50/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_55/","title":"Release 0.8.55","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.50
to 0.8.55
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We got a good suggestion from one of our beta customers to change the training column to use True/False values instead of 1/0. Having boolean values makes semantic sense and makes filtering easier and more intuitive.
"},{"location":"release_notes/0_8_55/#api-changes","title":"API Changes","text":"FeatureSet Queries
Since the training column now contains True/False, any code that queries the training view needs to change.
fs.query(f'SELECT * FROM \"{table}\" where training = 1')\n<changed to>\nfs.query(f'SELECT * FROM \"{table}\" where training = TRUE')\n\nfs.query(f'SELECT * FROM \"{table}\" where training = 0')\n<changed to>\nfs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n
Also dataframe filtering is easier now, so if you have a call to filter the dataframe that also needs to change.
df_train = all_df[all_df[\"training\"] == 1].copy()\n<changed to>\ndf_train = all_df[all_df[\"training\"]].copy()\n\ndf_val = all_df[all_df[\"training\"] == 0].copy()\n<changed to>\ndf_val = all_df[~all_df[\"training\"]].copy()\n
For more details see: Training View
Model Instantiation
We got a request to reduce the time for Model() object instantiation. So we created a new CachedModel()
class that is much faster to instantiate.
%time Model(\"abalone-regression\")\nCPU times: user 227 ms, sys: 19.5 ms, total: 246 ms\nWall time: 2.97 s\n\n%time CachedModel(\"abalone-regression\")\nCPU times: user 8.83 ms, sys: 2.64 ms, total: 11.5 ms\nWall time: 22.7 ms\n
For more details see: CachedModel"},{"location":"release_notes/0_8_55/#improvements","title":"Improvements","text":"SageWorks REPL Onboarding
At some point the onboarding with SageWorks REPL got broken and wasn't properly responding when the user didn't have a complete AWS/SageWorks setup.
"},{"location":"release_notes/0_8_55/#internal-changes","title":"Internal Changes","text":"The decorator for the CachedMeta class did not work properly in Python 3.9, so it had to be slightly refactored.
"},{"location":"release_notes/0_8_55/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.50 --> v0.8.55
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_55/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_58/","title":"Release 0.8.58","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.55
to 0.8.58
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We've created a new set of Cached Classes:
As part of this there's now a sageworks/cached
directory that houses these classes and the CachedMeta
class.
Meta Imports Yes, this changed AGAIN :)
from sageworks.meta import Meta\n<change to>\nfrom sageworks.api import Meta\n
CachedModel Import
from sageworks.api import CachedModel\n<change to>\nfrom sageworks.cached.cached_model import CachedModel\n
For more details see: CachedModel"},{"location":"release_notes/0_8_58/#improvements","title":"Improvements","text":"Dashboard Responsiveness
The whole point of these Cached Classes is to improve Dashboard/Web Interface responsiveness. The Dashboard uses both the CachedMeta and Cached(Artifact) classes to make both overview and drilldowns faster.
Supporting a Client Use Case
There was a use case where a set of plugin pages needed to iterate over all the models to gather and aggregate information. We've supported that use case with a new decorator that avoids overloading AWS and running into throttling issues.
Internal
The Dashboard now refreshes all data every 90 seconds, so if you don't see your new model on the dashboard... just wait longer. :)
"},{"location":"release_notes/0_8_58/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.55 --> v0.8.58
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a nonsense meeting? .... Moo-larkey
"},{"location":"release_notes/0_8_58/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_6/","title":"Release 0.8.6","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines. We've also fixed various corner cases mostly around 'half constructed' AWS artifacts (models/endpoints).
"},{"location":"release_notes/0_8_6/#additional-functionality","title":"Additional Functionality","text":"Model to Endpoint under AWS Throttle
We fixed a corner case where the to_endpoint()
method would fail when not 'knowing' the model input. This happened when AWS was throttling responses and the get_input()
of the Endpoint returned unknown
which caused a NoneType
error when using the 'unknown' model.
Empty Model Package Groups
There are cases where customers might construct a Model Package Group (MPG) container and not put any Model Packages in that Group. SageWorks has assumed that all MPGs would have at least one model package. The current 'support' for empty MPGs treats it as an error condition but the API tries to accommodate the condition and will properly display the model group. The group will indicate that it's 'empty' and provide an alert health icon.
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_60/","title":"Release 0.8.60","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.58
to 0.8.60
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We've now exposed additional functionality and API around adding your own custom models. The new custom model support is documented on the Features to Models page.
"},{"location":"release_notes/0_8_60/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_60/#notes","title":"Notes","text":"Custom models introduce models that don't have model metrics or inference runs, so you'll see a lot of log messages complaining about not finding metrics or inference results. Please just ignore those; we'll put in additional logic to address those cases.
"},{"location":"release_notes/0_8_60/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.58 --> v0.8.60
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a nonsense meeting? .... Moo-larkey
"},{"location":"release_notes/0_8_60/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_71/","title":"Release 0.8.71","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.60
to 0.8.71
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We learned that thread safety is good when using plugin classes. We had a model plugin class that was setting an attribute in one callback and then using that attribute in another callback. This mostly worked until it didn't. Anyway, the Inference Run dropdown box on the Models page now actually works correctly.
"},{"location":"release_notes/0_8_71/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_71/#internal-changes","title":"Internal Changes","text":"When using PandasToFeatures it will overwrite FeatureSets if you give the same name. This behavior is expected. The issue was that it was super eager about doing that and would do it during class initialization, so we've moved that logic to when transform()
is called.
# Create a Feature Set from a DataFrame\ndf_to_features = PandasToFeatures(\"test_features\")\ndf_to_features.set_input(data_df, id_column=\"id\", one_hot_columns=[\"food\"])\ndf_to_features.set_output_tags([\"test\", \"small\"])\ndf_to_features.transform()  # <--- Overwrite happens here\n
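The change described above (no destructive work at construction time, only when transform() runs) can be illustrated with a small stand-in. This is a sketch under stated assumptions: LazyOverwriteTransform and the dict-based store are hypothetical and do not reflect how PandasToFeatures is implemented.

```python
class LazyOverwriteTransform:
    '''Hypothetical stand-in for a transform that replaces an existing output.'''

    def __init__(self, output_name, store):
        # No side effects here: just record what we intend to do
        self.output_name = output_name
        self.store = store
        self.input_rows = None

    def set_input(self, rows):
        self.input_rows = list(rows)

    def transform(self):
        # The overwrite (delete + recreate) only happens when transform() runs
        self.store.pop(self.output_name, None)
        self.store[self.output_name] = self.input_rows


store = {'test_features': ['old row']}
to_features = LazyOverwriteTransform('test_features', store)
before = list(store['test_features'])   # constructing the object left the old data alone
to_features.set_input(['new row 1', 'new row 2'])
to_features.transform()
after = store['test_features']          # replaced only after transform()
```

Keeping the constructor free of side effects means that simply creating the object (for inspection, or in a script that exits early) cannot destroy an existing output.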
"},{"location":"release_notes/0_8_71/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.60 --> v0.8.71
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call that feeling like you\u2019ve done this before? Deja-moo
"},{"location":"release_notes/0_8_71/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_72/","title":"Release 0.8.72","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.71
to 0.8.72
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
For content verification purposes we've added a hash()
method to all of the SageWorks Artifact classes (DataSource, FeatureSet, Model, Endpoint, Graph, etc). Also for DataSources and FeatureSets there is a table_hash()
method that will compute a total hash of all data in the Athena table.
ds = DataSource(\"abalone_data\")\n\nds.modified()\nOut[2]: datetime.datetime(2024, 11, 17, 19, 45, 58, tzinfo=tzlocal())\n\nds.hash()\nOut[3]: '67a9ebb495af573604794aa9c31eded8'\n\nds.table_hash()\nOut[4]: '622f5ddba9d4cad2cf642d1ea5555de9'\n\nfs = FeatureSet(\"test_features\")\n\nfs.hash()\nOut[5]: '1571eee207b72f14bd5065d6c4acdaaf'\n\n# Note: Model/Endpoint hashes will backtrack to model.tar.gz and can be used for validation\nmodel = Model(\"abalone-regression\")\nend = Endpoint(\"abalone-regression-end\")\n\nmodel.get_model_data_url()\nOut[6]: 's3://sagemaker-us-west-2-507740646243/abalone-regression-2024-11-18-03-09/output/model.tar.gz'\n\nmodel.hash()\nOut[7]: '00def9381366cdd062413d0b395ba70c'\n\n# Verify endpoint is using expected model\nend.hash()\nOut[7]: '00def9381366cdd062413d0b395ba70c'\n\n# Realtime endpoint created from the same model\nend = Endpoint(\"abalone-regression-end-rt\")\nend.hash()\nOut[8]: '00def9381366cdd062413d0b395ba70c'\n
Note: You will get a performance warning when running table_hash() on DataSources and FeatureSets as it typically involves a deeper computation on the table contents of that artifact.
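How SageWorks computes these hashes isn't described here. As a rough sketch of content verification in general, assuming an MD5-style hex digest (which would match the 32-character values above, but is still an assumption), two artifacts backed by the same bytes produce the same digest. The helper content_hash is hypothetical.

```python
import hashlib


def content_hash(chunks):
    '''Hypothetical helper: hex digest over an iterable of byte chunks.'''
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


model_bytes = [b'model-weights', b'model-config']
model_hash = content_hash(model_bytes)
endpoint_hash = content_hash(model_bytes)        # endpoint backed by the same artifact
other_hash = content_hash([b'different-model'])  # any change in content changes the digest
same_model = model_hash == endpoint_hash
```

This is the same comparison the Model/Endpoint example makes: matching digests indicate the endpoint is serving the expected model artifact.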
"},{"location":"release_notes/0_8_72/#api-changes","title":"API Changes","text":"get_database()
has a deprecation warning, it's replaced with just the database
property.
ds.get_database()\n<replaced by>\nds.database\n
Added the hash()
method to Artifacts (see above).
table_hash()
method added to DataSources and FeatureSets (see above).
"},{"location":"release_notes/0_8_72/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.71 --> v0.8.72
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call that feeling like you\u2019ve done this before? Deja-moo
"},{"location":"release_notes/0_8_72/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_8/","title":"Release 0.8.8","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"release_notes/0_8_8/#additional-functionality","title":"Additional Functionality","text":"Auto Inference name change
When auto_inference is run on an endpoint the name of that inference run is currently training_holdout
. That is too close to model_training
and is confusing. So we're going to change the name to auto_inference
which is way more explanatory and intuitive.
Porting plugins: There should really not be any hard coding for training_holdout
, plugins should just call list_inference_runs()
(see below) and use the first one on the list.
list_inference_runs()
The list_inference_runs()
method on Models has been improved. It now handles error states better (no model, no model training data) and will return 'model_training' LAST on the list, this should improve UX for plugin components.
Examples
model = Model(\"abalone-regression\")\nmodel.list_inference_runs()\nOut[1]: ['auto_inference', 'model_training']\n\nmodel = Model(\"wine-classification\")\nmodel.list_inference_runs()\nOut[2]: ['auto_inference', 'training_holdout', 'model_training']\n\nmodel = Model(\"aqsol-mol-regression\")\nmodel.list_inference_runs()\nOut[3]: ['training_holdout', 'model_training']\n\nmodel = Model(\"empty-model-group\")\nmodel.list_inference_runs()\nOut[4]: []\n
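For plugin authors, the "use the first one on the list" advice can be sketched as follows. The helper default_inference_run is hypothetical; the input lists simply mirror the example outputs above rather than calling AWS.

```python
def default_inference_run(runs):
    '''Hypothetical helper: first run in the list, or None when there are none.'''
    return runs[0] if runs else None


# Lists shaped like the list_inference_runs() outputs shown above
picks = [default_inference_run(runs) for runs in [
    ['auto_inference', 'model_training'],
    ['training_holdout', 'model_training'],
    [],
]]
```

Because 'model_training' is returned last, the first entry is the most useful default whenever any other inference run exists, and the empty-list case still needs an explicit check.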
"},{"location":"release_notes/0_8_8/#glue-job-changes","title":"Glue Job Changes","text":"We're spinning up the CloudWatch Handler much earlier now, so if you're setting config like this:
from sageworks.utils.config_manager import ConfigManager\n\n# Set the SageWorks Config\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", args_dict[\"sageworks-bucket\"])\ncm.set_config(\"REDIS_HOST\", args_dict[\"redis-host\"])\n
Just switch out that code for this code. Note: these need to be set before importing sageworks.
import os\n\n# Set these ENV vars for SageWorks\nos.environ[\"SAGEWORKS_BUCKET\"] = args_dict[\"sageworks-bucket\"]\nos.environ[\"REDIS_HOST\"] = args_dict[\"redis-host\"]\n
"},{"location":"release_notes/0_8_8/#misc","title":"Misc","text":"Confusion Matrix support for 'ordinal' labels
Pandas has an \u2018ordinal\u2019 type, so the confusion matrix method endpoint.confusion_matrix()
now checks the label column to see if it\u2019s ordinal and uses that order; if not, it will just sort alphabetically.
Note: This change may not affect your UI experience. Confusion matrices are saved in the SageWorks/S3 metadata storage, so a bunch of stuff upstream will also need to happen: a FeatureSet object/API for setting the label order, recreation of the model/endpoint and confusion matrix, etc. In general this is a forward-looking change that will be useful later. :)
"},{"location":"release_notes/0_8_8/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"repl/","title":"SageWorks REPL","text":"Visibility and Control
The SageWorks REPL provides AWS ML Pipeline visibility just like the SageWorks Dashboard but also provides control over the creation, modification, and deletion of artifacts through the Python API.
The SageWorks REPL is a customized iPython shell. It provides tailored functionality for easy interaction with SageWorks objects and since it's based on iPython developers will feel right at home using autocomplete, history, help, etc. Both easy and powerful, the SageWorks REPL puts control of AWS ML Pipelines at your fingertips.
"},{"location":"repl/#installation","title":"Installation","text":"pip install sageworks
Just type sageworks
at the command line and the SageWorks shell will spin up and provide a command view of your AWS Machine Learning Pipelines.
At startup the SageWorks shell will connect to your AWS Account and create a summary of the Machine Learning artifacts currently residing on the account.
Available Commands:
All of the API Classes are auto-loaded, so drilling down on an individual artifact is easy. The same Python API is provided so if you want additional info on a model, for instance, simply create a model object and use any of the documented API methods.
m = Model(\"abalone-regression\")\nm.details()\n<shows info about the model>\n
"},{"location":"repl/#additional-resources","title":"Additional Resources","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"road_maps/0_9_0/#general","title":"General","text":"Streamlining
We've learned a lot from our beta testers!
One of the important lessons is not to 'over manage' AWS. We want to provide useful, granular Classes and APIs. Getting out of the way is just as important as providing functionality. So streamlining will be a big part of our 0.9.0
roadmap.
Horizontal Scaling
The framework is currently struggling when 10 ML pipelines are run concurrently. When running simultaneous pipelines we're seeing AWS service/query contention, throttling, and the occasional 'partial state' that we get back from AWS.
Our plan for 0.9.0
is to have formalized horizontal stress testing that tests everything from 4 to 32 concurrent ML pipelines. Even though 32 may not seem like much, AWS has various quotas and limits that we'll be hitting, so 32 is a good goal for 0.9.0
. Obviously once we get to 32 we'll look forward to an architecture that will support hundreds of concurrent pipelines.
Full Artifact Load or None
For SageWorks\u2019 DataSource, FeatureSet, Model, and Endpoint classes, the new functionality will ensure that objects are only instantiated when all required data is fully available, returning None if the artifact ID is invalid or if the object is only partially constructed in AWS.
By preventing partially constructed objects, this approach reduces runtime errors when accessing incomplete attributes and simplifies error handling for clients, enhancing robustness and reliability. We are looking at Pydantic for formally capturing schema and types (see Pydantic).
Onboarding
We'll have to balance the 'full artifact or None' with the need to onboard()
artifacts created outside of SageWorks. We'll probably have a class method for onboarding, something like:
my_model = Model.onboard(\"some_random_model\")\nif my_model is None:\n <handle failure to onboard>\n
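The 'full artifact or None' idea can be sketched as a factory that refuses to hand back a partially populated object. Everything here is an assumption for illustration: the Artifact class, its load() method, and the REQUIRED_KEYS fields are hypothetical and are not the planned SageWorks API.

```python
REQUIRED_KEYS = ('name', 'status', 'data_url')  # hypothetical required fields


class Artifact:
    def __init__(self, details):
        self.details = details

    @classmethod
    def load(cls, details):
        '''Return an Artifact only if every required field is present, else None.'''
        if details is None:
            return None  # invalid artifact ID: nothing came back
        if any(details.get(key) is None for key in REQUIRED_KEYS):
            return None  # partially constructed: refuse to build the object
        return cls(details)


complete = Artifact.load({'name': 'abalone', 'status': 'ready', 'data_url': 's3://bucket/key'})
partial = Artifact.load({'name': 'abalone', 'status': 'creating'})
missing = Artifact.load(None)
```

Callers then need a single `is None` check up front instead of guarding every attribute access later.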
Caching
Caching needs a substantial overhaul. Right now SageWorks overuses caching. We baked it into our AWSServiceBroker and that gets used by everything.
Caching only really makes sense when we can't wait for AWS to respond to requests. The Dashboard and Web Interfaces are the only use case where responsiveness is important. Other use cases like nightly batch processing, scripts or notebooks, will work totally fine waiting for AWS responses.
Class/API Reductions
The organic growth of SageWorks was based on user feedback and testing, and that growth has led to an abundance of Classes and API calls. We'll be identifying classes and methods that are 'cruft' from some development push and will be deprecating those.
"},{"location":"road_maps/0_9_0/#deprecation-warnings","title":"Deprecation Warnings","text":"We're starting to put in deprecation warnings as we streamline classes and APIs. If you're using a class or method that's going to be deprecated you'll see a log message like this:
broker = AWSServiceBroker()\nWARNING AWSServiceBroker is deprecated and will be removed in version 0.9.\n
If you're using a class that's NOT going to be deprecated but currently uses/relies on one that is, you'll still get a warning that you can ignore (developers will take care of it).
# This class is NOT deprecated but an internal class is\nmeta = Meta() \nWARNING AWSServiceBroker is deprecated and will be removed in version 0.9.\n
In general these warning messages will be annoying but they will help us smoothly transition and streamline our Classes and APIs.
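One common way to produce messages like these is a small decorator. The sketch below uses Python's standard warnings module rather than log messages, which is a swapped-in choice; the deprecated decorator and old_lookup function are hypothetical and not part of SageWorks.

```python
import warnings
from functools import wraps


def deprecated(version):
    '''Hypothetical decorator: warn that a function is going away.'''
    def decorator(func):
        message = f'{func.__name__} is deprecated and will be removed in version {version}.'

        @wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at this wrapper
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated('0.9')
def old_lookup():
    return 'ok'


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = old_lookup()   # still works, but emits a DeprecationWarning
```

The deprecated code path keeps working during the transition, so callers get time to migrate before the removal.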
"},{"location":"road_maps/0_9_0/#deprecations","title":"Deprecations","text":"Meta()
The new Meta()
class will provide an API that aligns with the AWS list
and describe
API. We'll have functionality for listing objects (models, feature sets, etc) and then functionality around the details for a named artifact.
meta = Meta()\nmodels_list = meta.models() # List API\nend_list = meta.endpoints() # List API\n\nfs_dict = meta.feature_set(\"my_fs\") # Describe API\nmodel_dict = meta.model(\"my_model\") # Describe API\n
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
"},{"location":"road_maps/0_9_0/#improvementsfixes","title":"Improvements/Fixes","text":"FeatureSet
When running concurrent ML pipelines we occasionally get a partially constructed FeatureSet. FeatureSets will now 'wait and fail' if they detect partially constructed data (like offline storage not being ready).
"},{"location":"road_maps/0_9_0/#internal-changes","title":"Internal Changes","text":"Meta()
We're going to make a bunch of changes to Meta()
specifically around more granular (LESS) caching. Also there will be an AWSMeta()
subclass that manages the AWS specific API calls. We'll also put stubs in for AzureMeta()
and GCPMeta()
, cause hey we might have a client who really wants that flexibility.
The new Meta class will also include an API that's more aligned to the AWS list
and describe
interfaces, allowing both broad and deep queries of the Machine Learning Artifacts within AWS.
Artifacts
We're getting rid of caching for individual Artifacts. If you're constructing an artifact object, you probably want detailed information that's 'up to date' and waiting a bit is probably fine. Note: We'll still make these instantiations as fast as we can; removing the caching logic will at least simplify the implementations.
"},{"location":"road_maps/0_9_0/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"road_maps/0_9_5/","title":"Road Map v0.9.5","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"road_maps/0_9_5/#general","title":"General","text":"ML Pipelines
We've learned a lot from our beta testers!
One of the important lessons is that when you make it easier to build ML Pipelines the users are going to build lots of pipelines.
For the creation, monitoring, and deployment of 50-100 pipelines, we need to focus on the consolidation of artifacts into Pipelines
.
Pipelines are DAGs
The use of Directed Acyclic Graphs for the storage and management of ML Pipelines will provide a good abstraction. Real world ML Pipelines will often branch multiple times: one DataSource may become two FeatureSets, which in turn may become three Models/Endpoints.
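As a minimal sketch of that branching shape, Python's standard graphlib module (3.9+) can hold the dependencies and produce a valid build order. The artifact names are hypothetical, and this is not how SageWorks will necessarily store Pipelines.

```python
from graphlib import TopologicalSorter

# Each artifact maps to the set of artifacts it is built from (hypothetical names)
pipeline = {
    'features_a': {'datasource'},
    'features_b': {'datasource'},
    'model_1': {'features_a'},
    'model_2': {'features_a'},
    'model_3': {'features_b'},
}

# static_order() yields every node after all of its predecessors
build_order = list(TopologicalSorter(pipeline).static_order())
```

One DataSource fans out to two FeatureSets and three Models here, and the ordering guarantees each artifact is created only after everything upstream of it.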
New Pipeline Dashboard Top Page
The current main page shows all the individual artifacts. As we scale up to hundreds of models we need 2 additional levels of aggregation:
New Pipeline Details Page
When a pipeline is clicked on the top page, a Pipeline details page comes up for that specific pipeline. This page will give all relevant information about the pipeline, including model performance, monitoring, and endpoint status.
Awesome image TBD
"},{"location":"road_maps/0_9_5/#versioned-artifacts","title":"Versioned Artifacts","text":"Our beta customers have requested versioning for artifacts, so we support versioning for both Models and FeatureSets. Endpoints and DataSources typically do not need versioning, so we may wait on the versioning support for those artifacts until a later version.
"},{"location":"road_maps/0_9_5/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to SageWorks","text":"The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap. It also dramatically improves both the usability and visibility across the entire spectrum of services: Glue Jobs, Athena, Feature Store, Models, and Endpoints. SageWorks makes it easy to build production ready, AWS powered, machine learning pipelines.
SageWorks Dashboard: AWS Pipelines in a Whole New Light!"},{"location":"#full-aws-overview","title":"Full AWS OverView","text":"Secure your Data, Empower your ML Pipelines
SageWorks is architected as a Private SaaS. This hybrid architecture is the ultimate solution for businesses that prioritize data control and security. SageWorks deploys as an AWS Stack within your own cloud environment, ensuring compliance with stringent corporate and regulatory standards. It offers the flexibility to tailor solutions to your specific business needs through our comprehensive plugin support, both components and full web interfaces. By using SageWorks, you maintain absolute control over your data while benefiting from the power, security, and scalability of AWS cloud services. SageWorks Private SaaS Architecture
"},{"location":"#dashboard-and-api","title":"Dashboard and API","text":"The SageWorks package has two main components, a Web Interface that provides visibility into AWS ML Pipelines and a Python API that makes creation and usage of the AWS ML Services easier than using/learning the services directly.
"},{"location":"#web-interfaces","title":"Web Interfaces","text":"The SageWorks Dashboard has a set of web interfaces that give visibility into the AWS Glue and SageMaker Services. There are currently 5 web interfaces available:
SageWorks API Documentation: SageWorks API Classes
The main functionality of the Python API is to encapsulate and manage a set of AWS services underneath a Python Object interface. The Python Classes are used to create and interact with Machine Learning Pipeline Artifacts.
"},{"location":"#getting-started","title":"Getting Started","text":"SageWorks will need some initial setup when you first start using it. See our Getting Started guide on how to connect SageWorks to your AWS Account.
"},{"location":"#additional-resources","title":"Additional Resources","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
In general SageWorks works well, out of the box, with the standard set of limits for AWS accounts. SageWorks supports throttling, timeouts, and a broad set of AWS error handling routines for general purpose usage.
When using SageWorks for large scale deployments there are a set of AWS Service limits that will need to be increased.
"},{"location":"admin/aws_service_limits/#serverless-endpoints","title":"ServerLess Endpoints","text":"There are two serverless endpoint quotas that will need to be adjusted.
When running a large set of parallel Glue/Batch Jobs that are creating FeatureGroups, some clients have hit this limit.
\"ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateFeatureGroup operation: The account-level service limit 'Maximum number of feature group creation workflows executing in parallel' is 4 FeatureGroups, with current utilization of 4 FeatureGroups and a request delta of 1 FeatureGroups. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.\"}
Unfortunately this one is not adjustable through the AWS Service Quota console and you'll have to initiate an AWS Support ticket.
"},{"location":"admin/aws_service_limits/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"admin/base_docker_push/","title":"SageWorks Base Docker Build and Push","text":"Notes and information on how to do the Docker Builds and Push to AWS ECR.
"},{"location":"admin/base_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"vi Dockerfile\n\n# Install latest Sageworks\nRUN pip install --no-cache-dir 'sageworks[ml-tool,chem]'==0.7.0\n
"},{"location":"admin/base_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Docker's 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_base:v0_7_0_amd64 --platform linux/amd64 .\n
"},{"location":"admin/base_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"You have a docker_local_base
alias in your ~/.zshrc
:)
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/base_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64\n
"},{"location":"admin/base_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
"},{"location":"admin/base_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:v0_7_0_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_base:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_base:stable\n
"},{"location":"admin/base_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_base
alias in your ~/.zshrc
:)
Notes and information on how to do the Dashboard Docker Builds and Push to AWS ECR.
"},{"location":"admin/dashboard_docker_push/#update-sageworks-version","title":"Update SageWorks Version","text":"cd applications/aws_dashboard\nvi Dockerfile\n\n# Install Sageworks (changes often)\nRUN pip install --no-cache-dir sageworks==0.4.13 <-- change this\n
"},{"location":"admin/dashboard_docker_push/#build-the-docker-image","title":"Build the Docker Image","text":"Note: For a client specific config file you'll need to copy it locally so that it's within Docker's 'build context'. If you're building the 'vanilla' open source Docker image, then you can use the open_source_config.json
that's in the directory already.
docker build --build-arg SAGEWORKS_CONFIG=open_source_config.json -t \\\nsageworks_dashboard:v0_4_13_amd64 --platform linux/amd64 .\n
Docker with Custom Plugins: If you're using custom plugins you should visit our Dashboard with Plugins page.
"},{"location":"admin/dashboard_docker_push/#test-the-image-locally","title":"Test the Image Locally","text":"You have a docker_local_dashboard
alias in your ~/.zshrc
:)
aws ecr-public get-login-password --region us-east-1 --profile \\\nscp_sandbox_admin | docker login --username AWS \\\n--password-stdin public.ecr.aws\n
"},{"location":"admin/dashboard_docker_push/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64\n
"},{"location":"admin/dashboard_docker_push/#update-the-latest-tag","title":"Update the 'latest' tag","text":"docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_4_13_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n
"},{"location":"admin/dashboard_docker_push/#update-the-stable-tag","title":"Update the 'stable' tag","text":"This is obviously only when you want to mark a version as stable. Meaning that it seems to 'be good and stable (ish)' :)
docker tag public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_5_4_amd64 \\\npublic.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
docker push public.ecr.aws/m6i5k1r2/sageworks_dashboard:stable\n
"},{"location":"admin/dashboard_docker_push/#test-the-ecr-image","title":"Test the ECR Image","text":"You have a docker_ecr_dashboard
alias in your ~/.zshrc
:)
Notes and information on how to include plugins with your SageWorks Dashboard.
If you don't already have a Dockerfile, here's one to get you started, just place this into your repo/directory that has the plugins.
# Pull base sageworks dashboard image with specific tag (pick latest or stable)\nFROM public.ecr.aws/m6i5k1r2/sageworks_dashboard:latest\n\n# Copy the plugin files into the Dashboard plugins dir\nCOPY ./sageworks_plugins /app/sageworks_plugins\nENV SAGEWORKS_PLUGINS=/app/sageworks_plugins\n
Note: Your plugins directory should look like this
sageworks_plugins/\n pages/\n my_plugin_page.py\n ...\n views/\n my_plugin_view.py\n ...\n web_components/\n my_component.py\n ...\n
"},{"location":"admin/dashboard_with_plugins/#build-it","title":"Build it","text":"docker build -t my_sageworks_with_plugins:v1_0 --platform linux/amd64 .\n
"},{"location":"admin/dashboard_with_plugins/#test-the-image-locally","title":"Test the Image Locally","text":"You'll need to use AWS Credentials for this, it's a bit complicated, please contact SageWorks Support sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"admin/dashboard_with_plugins/#login-to-your-ecr","title":"Login to your ECR","text":"Okay.. so after testing locally you're ready to push the Docker image (with Plugins) to your ECR.
Note: This ECR should be private as your plugins are customized for specific business use cases.
Your ECR location will have this form
<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com\n
aws ecr get-login-password --region us-east-1 --profile <aws_profile> \\\n| docker login --username AWS --password-stdin \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com\n
"},{"location":"admin/dashboard_with_plugins/#tagpush-the-image-to-aws-ecr","title":"Tag/Push the Image to AWS ECR","text":"docker tag my_sageworks_with_plugins:v1_0 \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/sageworks_with_plugins:v1_0\n
docker push \\\n<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/sageworks_with_plugins:v1_0\n
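The registry and image names in the login, tag, and push commands above all follow one pattern. As a minimal sketch (the helper names ecr_registry and ecr_image_uri are hypothetical and not part of SageWorks or the AWS CLI), the strings can be assembled like this:

```python
def ecr_registry(account_id: str, region: str = 'us-east-1') -> str:
    # Registry host, as passed to the docker login command above
    return f'{account_id}.dkr.ecr.{region}.amazonaws.com'


def ecr_image_uri(account_id: str, repo: str, tag: str, region: str = 'us-east-1') -> str:
    # Full image reference, as passed to the docker tag and docker push commands above
    return f'{ecr_registry(account_id, region)}/{repo}:{tag}'


if __name__ == '__main__':
    # 123456789012 is a placeholder account id
    print(ecr_image_uri('123456789012', 'sageworks_with_plugins', 'v1_0'))
```

Building the string in one place helps keep the tag and push commands pointed at the same repository and tag.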
"},{"location":"admin/dashboard_with_plugins/#deploying-plugin-docker-image-to-aws","title":"Deploying Plugin Docker Image to AWS","text":"Okay now that you have your plugin Docker Image you can deploy to your AWS account:
Copy the Dashboard CDK files
This is cheesy but just copy all the CDK files into your repo/directory.
cp -r sageworks/aws_setup/sageworks_dashboard_full /my/sageworks/stuff/\n
Change the Docker Image to Deploy
Now open up the app.py
file and change this line to your Docker Image
# When you want a different docker image change this line\ndashboard_image = \"public.ecr.aws/m6i5k1r2/sageworks_dashboard:v0_8_3_amd64\"\n
Make sure your SAGEWORKS_CONFIG
is properly set, and run the following commands:
export SAGEWORKS_CONFIG=/Users/<user_name>/.sageworks/sageworks_config.json\ncdk diff\ncdk deploy\n
CDK Diff
In particular, pay attention to the cdk diff
output: the image name should be the ONLY difference.
cdk diff\n[-] \"Image\": \"<account>.dkr.ecr.us-east-1.amazonaws.com/my-plugins:latest_123\",\n[+] \"Image\": \"<account>.dkr.ecr.us-east-1.amazonaws.com/my-plugins:latest_456\",\n
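If you want to automate that check, here is a rough sketch. It assumes the simplified diff format shown above, where removed and added lines start with [-] and [+]; real cdk diff output can carry more context, and only_image_changed is a hypothetical helper, not part of the CDK.

```python
import string


def only_image_changed(diff_text: str) -> bool:
    # Collect the removed/added lines, which start with [-] or [+]
    changed = [ln.strip() for ln in diff_text.splitlines()
               if ln.strip().startswith(('[-]', '[+]'))]
    if not changed:
        # No differences at all, so there is no new image to deploy
        return False
    for line in changed:
        # The property name sits between the marker and the first colon
        key = line[3:].split(':', 1)[0].strip(string.punctuation + ' ')
        if key != 'Image':
            return False
    return True
```

You could feed this the captured output of cdk diff and stop before cdk deploy when it returns False.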
"},{"location":"admin/dashboard_with_plugins/#note-on-sageworks-configuration","title":"Note on SageWorks Configuration","text":"All Configuration is managed by the CDK Python Script and the SAGEWORKS_CONFIG
ENV var. If you want to change things like REDIS_HOST
or SAGEWORKS_BUCKET
you should do that with a sageworks.config
file and then point the SAGEWORKS_CONFIG
ENV var to that file.
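As a small sketch of that workflow, the snippet below writes a config file and points SAGEWORKS_CONFIG at it. It assumes the config is a JSON file (suggested by the sageworks_config.json path used earlier); write_sageworks_config is a hypothetical helper and the values are placeholders.

```python
import json
import os


def write_sageworks_config(path: str, settings: dict) -> str:
    # Write the settings as JSON and point SAGEWORKS_CONFIG at the file
    with open(path, 'w') as fp:
        json.dump(settings, fp, indent=2)
    os.environ['SAGEWORKS_CONFIG'] = path
    return path


if __name__ == '__main__':
    import tempfile

    # Placeholder values: replace with your own Redis host and S3 bucket
    settings = {'REDIS_HOST': 'localhost', 'SAGEWORKS_BUCKET': 'my-sageworks-bucket'}
    config_path = os.path.join(tempfile.mkdtemp(), 'sageworks_config.json')
    write_sageworks_config(config_path, settings)
    print(os.environ['SAGEWORKS_CONFIG'])
```

Setting os.environ only affects the current Python process and its children, so you still need the shell export shown earlier before running cdk from a terminal.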
Notes and information on how to do the PyPI release for the SageWorks project. For full details on packaging you can reference this page: Packaging
The following instructions should work, but things change :)
"},{"location":"admin/pypi_release/#package-requirements","title":"Package Requirements","text":"The easiest thing to do is set up a ~/.pypirc file with the following contents:
[distutils]\nindex-servers =\n pypi\n testpypi\n\n[pypi]\nusername = __token__\npassword = pypi-AgEIcH...\n\n[testpypi]\nusername = __token__\npassword = pypi-AgENdG...\n
"},{"location":"admin/pypi_release/#tox-background","title":"Tox Background","text":"Tox will install the SageWorks package into a blank virtualenv and then execute all the tests against the newly installed package. So if everything goes okay, you know the PyPI package installed fine and the tests (which pull from the installed sageworks
package) also ran okay.
$ cd sageworks\n$ tox \n
If ALL the tests above pass...
"},{"location":"admin/pypi_release/#clean-any-previous-distribution-files","title":"Clean any previous distribution files","text":"make clean\n
"},{"location":"admin/pypi_release/#tag-the-new-version","title":"Tag the New Version","text":"git tag v0.1.8  # or whatever the new version is\ngit push --tags\n
"},{"location":"admin/pypi_release/#create-the-test-pypi-release","title":"Create the TEST PyPI Release","text":"python -m build\ntwine upload dist/* -r testpypi\n
"},{"location":"admin/pypi_release/#install-the-test-pypi-release","title":"Install the TEST PyPI Release","text":"pip install --index-url https://test.pypi.org/simple sageworks\n
"},{"location":"admin/pypi_release/#create-the-real-pypi-release","title":"Create the REAL PyPI Release","text":"twine upload dist/* -r pypi\n
"},{"location":"admin/pypi_release/#push-any-possible-changes-to-github","title":"Push any possible changes to Github","text":"git push\n
"},{"location":"admin/sageworks_docker_for_lambdas/","title":"SageWorks Docker Image for Lambdas","text":"Using the SageWorks Docker Image for AWS Lambda Jobs allows your Lambda Jobs to use and create AWS ML Pipeline Artifacts with SageWorks.
AWS, for some reason, does not allow Public ECRs to be used for Lambda Docker images, so you'll have to copy the Docker image into your private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#creating-a-private-ecr","title":"Creating a Private ECR","text":"You only need to do this if you don't already have a private ECR.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console-to-create-private-ecr","title":"AWS Console to create Private ECR","text":"In the AWS Console, create a private ECR repository named sageworks_base
. Alternatively, create the ECR repository using the AWS CLI:
aws ecr create-repository --repository-name sageworks_base --region <region>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#pulling-docker-image-into-private-ecr","title":"Pulling Docker Image into Private ECR","text":"Note: You'll only need to do this when you want to update the SageWorks Docker image
Pull the SageWorks Public ECR Image
docker pull public.ecr.aws/m6i5k1r2/sageworks_base:latest\n
Tag the image for your private ECR
docker tag public.ecr.aws/m6i5k1r2/sageworks_base:latest \\\n<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n
Push the image to your private ECR
aws ecr get-login-password --region <region> --profile <profile> | \\\ndocker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com\n\ndocker push <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#using-the-docker-image-for-your-lambdas","title":"Using the Docker Image for your Lambdas","text":"Okay, now that you have the SageWorks Docker image in your private ECR, here's how you use that image for your Lambda jobs.
"},{"location":"admin/sageworks_docker_for_lambdas/#aws-console","title":"AWS Console","text":"In the AWS Console, create the Lambda function from a container image and set the image URI to <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>
. Alternatively, create the Lambda function using the AWS CLI:
aws lambda create-function \\\n --region <region> \\\n --function-name myLambdaFunction \\\n --package-type Image \\\n --code ImageUri=<account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag> \\\n --role arn:aws:iam::<account-id>:role/<execution-role>\n
"},{"location":"admin/sageworks_docker_for_lambdas/#python-cdk","title":"Python CDK","text":"Define the Lambda function in your CDK app:
from aws_cdk import (\n    aws_ecr as ecr,\n    aws_iam as iam,\n    aws_lambda as _lambda,\n    core\n)\n\nclass MyLambdaStack(core.Stack):\n    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:\n        super().__init__(scope, id, **kwargs)\n\n        # from_ecr_image expects an ECR repository object plus a tag, not an image URI string\n        repo = ecr.Repository.from_repository_name(self, \"SageWorksRepo\", \"<private-repo>\")\n\n        _lambda.Function(self, \"MyLambdaFunction\",\n            code=_lambda.Code.from_ecr_image(repo, tag=\"<tag>\"),\n            handler=_lambda.Handler.FROM_IMAGE,\n            runtime=_lambda.Runtime.FROM_IMAGE,\n            role=iam.Role.from_role_arn(self, \"LambdaRole\", \"arn:aws:iam::<account-id>:role/<execution-role>\"))\n\napp = core.App()\nMyLambdaStack(app, \"MyLambdaStack\")\napp.synth()\n
"},{"location":"admin/sageworks_docker_for_lambdas/#cloudformation","title":"Cloudformation","text":"Define the Lambda function in your CloudFormation template.
Resources:\n MyLambdaFunction:\n Type: AWS::Lambda::Function\n Properties:\n Code:\n ImageUri: <account-id>.dkr.ecr.<region>.amazonaws.com/<private-repo>:<tag>\n Role: arn:aws:iam::<account-id>:role/<execution-role>\n PackageType: Image\n
"},{"location":"api_classes/data_source/","title":"DataSource","text":"DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. DataSource can read data from many different sources: S3 data, local files, and Pandas DataFrames.
DataSource: Handles AWS Data Catalog creation and management. DataSources are set up so that they can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). DataSources can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource","title":"DataSource
","text":" Bases: AthenaSource
DataSource: SageWorks DataSource API Class
Common Usage
my_data = DataSource(name_of_source)\nmy_data.details()\nmy_features = my_data.to_features()\n
Source code in src/sageworks/api/data_source.py
class DataSource(AthenaSource):\n \"\"\"DataSource: SageWorks DataSource API Class\n\n Common Usage:\n ```python\n my_data = DataSource(name_of_source)\n my_data.details()\n my_features = my_data.to_features()\n ```\n \"\"\"\n\n def __init__(self, source: Union[str, pd.DataFrame], name: str = None, tags: list = None, **kwargs):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (Union[str, pd.DataFrame]): Source of data (existing name, filepath, S3 path, or a Pandas DataFrame)\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Ensure the ds_name is valid\n if name:\n Artifact.is_name_valid(name)\n\n # If the data source name wasn't given, generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Sanity check for dataframe sources\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name, **kwargs)\n\n def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n 
Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().table\n query = f'SELECT * FROM \"{table}\"'\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_features(\n self,\n name: str,\n id_column: str,\n tags: list = None,\n event_time_column: str = None,\n one_hot_columns: list = None,\n ) -> Union[FeatureSet, None]:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase).\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n tags (list, optional): Set the tags for the feature set. If not specified tags will be generated\n event_time_column (str, optional): Set the event time for feature set. If not specified will be generated\n one_hot_columns (list, optional): Set the columns to be one-hot encoded. 
(default: None)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if not Artifact.is_name_valid(name):\n self.log.critical(f\"Invalid FeatureSet name: {name}, not creating FeatureSet!\")\n return None\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n id_column=id_column,\n event_time_column=event_time_column,\n one_hot_columns=one_hot_columns,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n\n def _load_source(self, source: str, name: str, tags: list):\n \"\"\"Load the source of the data\"\"\"\n self.log.info(f\"Loading source: {source}...\")\n\n # Pandas DataFrame Source\n if isinstance(source, pd.DataFrame):\n my_loader = PandasToData(name)\n my_loader.set_input(source)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # S3 Source\n source = source if isinstance(source, str) else str(source)\n if source.startswith(\"s3://\"):\n my_loader = S3ToDataSourceLight(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n\n # File Source\n elif os.path.isfile(source):\n my_loader = CSVToDataSource(source, name)\n my_loader.set_output_tags(tags)\n my_loader.transform()\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.__init__","title":"__init__(source, name=None, tags=None, **kwargs)
","text":"Initializes a new DataSource object.
Parameters:
Name Type Description Default
source
Union[str, DataFrame]
Source of data (existing name, filepath, S3 path, or a Pandas DataFrame)
required
name
str
The name of the data source (must be lowercase). If not specified, a name will be generated
None
tags
list[str]
A list of tags associated with the data source. If not specified tags will be generated.
None
Source code in src/sageworks/api/data_source.py
def __init__(self, source: Union[str, pd.DataFrame], name: str = None, tags: list = None, **kwargs):\n \"\"\"\n Initializes a new DataSource object.\n\n Args:\n source (Union[str, pd.DataFrame]): Source of data (existing name, filepath, S3 path, or a Pandas DataFrame)\n name (str): The name of the data source (must be lowercase). If not specified, a name will be generated\n tags (list[str]): A list of tags associated with the data source. If not specified tags will be generated.\n \"\"\"\n\n # Ensure the ds_name is valid\n if name:\n Artifact.is_name_valid(name)\n\n # If the data source name wasn't given, generate it\n else:\n name = extract_data_source_basename(source)\n name = Artifact.generate_valid_name(name)\n\n # Sanity check for dataframe sources\n if name == \"dataframe\":\n msg = \"Set the 'name' argument in the constructor: DataSource(df, name='my_data')\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Set the tags and load the source\n tags = [name] if tags is None else tags\n self._load_source(source, name, tags)\n\n # Call superclass init\n super().__init__(name, **kwargs)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.details","title":"details(**kwargs)
","text":"DataSource Details
Returns:
Name Type Description
dict
dict
A dictionary of details about the DataSource
Source code in src/sageworks/api/data_source.py
def details(self, **kwargs) -> dict:\n \"\"\"DataSource Details\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this DataSource
Parameters:
Name Type Description Default
include_aws_columns
bool
Include the AWS columns in the DataFrame (default: False)
False
Returns:
Type Description
DataFrame
pd.DataFrame: A DataFrame of ALL the data from this DataSource
Note
Obviously this is not recommended for large datasets :)
Source code in src/sageworks/api/data_source.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this DataSource\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this DataSource\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n table = super().table\n query = f'SELECT * FROM \"{table}\"'\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name Type Description Default
query
str
The query to run against the DataSource
required
Returns:
Type Description
DataFrame
pd.DataFrame: The results of the query
Source code in src/sageworks/api/data_source.py
def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the DataSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query)\n
"},{"location":"api_classes/data_source/#sageworks.api.data_source.DataSource.to_features","title":"to_features(name, id_column, tags=None, event_time_column=None, one_hot_columns=None)
","text":"Convert the DataSource to a FeatureSet
Parameters:
Name Type Description Default
name
str
Set the name for feature set (must be lowercase).
required
id_column
str
The ID column (must be specified, use \"auto\" for auto-generated IDs).
required
tags
list
Set the tags for the feature set. If not specified tags will be generated
None
event_time_column
str
Set the event time for feature set. If not specified will be generated
None
one_hot_columns
list
Set the columns to be one-hot encoded. (default: None)
None
Returns:
Name Type Description
FeatureSet
Union[FeatureSet, None]
The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)
Source code in src/sageworks/api/data_source.py
def to_features(\n self,\n name: str,\n id_column: str,\n tags: list = None,\n event_time_column: str = None,\n one_hot_columns: list = None,\n) -> Union[FeatureSet, None]:\n \"\"\"\n Convert the DataSource to a FeatureSet\n\n Args:\n name (str): Set the name for feature set (must be lowercase).\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n tags (list, optional): Set the tags for the feature set. If not specified tags will be generated\n event_time_column (str, optional): Set the event time for feature set. If not specified will be generated\n one_hot_columns (list, optional): Set the columns to be one-hot encoded. (default: None)\n\n Returns:\n FeatureSet: The FeatureSet created from the DataSource (or None if the FeatureSet isn't created)\n \"\"\"\n\n # Ensure the feature_set_name is valid\n if not Artifact.is_name_valid(name):\n self.log.critical(f\"Invalid FeatureSet name: {name}, not creating FeatureSet!\")\n return None\n\n # Set the Tags\n tags = [name] if tags is None else tags\n\n # Transform the DataSource to a FeatureSet\n data_to_features = DataToFeaturesLight(self.uuid, name)\n data_to_features.set_output_tags(tags)\n data_to_features.transform(\n id_column=id_column,\n event_time_column=event_time_column,\n one_hot_columns=one_hot_columns,\n )\n\n # Return the FeatureSet (which will now be up-to-date)\n return FeatureSet(name)\n
"},{"location":"api_classes/data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Create a DataSource from an S3 Path or File Path
datasource_from_s3.py
from sageworks.api.data_source import DataSource\n\n# Create a DataSource from an S3 Path (or a local file)\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\n# source_path = \"/full/path/to/local/file.csv\"\n\nmy_data = DataSource(source_path)\nprint(my_data.details())\n
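The same constructor accepts an S3 path, a local file, a DataFrame, or the name of an existing DataSource. The sketch below mirrors the checks visible in the _load_source listing above (an s3:// prefix, then os.path.isfile); classify_source is a hypothetical helper for illustration only, and treating every non-string as a DataFrame is a simplifying assumption.

```python
import os


def classify_source(source) -> str:
    # Anything that is not a string is treated as an in-memory object (e.g. a DataFrame)
    if not isinstance(source, str):
        return 'dataframe'
    # S3 paths are recognized by their prefix
    if source.startswith('s3://'):
        return 's3'
    # Local files must exist on disk
    if os.path.isfile(source):
        return 'file'
    # Otherwise assume the string names a DataSource that already exists
    return 'existing'


if __name__ == '__main__':
    print(classify_source('s3://sageworks-public-data/common/abalone.csv'))
    print(classify_source('abalone_data'))
```

One consequence of this ordering is that a mistyped local path falls through to the last branch and is treated as the name of an existing DataSource.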
Create a DataSource from a Pandas DataFrame
datasource_from_df.py
from sageworks.utils.test_data_generator import TestDataGenerator\nfrom sageworks.api.data_source import DataSource\n\n# Create a DataSource from a Pandas DataFrame\ngen_data = TestDataGenerator()\ndf = gen_data.person_data()\n\ntest_data = DataSource(df, name=\"test_data\")\nprint(test_data.details())\n
Query a DataSource
All SageWorks DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
datasource_query.py
from sageworks.api.data_source import DataSource\n\n# Grab a DataSource\nmy_data = DataSource(\"abalone_data\")\n\n# Make some queries using the Athena backend\ndf = my_data.query(\"select * from abalone_data where height > .3\")\nprint(df.head())\n\ndf = my_data.query(\"select * from abalone_data where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Create a FeatureSet from a DataSource
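datasource_to_featureset.py
from sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\n# (name and id_column are required; \"test_features\" is just an example name,\n# and \"auto\" asks for auto-generated IDs)\ntest_data = DataSource('test_data')\nmy_features = test_data.to_features(name=\"test_features\", id_column=\"auto\")\nprint(my_features.details())\n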
"},{"location":"api_classes/data_source/#sageworks-ui","title":"SageWorks UI","text":"Whenever a DataSource is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: DataSources
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/df_store/","title":"SageWorks DataFrame Storage","text":"Examples
Examples of using the DataFrame Storage class are listed at the bottom of this page: Examples.
"},{"location":"api_classes/df_store/#why-dataframe-storage","title":"Why DataFrame Storage?","text":"Great question! There are a couple of reasons. The first is that the Parameter Store in AWS has a 4KB limit, so that won't support any kind of 'real data'. The second reason is that DataFrames are commonly used as part of the data engineering, science, and ML pipeline construction process. Providing storage of named DataFrames in an accessible location that can be inspected and used by your ML Team comes in super handy.
"},{"location":"api_classes/df_store/#efficient-storage","title":"Efficient Storage","text":"All DataFrames are stored in the Parquet format using 'snappy' compression. Parquet is a columnar storage format that efficiently handles large datasets, and using Snappy compression reduces file size while maintaining fast read/write speeds.
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
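To make the 4KB Parameter Store limit mentioned under Why DataFrame Storage concrete, here is a minimal sketch that measures the JSON-encoded size of a value. The helper fits_in_parameter_store is hypothetical, and it assumes 4KB means 4096 bytes of encoded text.

```python
import json

PARAMETER_STORE_LIMIT_BYTES = 4 * 1024  # the 4KB limit, assumed to mean 4096 bytes


def fits_in_parameter_store(value, limit_bytes: int = PARAMETER_STORE_LIMIT_BYTES) -> bool:
    # Serialize to JSON and measure the encoded size in bytes
    return len(json.dumps(value).encode('utf-8')) <= limit_bytes


if __name__ == '__main__':
    small = {'learning_rate': 0.1, 'max_depth': 6}
    table = {'A': [1] * 1000, 'B': [3] * 1000}  # a small two-column table with 1000 rows
    print(fits_in_parameter_store(small))
    print(fits_in_parameter_store(table))
```

Even a modest two-column table with 1000 rows exceeds the limit once serialized, which is why DataFrames go to S3 instead.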
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore","title":"DFStore
","text":" Bases: AWSDFStore
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
Common Usage
df_store = DFStore()\n\n# List Data\ndf_store.list()\n\n# Add DataFrame\ndf = pd.DataFrame({\"A\": [1, 2], \"B\": [3, 4]})\ndf_store.upsert(\"/test/my_data\", df)\n\n# Retrieve DataFrame\ndf = df_store.get(\"/test/my_data\")\nprint(df)\n\n# Delete Data\ndf_store.delete(\"/test/my_data\")\n
Source code in src/sageworks/api/df_store.py
class DFStore(AWSDFStore):\n \"\"\"DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy\n\n Common Usage:\n ```python\n df_store = DFStore()\n\n # List Data\n df_store.list()\n\n # Add DataFrame\n df = pd.DataFrame({\"A\": [1, 2], \"B\": [3, 4]})\n df_store.upsert(\"/test/my_data\", df)\n\n # Retrieve DataFrame\n df = df_store.get(\"/test/my_data\")\n print(df)\n\n # Delete Data\n df_store.delete(\"/test/my_data\")\n ```\n \"\"\"\n\n def __init__(self, path_prefix: Union[str, None] = None):\n \"\"\"DFStore Init Method\n\n Args:\n path_prefix (Union[str, None], optional): Add a path prefix to storage locations (Defaults to None)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize the SuperClass\n super().__init__(path_prefix=path_prefix)\n\n def list(self, include_cache: bool = False) -> list:\n \"\"\"List all the objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the list (Defaults to False).\n\n Returns:\n list: A list of all the objects in the data_store prefix.\n \"\"\"\n return super().list(include_cache=include_cache)\n\n def summary(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.\n\n Args:\n include_cache (bool, optional): Include cache objects in the summary (Defaults to False).\n\n Returns:\n pd.DataFrame: A formatted DataFrame with the summary details.\n \"\"\"\n return super().summary(include_cache=include_cache)\n\n def details(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame with detailed metadata for all objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the details (Defaults to False).\n\n Returns:\n pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix.\n \"\"\"\n return super().details(include_cache=include_cache)\n\n def check(self, 
location: str) -> bool:\n \"\"\"Check if a DataFrame exists at the specified location\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n bool: True if the data exists, False otherwise.\n \"\"\"\n return super().check(location)\n\n def get(self, location: str) -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve a DataFrame from AWS S3.\n\n Args:\n location (str): The location of the data to retrieve.\n\n Returns:\n pd.DataFrame: The retrieved DataFrame or None if not found.\n \"\"\"\n _df = super().get(location)\n if _df is None:\n self.log.error(f\"Dataframe not found at location: {location}\")\n return _df\n\n def upsert(self, location: str, data: Union[pd.DataFrame, pd.Series]):\n \"\"\"Insert or update a DataFrame or Series in the AWS S3.\n\n Args:\n location (str): The location of the data.\n data (Union[pd.DataFrame, pd.Series]): The data to be stored.\n \"\"\"\n super().upsert(location, data)\n\n def last_modified(self, location: str) -> Union[datetime, None]:\n \"\"\"Get the last modified date of the DataFrame at the specified location.\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n Union[datetime, None]: The last modified date of the DataFrame or None if not found.\n \"\"\"\n return super().last_modified(location)\n\n def delete(self, location: str):\n \"\"\"Delete a DataFrame from the AWS S3.\n\n Args:\n location (str): The location of the data to delete.\n \"\"\"\n super().delete(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.__init__","title":"__init__(path_prefix=None)
","text":"DFStore Init Method
Parameters:
Name Type Description Default
path_prefix
Union[str, None]
Add a path prefix to storage locations (Defaults to None)
None
Source code in src/sageworks/api/df_store.py
def __init__(self, path_prefix: Union[str, None] = None):\n \"\"\"DFStore Init Method\n\n Args:\n path_prefix (Union[str, None], optional): Add a path prefix to storage locations (Defaults to None)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize the SuperClass\n super().__init__(path_prefix=path_prefix)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.check","title":"check(location)
","text":"Check if a DataFrame exists at the specified location
Parameters:
Name Type Description Default
location
str
The location of the data to check.
required
Returns:
Name Type Description
bool
bool
True if the data exists, False otherwise.
Source code in src/sageworks/api/df_store.py
def check(self, location: str) -> bool:\n \"\"\"Check if a DataFrame exists at the specified location\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n bool: True if the data exists, False otherwise.\n \"\"\"\n return super().check(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.delete","title":"delete(location)
","text":"Delete a DataFrame from the AWS S3.
Parameters:
Name Type Description Default
location
str
The location of the data to delete.
required
Source code in src/sageworks/api/df_store.py
def delete(self, location: str):\n \"\"\"Delete a DataFrame from the AWS S3.\n\n Args:\n location (str): The location of the data to delete.\n \"\"\"\n super().delete(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.details","title":"details(include_cache=False)
","text":"Return a DataFrame with detailed metadata for all objects in the data_store prefix.
Parameters:
Name Type Description Default
include_cache
bool
Include cache objects in the details (Defaults to False).
False
Returns:
Type Description
DataFrame
pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix.
Source code in src/sageworks/api/df_store.py
def details(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a DataFrame with detailed metadata for all objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the details (Defaults to False).\n\n Returns:\n pd.DataFrame: A DataFrame with detailed metadata for all objects in the data_store prefix.\n \"\"\"\n return super().details(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.get","title":"get(location)
","text":"Retrieve a DataFrame from AWS S3.
Parameters:
Name Type Description Default
location
str
The location of the data to retrieve.
required
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The retrieved DataFrame or None if not found.
Source code in src/sageworks/api/df_store.py
def get(self, location: str) -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve a DataFrame from AWS S3.\n\n Args:\n location (str): The location of the data to retrieve.\n\n Returns:\n pd.DataFrame: The retrieved DataFrame or None if not found.\n \"\"\"\n _df = super().get(location)\n if _df is None:\n self.log.error(f\"Dataframe not found at location: {location}\")\n return _df\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.last_modified","title":"last_modified(location)
","text":"Get the last modified date of the DataFrame at the specified location.
Parameters:
Name Type Description Default
location
str
The location of the data to check.
required
Returns:
Type Description
Union[datetime, None]
Union[datetime, None]: The last modified date of the DataFrame or None if not found.
Source code in src/sageworks/api/df_store.py
def last_modified(self, location: str) -> Union[datetime, None]:\n \"\"\"Get the last modified date of the DataFrame at the specified location.\n\n Args:\n location (str): The location of the data to check.\n\n Returns:\n Union[datetime, None]: The last modified date of the DataFrame or None if not found.\n \"\"\"\n return super().last_modified(location)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.list","title":"list(include_cache=False)
","text":"List all the objects in the data_store prefix.
Parameters:
Name Type Description Default
include_cache
bool
Include cache objects in the list (Defaults to False).
False
Returns:
Name Type Description
list
list
A list of all the objects in the data_store prefix.
Source code in src/sageworks/api/df_store.py
def list(self, include_cache: bool = False) -> list:\n \"\"\"List all the objects in the data_store prefix.\n\n Args:\n include_cache (bool, optional): Include cache objects in the list (Defaults to False).\n\n Returns:\n list: A list of all the objects in the data_store prefix.\n \"\"\"\n return super().list(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.summary","title":"summary(include_cache=False)
","text":"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_cache | bool | Include cache objects in the summary (Defaults to False). | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | A formatted DataFrame with the summary details. |

Source code in src/sageworks/api/df_store.py
def summary(self, include_cache: bool = False) -> pd.DataFrame:\n \"\"\"Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.\n\n Args:\n include_cache (bool, optional): Include cache objects in the summary (Defaults to False).\n\n Returns:\n pd.DataFrame: A formatted DataFrame with the summary details.\n \"\"\"\n return super().summary(include_cache=include_cache)\n
"},{"location":"api_classes/df_store/#sageworks.api.df_store.DFStore.upsert","title":"upsert(location, data)
","text":"Insert or update a DataFrame or Series in the AWS S3.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| location | str | The location of the data. | required |
| data | Union[DataFrame, Series] | The data to be stored. | required |

Source code in src/sageworks/api/df_store.py
def upsert(self, location: str, data: Union[pd.DataFrame, pd.Series]):\n \"\"\"Insert or update a DataFrame or Series in the AWS S3.\n\n Args:\n location (str): The location of the data.\n data (Union[pd.DataFrame, pd.Series]): The data to be stored.\n \"\"\"\n super().upsert(location, data)\n
"},{"location":"api_classes/df_store/#examples","title":"Examples","text":"These examples show how to use the DFStore()
class to list, add, and get DataFrames from AWS storage.
SageWorks REPL
If you'd like to experiment with listing, adding, and getting DataFrames with the DFStore()
class, you can spin up the SageWorks REPL and test out all the methods. Try it out! SageWorks REPL
import pandas as pd\nfrom sageworks.api.df_store import DFStore\ndf_store = DFStore()\n\n# List DataFrames\ndf_store.list()\n\nOut[1]:\nml/confustion_matrix (0.002MB/2024-09-23 16:44:48)\nml/hold_out_ids (0.094MB/2024-09-23 16:57:01)\nml/my_awesome_df (0.002MB/2024-09-23 16:43:30)\nml/shap_values (0.019MB/2024-09-23 16:57:21)\n\n# Add a DataFrame\ndf = pd.DataFrame({\"A\": [1]*1000, \"B\": [3]*1000})\ndf_store.upsert(\"test/test_df\", df)\n\n# List DataFrames (we can just use the REPR)\ndf_store\n\nOut[2]:\nml/confustion_matrix (0.002MB/2024-09-23 16:44:48)\nml/hold_out_ids (0.094MB/2024-09-23 16:57:01)\nml/my_awesome_df (0.002MB/2024-09-23 16:43:30)\nml/shap_values (0.019MB/2024-09-23 16:57:21)\ntest/test_df (0.002MB/2024-09-23 16:59:27)\n\n# Retrieve dataframes\nreturn_df = df_store.get(\"test/test_df\")\nreturn_df.head()\n\nOut[3]:\n A B\n0 1 3\n1 1 3\n2 1 3\n3 1 3\n4 1 3\n\n# Delete dataframes\ndf_store.delete(\"test/test_df\")\n
Compressed Storage is Automatic
All DataFrames are stored in the Parquet format using 'snappy' compression. Parquet is a columnar storage format that efficiently handles large datasets, and Snappy compression reduces file size while maintaining fast read/write speeds.
"},{"location":"api_classes/endpoint/","title":"Endpoint","text":"Endpoint Examples
Examples of using the Endpoint class are listed in the Examples section at the bottom of this page.
Endpoint: Manages AWS Endpoint creation and deployment. Endpoints are automatically set up and provisioned for deployment into AWS. Endpoints can be viewed in the AWS SageMaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint","title":"Endpoint
","text":" Bases: EndpointCore
Endpoint: SageWorks Endpoint API Class
Common Usage
my_endpoint = Endpoint(name)\nmy_endpoint.details()\nmy_endpoint.inference(eval_df)\n
Source code in src/sageworks/api/endpoint.py
class Endpoint(EndpointCore):\n \"\"\"Endpoint: SageWorks Endpoint API Class\n\n Common Usage:\n ```python\n my_endpoint = Endpoint(name)\n my_endpoint.details()\n my_endpoint.inference(eval_df)\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n\n def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... just FAST Inference!\n \"\"\"\n return super().fast_inference(eval_df)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the Endpoint using the FeatureSet evaluation data
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| capture | bool | Capture the inference results | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | The DataFrame with predictions |

Source code in src/sageworks/api/endpoint.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the FeatureSet evaluation data\n\n Args:\n capture (bool): Capture the inference results\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().auto_inference(capture)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.details","title":"details(**kwargs)
","text":"Endpoint Details
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the Endpoint |

Source code in src/sageworks/api/endpoint.py
def details(self, **kwargs) -> dict:\n \"\"\"Endpoint Details\n\n Returns:\n dict: A dictionary of details about the Endpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.fast_inference","title":"fast_inference(eval_df)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| eval_df | DataFrame | The DataFrame to run predictions on | required |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | The DataFrame with predictions |

Note
There are no sanity checks or error handling... just FAST Inference!
Source code in src/sageworks/api/endpoint.py
def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... just FAST Inference!\n \"\"\"\n return super().fast_inference(eval_df)\n
"},{"location":"api_classes/endpoint/#sageworks.api.endpoint.Endpoint.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| eval_df | DataFrame | The DataFrame to run predictions on | required |
| capture_uuid | str | The UUID of the capture to use (default: None) | None |
| id_column | str | The name of the column to use as the ID (default: None) | None |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | The DataFrame with predictions |

Source code in src/sageworks/api/endpoint.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n capture_uuid (str, optional): The UUID of the capture to use (default: None)\n id_column (str, optional): The name of the column to use as the ID (default: None)\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n \"\"\"\n return super().inference(eval_df, capture_uuid, id_column)\n
"},{"location":"api_classes/endpoint/#examples","title":"Examples","text":"Run Inference on an Endpoint
endpoint_inference.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model\nfrom sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks has full ML Pipeline provenance, so we can backtrack the inputs,\n# get a DataFrame of data (not used for training) and run inference\nmodel = Model(endpoint.get_input())\nfs = FeatureSet(model.get_input())\nathena_table = fs.view(\"training\").table\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = FALSE\")\n\n# Run inference/predictions on the Endpoint\nresults_df = endpoint.inference(df)\n\n# Run inference/predictions and capture the results\n# (inference() takes a capture_uuid string; \"my_capture\" is a placeholder)\nresults_df = endpoint.inference(df, capture_uuid=\"my_capture\")\n\n# Run inference/predictions using the FeatureSet evaluation data\nresults_df = endpoint.auto_inference(capture=True)\n
Output
Processing...\n class_number_of_rings prediction\n0 13 11.477922\n1 12 12.316887\n2 8 7.612847\n3 8 9.663341\n4 9 9.075263\n.. ... ...\n839 8 8.069856\n840 15 14.915502\n841 11 10.977605\n842 10 10.173433\n843 7 7.297976\n
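The output pairs each actual value (class_number_of_rings) with its prediction, which is all that is needed to compute regression metrics such as RMSE and MAE. The sketch below does this with the standard library on the first five rows shown above; regression_errors is a hypothetical helper, and SageWorks may compute its own metrics differently.

```python
import math
from typing import Sequence, Tuple


def regression_errors(pairs: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (RMSE, MAE) for a sequence of (actual, predicted) pairs."""
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    errors = [predicted - actual for actual, predicted in pairs]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# First five rows of the inference output above: (actual, predicted)
first_rows = [
    (13, 11.477922),
    (12, 12.316887),
    (8, 7.612847),
    (8, 9.663341),
    (9, 9.075263),
]
rmse, mae = regression_errors(first_rows)
```

With a real results DataFrame, the same pairs could be built from the two columns, for example with zip() over the actual and prediction columns.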
Endpoint Details
The details() method
The details()
method on the Endpoint class provides a lot of useful information. All of the SageWorks classes have a details()
method, so try it out!
from sageworks.api.endpoint import Endpoint\nfrom pprint import pprint\n\n# Get Endpoint and print out its details\nendpoint = Endpoint(\"abalone-regression-end\")\npprint(endpoint.details())\n
Output
{\n 'input': 'abalone-regression',\n 'instance': 'Serverless (2GB/5)',\n 'model_metrics': metric_name value\n 0 RMSE 2.190\n 1 MAE 1.544\n 2 R2 0.504,\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'modified': datetime.datetime(2023, 12, 29, 17, 48, 35, 115000, tzinfo=datetime.timezone.utc),\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n 'sageworks_tags': ['abalone', 'regression'],\n 'status': 'InService',\n 'uuid': 'abalone-regression-end',\n 'variant': 'AllTraffic'}\n
Endpoint Metrics
endpoint_metrics.py
from sageworks.api.endpoint import Endpoint\n\n# Grab an existing Endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# SageWorks tracks both Model performance and Endpoint Metrics\nmodel_metrics = endpoint.details()[\"model_metrics\"]\nendpoint_metrics = endpoint.endpoint_metrics()\nprint(model_metrics)\nprint(endpoint_metrics)\n
Output
metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n Invocations ModelLatency OverheadLatency ModelSetupTime Invocation5XXErrors\n29 0.0 0.00 0.00 0.00 0.0\n30 1.0 1.11 23.73 23.34 0.0\n31 0.0 0.00 0.00 0.00 0.0\n48 0.0 0.00 0.00 0.00 0.0\n49 5.0 0.45 9.64 23.57 0.0\n50 2.0 0.57 0.08 0.00 0.0\n51 0.0 0.00 0.00 0.00 0.0\n60 4.0 0.33 5.80 22.65 0.0\n61 1.0 1.11 23.35 23.10 0.0\n62 0.0 0.00 0.00 0.00 0.0\n...\n
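Most rows in the endpoint metrics have zero invocations, so a plain average of ModelLatency would be misleading. One option is an invocation-weighted mean, sketched here with the standard library on a few rows copied from the output above. The helper name is hypothetical, and the sketch makes no assumption about the latency units.

```python
from typing import Iterable, Optional, Tuple


def weighted_model_latency(rows: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Invocation-weighted mean latency over (invocations, latency) rows.

    Returns None when there were no invocations at all.
    """
    total_invocations = 0.0
    weighted_sum = 0.0
    for invocations, latency in rows:
        total_invocations += invocations
        weighted_sum += invocations * latency
    if total_invocations == 0:
        return None
    return weighted_sum / total_invocations


# (Invocations, ModelLatency) pairs taken from the sample output above
sample = [(0.0, 0.00), (1.0, 1.11), (5.0, 0.45), (2.0, 0.57), (4.0, 0.33), (1.0, 1.11)]
mean_latency = weighted_model_latency(sample)
```

Weighting by invocations keeps idle intervals from dragging the average toward zero.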
"},{"location":"api_classes/endpoint/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint. The Endpoint artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI. SageWorks will monitor the endpoint, plot invocations, latencies, and tracks error metrics.
SageWorks Dashboard: Endpoints
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/feature_set/","title":"FeatureSet","text":"FeatureSet Examples
Examples of using the FeatureSet Class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually but the SageWorks FeatureSet makes it a breeze!
FeatureSet: Manages AWS Feature Store/Group creation and management. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). FeatureSets can be viewed and explored within the SageWorks Dashboard UI.
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet","title":"FeatureSet
","text":" Bases: FeatureSetCore
FeatureSet: SageWorks FeatureSet API Class
Common Usage
my_features = FeatureSet(name)\nmy_features.details()\nmy_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"],\n)\n
Source code in src/sageworks/api/feature_set.py
class FeatureSet(FeatureSetCore):\n \"\"\"FeatureSet: SageWorks FeatureSet API Class\n\n Common Usage:\n ```python\n my_features = FeatureSet(name)\n my_features.details()\n my_features.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\"\n feature_list=[\"my\", \"best\", \"features\"])\n )\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n\n def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n\n def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n ) -> Union[Model, None]:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n 
model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet (or None if the Model could not be created)\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n if not Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False):\n self.log.critical(f\"Invalid Model name: {name}, not creating Model!\")\n return None\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.details","title":"details(**kwargs)
","text":"FeatureSet Details
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the FeatureSet |

Source code in src/sageworks/api/feature_set.py
def details(self, **kwargs) -> dict:\n \"\"\"FeatureSet Details\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.pull_dataframe","title":"pull_dataframe(include_aws_columns=False)
","text":"Return a DataFrame of ALL the data from this FeatureSet
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | A DataFrame of ALL the data from this FeatureSet |

Note
Obviously this is not recommended for large datasets :)
Source code in src/sageworks/api/feature_set.py
def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:\n \"\"\"Return a DataFrame of ALL the data from this FeatureSet\n\n Args:\n include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of ALL the data from this FeatureSet\n\n Note:\n Obviously this is not recommended for large datasets :)\n \"\"\"\n\n # Get the table associated with the data\n self.log.info(f\"Pulling all data from {self.uuid}...\")\n query = f\"SELECT * FROM {self.athena_table}\"\n df = self.query(query)\n\n # Drop any columns generated from AWS\n if not include_aws_columns:\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n df = df.drop(columns=aws_cols, errors=\"ignore\")\n return df\n
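The column-dropping step in pull_dataframe() removes four AWS-generated columns and silently ignores any that are absent. The same idea can be sketched without pandas on plain row dictionaries; the column names come from the source above, while drop_aws_columns and the sample row are illustrative only.

```python
from typing import Dict, Iterable, List

# Column names copied from pull_dataframe() above
AWS_COLUMNS = ("write_time", "api_invocation_time", "is_deleted", "event_time")


def drop_aws_columns(rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return copies of the rows without the AWS-generated columns.

    Missing columns are ignored, mirroring errors="ignore" in DataFrame.drop.
    """
    return [
        {key: value for key, value in row.items() if key not in AWS_COLUMNS}
        for row in rows
    ]


# Hypothetical row for illustration
rows = [{"id": 1, "height": 0.3, "write_time": "2024-01-01", "is_deleted": False}]
clean_rows = drop_aws_columns(rows)
```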
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.query","title":"query(query, **kwargs)
","text":"Query the AthenaSource
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The query to run against the FeatureSet | required |

Returns:
| Type | Description |
| --- | --- |
| DataFrame | The results of the query |

Source code in src/sageworks/api/feature_set.py
def query(self, query: str, **kwargs) -> pd.DataFrame:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the FeatureSet\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n return super().query(query, **kwargs)\n
"},{"location":"api_classes/feature_set/#sageworks.api.feature_set.FeatureSet.to_model","title":"to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)
","text":"Create a Model from the FeatureSet
Args:
model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\nmodel_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\nname (str): Set the name for the model. If not specified, a name will be generated\ntags (list): Set the tags for the model. If not specified tags will be generated.\ndescription (str): Set the description for the model. If not specified a description is generated.\nfeature_list (list): Set the feature list for the model. If not specified a feature list is generated.\ntarget_column (str): The target column for the model (use None for unsupervised model)\n
Returns:
| Name | Type | Description |
| --- | --- | --- |
| Model | Union[Model, None] | The Model created from the FeatureSet (or None if the Model could not be created) |

Source code in src/sageworks/api/feature_set.py
def to_model(\n self,\n model_type: ModelType = ModelType.UNKNOWN,\n model_class: str = None,\n name: str = None,\n tags: list = None,\n description: str = None,\n feature_list: list = None,\n target_column: str = None,\n **kwargs,\n) -> Union[Model, None]:\n \"\"\"Create a Model from the FeatureSet\n\n Args:\n\n model_type (ModelType): The type of model to create (See sageworks.model.ModelType)\n model_class (str): The model class to use for the model (e.g. \"KNeighborsRegressor\", default: None)\n name (str): Set the name for the model. If not specified, a name will be generated\n tags (list): Set the tags for the model. If not specified tags will be generated.\n description (str): Set the description for the model. If not specified a description is generated.\n feature_list (list): Set the feature list for the model. If not specified a feature list is generated.\n target_column (str): The target column for the model (use None for unsupervised model)\n\n Returns:\n Model: The Model created from the FeatureSet (or None if the Model could not be created)\n \"\"\"\n\n # Ensure the model_name is valid\n if name:\n if not Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False):\n self.log.critical(f\"Invalid Model name: {name}, not creating Model!\")\n return None\n\n # If the model_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-model\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Model Tags\n tags = [name] if tags is None else tags\n\n # Transform the FeatureSet into a Model\n features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)\n features_to_model.set_output_tags(tags)\n features_to_model.transform(\n target_column=target_column, description=description, feature_list=feature_list, **kwargs\n )\n\n # Return the Model\n return Model(name)\n
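When no name is passed, to_model() derives one from the FeatureSet uuid by removing "_features" and appending "-model", then passes the result through Artifact.generate_valid_name. The sketch below reproduces the first step exactly as in the source. The normalization step is only a hypothetical stand-in, since the behavior of generate_valid_name is not shown here.

```python
import re


def default_model_name(feature_set_uuid: str) -> str:
    """Approximate the default Model name that to_model() would generate."""
    # Step 1 (from the source above): strip "_features" and append "-model"
    name = feature_set_uuid.replace("_features", "") + "-model"
    # Step 2 (assumption): stand-in for Artifact.generate_valid_name(name, delimiter="-");
    # here we lowercase and collapse runs of other characters into "-"
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
```

Under these assumptions a FeatureSet named test_features yields test-model, which matches the uuid shown in the "Create a Model from a FeatureSet" example output.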
"},{"location":"api_classes/feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Create a FeatureSet from a DataSource
datasource_to_featureset.py
from sageworks.api.data_source import DataSource\n\n# Convert the Data Source to a Feature Set\nds = DataSource('test_data')\nfs = ds.to_features(\"test_features\", id_column=\"id\")\nprint(fs.details())\n
FeatureSet EDA Statistics
featureset_eda.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\nimport pandas as pd\n\n# Grab a FeatureSet and pull some of the EDA Stats\nmy_features = FeatureSet('test_features')\n\n# Grab some of the EDA Stats\ncorr_data = my_features.correlations()\ncorr_df = pd.DataFrame(corr_data)\nprint(corr_df)\n\n# Get some outliers\noutliers = my_features.outliers()\npprint(outliers.head())\n\n# Full set of EDA Stats\neda_stats = my_features.column_stats()\npprint(eda_stats)\n
Output
 age food_pizza food_steak food_sushi food_tacos height id iq_score\nage NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513\nfood_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378\nfood_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477\nfood_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033\nfood_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364\nheight 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210\nid -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020\niq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN\n\n name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group\n0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low\n1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low\n2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low\n3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high\n\n<lots of EDA data and statistics>\n
Query a FeatureSet
All SageWorks FeatureSets have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.
featureset_query.py
from sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Make some queries using the Athena backend\ndf = my_features.query(\"select * from abalone_features where height > .3\")\nprint(df.head())\n\ndf = my_features.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
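Since query() accepts a raw SQL string, it can be convenient to build that string in one place and check the table name before interpolating it. The following is a minimal sketch; select_where is a hypothetical helper, and the identifier rule (letters, digits, underscores) is an assumption rather than Athena's full naming grammar.

```python
import re

# Assumption: table names are simple identifiers like "abalone_features"
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def select_where(table: str, condition: str) -> str:
    """Build a 'SELECT * ... WHERE ...' string after checking the table name."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"Unexpected table name: {table!r}")
    return f"SELECT * FROM {table} WHERE {condition}"


query = select_where("abalone_features", "height > .3")
```

The resulting string could be passed to my_features.query(query). The condition itself is not validated here, so it should only come from trusted code.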
Create a Model from a FeatureSet
featureset_to_model.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet('test_features')\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR, \n# UNSUPERVISED, or TRANSFORMER\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR, \n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
"},{"location":"api_classes/feature_set/#sageworks-ui","title":"SageWorks UI","text":"Whenever a FeatureSet is created SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
SageWorks Dashboard: FeatureSets
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/meta/","title":"Meta","text":"Meta Examples
Examples of using the Meta class are listed in the Examples section at the bottom of this page.
Meta: A class that provides high level information and summaries of Cloud Platform Artifacts. The Meta class provides 'account' information, configuration, etc. It also provides metadata for Artifacts, such as Data Sources, Feature Sets, Models, and Endpoints.
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta","title":"Meta
","text":" Bases: CloudMeta
Meta: A class that provides metadata functionality for Cloud Platform Artifacts.
Common Usage
from sageworks.api import Meta\nmeta = Meta()\n\n# Get the AWS Account Info\nmeta.account()\nmeta.config()\n\n# These are 'list' methods\nmeta.etl_jobs()\nmeta.data_sources()\nmeta.feature_sets(details=False) # or details=True\nmeta.models(details=False) # or details=True\nmeta.endpoints()\nmeta.views()\n\n# These are 'describe' methods\nmeta.data_source(\"abalone_data\")\nmeta.feature_set(\"abalone_features\")\nmeta.model(\"abalone-regression\")\nmeta.endpoint(\"abalone-endpoint\")\n
Source code in src/sageworks/api/meta.py
class Meta(CloudMeta):\n \"\"\"Meta: A class that provides metadata functionality for Cloud Platform Artifacts.\n\n Common Usage:\n ```python\n from sageworks.api import Meta\n meta = Meta()\n\n # Get the AWS Account Info\n meta.account()\n meta.config()\n\n # These are 'list' methods\n meta.etl_jobs()\n meta.data_sources()\n meta.feature_sets(details=True/False)\n meta.models(details=True/False)\n meta.endpoints()\n meta.views()\n\n # These are 'describe' methods\n meta.data_source(\"abalone_data\")\n meta.feature_set(\"abalone_features\")\n meta.model(\"abalone-regression\")\n meta.endpoint(\"abalone-endpoint\")\n ```\n \"\"\"\n\n def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n\n def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n\n def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. 
Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n\n def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n\n def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n\n def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n\n def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n\n def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. 
Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(table_name=data_source_name, database=database)\n\n def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_group_name=feature_set_name)\n\n def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_group_name=model_name)\n\n def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n\n def __repr__(self):\n return f\"Meta()\\n\\t{super().__repr__()}\"\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.account","title":"account()
","text":"Cloud Platform Account Info
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | Cloud Platform Account Info |
Source code in src/sageworks/api/meta.py
def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The current SageWorks Configuration |
Source code in src/sageworks/api/meta.py
def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_source","title":"data_source(data_source_name, database='sageworks')
","text":"Get the details of a specific Data Source
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_source_name | str | The name of the Data Source | required |
| database | str | The Glue database. Defaults to 'sageworks'. | 'sageworks' |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | Union[dict, None] | The details of the Data Source (None if not found) |
Source code in src/sageworks/api/meta.py
def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(table_name=data_source_name, database=database)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources deployed in the Cloud Platform
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform |
Source code in src/sageworks/api/meta.py
def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoint","title":"endpoint(endpoint_name)
","text":"Get the details of a specific Endpoint
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| endpoint_name | str | The name of the Endpoint | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | Union[dict, None] | The details of the Endpoint (None if not found) |
Source code in src/sageworks/api/meta.py
def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.endpoints","title":"endpoints()
","text":"Get a summary of the Endpoints deployed in the Cloud Platform
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the Endpoints in the Cloud Platform |
Source code in src/sageworks/api/meta.py
def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.etl_jobs","title":"etl_jobs()
","text":"Get summary data about Extract, Transform, Load (ETL) Jobs
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform |
Source code in src/sageworks/api/meta.py
def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_set","title":"feature_set(feature_set_name)
","text":"Get the details of a specific Feature Set
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| feature_set_name | str | The name of the Feature Set | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | Union[dict, None] | The details of the Feature Set (None if not found) |
Source code in src/sageworks/api/meta.py
def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_group_name=feature_set_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.feature_sets","title":"feature_sets(details=False)
","text":"Get a summary of the Feature Sets deployed in the Cloud Platform
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| details | bool | Include detailed information. Defaults to False. | False |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform |
Source code in src/sageworks/api/meta.py
def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.glue_job","title":"glue_job(job_name)
","text":"Get the details of a specific Glue Job
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| job_name | str | The name of the Glue Job | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | Union[dict, None] | The details of the Glue Job (None if not found) |
Source code in src/sageworks/api/meta.py
def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming raw data
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the incoming raw data |
Source code in src/sageworks/api/meta.py
def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.model","title":"model(model_name)
","text":"Get the details of a specific Model
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | The name of the Model | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | Union[dict, None] | The details of the Model (None if not found) |
Source code in src/sageworks/api/meta.py
def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_group_name=model_name)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.models","title":"models(details=False)
","text":"Get a summary of the Models deployed in the Cloud Platform
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| details | bool | Include detailed information. Defaults to False. | False |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of the Models deployed in the Cloud Platform |
Source code in src/sageworks/api/meta.py
def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n
"},{"location":"api_classes/meta/#sageworks.api.meta.Meta.views","title":"views(database='sageworks')
","text":"Get a summary of all the Views, for the given database, in AWS
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| database | str | Glue database. Defaults to 'sageworks'. | 'sageworks' |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: A summary of all the Views, for the given database, in AWS |
Source code in src/sageworks/api/meta.py
def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n
"},{"location":"api_classes/meta/#examples","title":"Examples","text":"These examples show how to use the Meta() class to pull lists of artifacts from AWS: DataSources, FeatureSets, Models, Endpoints, and more. If you're building a web interface plugin, the Meta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the Meta() class, you can spin up the SageWorks REPL, use the class, and test out all the methods. Try it out! SageWorks REPL
meta = Meta()\nmodel_df = meta.models()\nmodel_df\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
meta_list_models.py
from pprint import pprint\nfrom sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more detailed data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names:\n pprint(meta.model(name))\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
Getting Model Performance Metrics
meta_model_metrics.py
from sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Models\nmeta = Meta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more detailed data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names[:5]:\n model_details = meta.model(name)\n print(f\"\\n\\nModel: {name}\")\n performance_metrics = model_details[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n print(f\"\\tPerformance Metrics: {performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
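The lookup in the example above assumes that `meta.model(name)` returned a dictionary containing both keys. Since `meta.model()` returns None when the model is not found, a more defensive lookup might look like the sketch below. The helper name `inference_metrics` and the sample dictionaries are illustrative assumptions, not part of the SageWorks API.

```python
from typing import Any, Optional


def inference_metrics(model_details: Optional[dict]) -> Optional[Any]:
    """Return the stored inference metrics, or None if they are unavailable.

    Hypothetical helper: it only mirrors the two dictionary keys used in the
    example above and guards against missing data.
    """
    if not model_details:
        # meta.model() returns None when the model is not found
        return None
    sageworks_meta = model_details.get("sageworks_meta") or {}
    return sageworks_meta.get("sageworks_inference_metrics")


# Hypothetical details dictionaries shaped like the lookups in the example above
found = {"sageworks_meta": {"sageworks_inference_metrics": [{"MAE": 1.64, "RMSE": 2.246}]}}
print(inference_metrics(found))                     # metrics are present
print(inference_metrics(None))                      # model not found
print(inference_metrics({"sageworks_meta": {}}))    # no metrics stored yet
```

In the loop above, `inference_metrics(meta.model(name))` could replace the direct double-key lookup, so that a missing model or missing metrics yields None rather than an exception.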
List the Endpoints in AWS
meta_list_endpoints.py
from pprint import pprint\nfrom sageworks.api import Meta\n\n# Create our Meta Class and get a list of our Endpoints\nmeta = Meta()\nendpoint_df = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoint_df)}\")\nprint(endpoint_df)\n\n# Get more detailed data on the Endpoints\nendpoint_names = endpoint_df[\"Name\"].tolist()\nfor name in endpoint_names:\n pprint(meta.endpoint(name))\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\n<lots of details about endpoints>\n
Not Finding some particular AWS Data?
Some SageWorks Meta API methods, such as models() and feature_sets(), also accept a details=True argument, so make sure to check those out.
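As a minimal sketch of that option, the function below calls the two Meta methods documented above with a `details` parameter, `feature_sets()` and `models()`, passing `details=True` to each. The function name `detailed_summaries` is a hypothetical wrapper, and which extra columns come back with `details=True` is not specified on this page.

```python
from typing import Any, Dict


def detailed_summaries(meta: Any) -> Dict[str, Any]:
    """Request the detailed summaries from a Meta-like object.

    Hypothetical wrapper: `meta` is expected to provide the models() and
    feature_sets() methods shown in the Meta class listing above.
    """
    return {
        "feature_sets": meta.feature_sets(details=True),
        "models": meta.models(details=True),
    }


# Usage (requires SageWorks and a configured AWS account):
#   from sageworks.api import Meta
#   summaries = detailed_summaries(Meta())
#   print(summaries["models"])
```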
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
Model: Manages AWS Model Package/Group creation and management.
Models are automatically set up and provisioned for deployment into AWS. Models can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional model details and performance metrics.
"},{"location":"api_classes/model/#sageworks.api.model.Model","title":"Model
","text":" Bases: ModelCore
Model: SageWorks Model API Class.
Common Usage
my_model = Model(name)\nmy_model.details()\nmy_model.to_endpoint()\n
Source code in src/sageworks/api/model.py
class Model(ModelCore):\n \"\"\"Model: SageWorks Model API Class.\n\n Common Usage:\n ```python\n my_model = Model(name)\n my_model.details()\n my_model.to_endpoint()\n ```\n \"\"\"\n\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n\n def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False)\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.details","title":"details(**kwargs)
","text":"Retrieve the Model Details.
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the Model |
Source code in src/sageworks/api/model.py
def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the Model Details.\n\n Returns:\n dict: A dictionary of details about the Model\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"api_classes/model/#sageworks.api.model.Model.to_endpoint","title":"to_endpoint(name=None, tags=None, serverless=True)
","text":"Create an Endpoint from the Model.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Set the name for the endpoint. If not specified, an automatic name will be generated | None |
| tags | list | Set the tags for the endpoint. If not specified automatic tags will be generated. | None |
| serverless | bool | Set the endpoint to be serverless (default: True) | True |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| Endpoint | Endpoint | The Endpoint created from the Model |
Source code in src/sageworks/api/model.py
def to_endpoint(self, name: str = None, tags: list = None, serverless: bool = True) -> Endpoint:\n \"\"\"Create an Endpoint from the Model.\n\n Args:\n name (str): Set the name for the endpoint. If not specified, an automatic name will be generated\n tags (list): Set the tags for the endpoint. If not specified automatic tags will be generated.\n serverless (bool): Set the endpoint to be serverless (default: True)\n\n Returns:\n Endpoint: The Endpoint created from the Model\n \"\"\"\n\n # Ensure the endpoint_name is valid\n if name:\n Artifact.is_name_valid(name, delimiter=\"-\", lower_case=False)\n\n # If the endpoint_name wasn't given generate it\n else:\n name = self.uuid.replace(\"_features\", \"\") + \"-end\"\n name = Artifact.generate_valid_name(name, delimiter=\"-\")\n\n # Create the Endpoint Tags\n tags = [name] if tags is None else tags\n\n # Create an Endpoint from the Model\n model_to_endpoint = ModelToEndpoint(self.uuid, name, serverless=serverless)\n model_to_endpoint.set_output_tags(tags)\n model_to_endpoint.transform()\n\n # Return the Endpoint\n return Endpoint(name)\n
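To make the defaulting logic in `to_endpoint()` easier to follow, here is a standalone sketch of just the name and tag defaults from the source above. The function name `default_endpoint_name_and_tags` is hypothetical, and the sketch deliberately leaves out the `Artifact.is_name_valid()` and `Artifact.generate_valid_name()` calls, whose exact behavior is not shown on this page.

```python
from typing import List, Optional, Tuple


def default_endpoint_name_and_tags(
    model_uuid: str,
    name: Optional[str] = None,
    tags: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    """Mirror the name/tag defaults used by Model.to_endpoint() (sketch only)."""
    if not name:
        # No name given: strip any "_features" substring and append "-end"
        name = model_uuid.replace("_features", "") + "-end"
    # No tags given: the endpoint name becomes the only tag
    tags = [name] if tags is None else tags
    return name, tags


print(default_endpoint_name_and_tags("abalone-regression"))
print(default_endpoint_name_and_tags("abalone-regression", name="my-end", tags=["abalone"]))
```

The first call shows how a model named `abalone-regression` ends up with a default endpoint name ending in `-end`. The real method may adjust the generated name further through `Artifact.generate_valid_name()`.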
"},{"location":"api_classes/model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Create a Model from a FeatureSet
featureset_to_model.py
from sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import ModelType\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"test_features\")\n\n# Create a Model from the FeatureSet\n# Note: ModelTypes can be CLASSIFIER, REGRESSOR (XGBoost is default)\nmy_model = my_features.to_model(model_type=ModelType.REGRESSOR,\n target_column=\"iq_score\")\npprint(my_model.details())\n
Output
{'approval_status': 'Approved',\n 'content_types': ['text/csv'],\n ...\n 'inference_types': ['ml.t2.medium'],\n 'input': 'test_features',\n 'model_metrics': metric_name value\n 0 RMSE 7.924\n 1 MAE 6.554,\n 2 R2 0.604,\n 'regression_predictions': iq_score prediction\n 0 136.519012 139.964460\n 1 133.616974 130.819950\n 2 122.495415 124.967834\n 3 133.279510 121.010284\n 4 127.881073 113.825005\n ...\n 'response_types': ['text/csv'],\n 'sageworks_tags': ['test-model'],\n 'shapley_values': None,\n 'size': 0.0,\n 'status': 'Completed',\n 'transform_types': ['ml.m5.large'],\n 'uuid': 'test-model',\n 'version': 1}\n
Use a specific Scikit-Learn Model
featureset_to_knn.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"abalone_features\")\n\n# Transform FeatureSet into KNN Regression Model\n# Note: model_class can be any scikit-learn model\n# \"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", etc\nmy_model = my_features.to_model(\n model_class=\"KNeighborsRegressor\",\n target_column=\"class_number_of_rings\",\n name=\"abalone-knn-reg\",\n description=\"Abalone KNN Regression\",\n tags=[\"abalone\", \"knn\"],\n train_all_data=True,\n)\npprint(my_model.details())\n
Another Scikit-Learn Example
featureset_to_rfc.py
from sageworks.api.feature_set import FeatureSet\nfrom pprint import pprint\n\n# Grab a FeatureSet\nmy_features = FeatureSet(\"wine_features\")\n\n# Using a Scikit-Learn Model\n# Note: model_class can be any scikit-learn model (\"KNeighborsRegressor\", \"BayesianRidge\",\n# \"GaussianNB\", \"AdaBoostClassifier\", \"Ridge\", \"Lasso\", \"SVC\", \"SVR\", etc...)\nmy_model = my_features.to_model(\n model_class=\"RandomForestClassifier\",\n target_column=\"wine_class\",\n name=\"wine-rfc-class\",\n description=\"Wine RandomForest Classification\",\n tags=[\"wine\", \"rfc\"]\n)\npprint(my_model.details())\n
Create an Endpoint from a Model
Endpoint Costs
Serverless endpoints are a great option: they incur no AWS charges when not running. A realtime endpoint has lower latency (no cold start), but AWS charges an hourly fee, which can add up quickly!
model_to_endpoint.py
from sageworks.api.model import Model\n\n# Grab the abalone regression Model\nmodel = Model(\"abalone-regression\")\n\n# By default, an Endpoint is serverless, you can\n# make a realtime endpoint with serverless=False\nmodel.to_endpoint(name=\"abalone-regression-end\",\n tags=[\"abalone\", \"regression\"],\n serverless=True)\n
Model Health Check and Metrics
model_metrics.py
from sageworks.api.model import Model\n\n# Grab the abalone-regression Model\nmodel = Model(\"abalone-regression\")\n\n# Perform a health check on the model\n# Note: The health_check() method returns 'issues' if there are any\n# problems, so if there are no issues, the model is healthy\nhealth_issues = model.health_check()\nif not health_issues:\n print(\"Model is Healthy\")\nelse:\n print(\"Model has issues\")\n print(health_issues)\n\n# Get the model metrics and regression predictions\nprint(model.model_metrics())\nprint(model.regression_predictions())\n
Output
Model is Healthy\n metric_name value\n0 RMSE 2.190\n1 MAE 1.544\n2 R2 0.504\n\n class_number_of_rings prediction\n0 9 8.648378\n1 11 9.717787\n2 11 10.933070\n3 10 9.899738\n4 9 10.014504\n.. ... ...\n495 10 10.261657\n496 9 10.788254\n497 13 7.779886\n498 12 14.718514\n499 13 10.637320\n
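For readers who want to relate the metrics table to the prediction rows, the sketch below computes RMSE, MAE, and R2 from paired actual/predicted values using the standard textbook definitions. Whether SageWorks computes its metrics in exactly this way is an assumption, and the function name `regression_metrics` is hypothetical. The sample values are the first five rows of the output above, so the results will not match the reported metrics, which cover all the prediction rows.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute RMSE, MAE, and R2 using the standard definitions (sketch only)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # R2 is undefined when every actual value is identical
        "R2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# First five rows of the regression predictions shown above
actual = [9, 11, 11, 10, 9]
predicted = [8.648378, 9.717787, 10.933070, 9.899738, 10.014504]
print(regression_metrics(actual, predicted))
```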
"},{"location":"api_classes/model/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates an AWS Model Package Group and an AWS Model Package. These model artifacts can be viewed in the Sagemaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
SageWorks Dashboard: Models
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available, please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/monitor/","title":"Monitor","text":"Monitor Examples
Examples of using the Monitor class are listed at the bottom of this page: Examples.
Monitor: Manages AWS Endpoint Monitor creation and deployment. Endpoint Monitors are set up and provisioned for deployment into AWS. Monitors can be viewed in the AWS Sagemaker interfaces or in the SageWorks Dashboard UI, which provides additional monitor details and performance metrics.
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor","title":"Monitor
","text":" Bases: MonitorCore
Monitor: SageWorks Monitor API Class
Common Usage
mon = Endpoint(name).get_monitor() # Pull from endpoint OR\nmon = Monitor(name) # Create using Endpoint Name\nmon.summary()\nmon.details()\n\n# One time setup methods\nmon.add_data_capture()\nmon.create_baseline()\nmon.create_monitoring_schedule()\n\n# Pull information from the monitor\nbaseline_df = mon.get_baseline()\nconstraints_df = mon.get_constraints()\nstats_df = mon.get_statistics()\ninput_df, output_df = mon.get_latest_data_capture()\n
Source code in src/sageworks/api/monitor.py
class Monitor(MonitorCore):\n \"\"\"Monitor: SageWorks Monitor API Class\n\n Common Usage:\n ```\n mon = Endpoint(name).get_monitor() # Pull from endpoint OR\n mon = Monitor(name) # Create using Endpoint Name\n mon.summary()\n mon.details()\n\n # One time setup methods\n mon.add_data_capture()\n mon.create_baseline()\n mon.create_monitoring_schedule()\n\n # Pull information from the monitor\n baseline_df = mon.get_baseline()\n constraints_df = mon.get_constraints()\n stats_df = mon.get_statistics()\n input_df, output_df = mon.get_latest_data_capture()\n ```\n \"\"\"\n\n def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n\n def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for this Monitor/endpoint.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| capture_percentage | int | Percentage of data to capture. Defaults to 100. | 100 |
Source code in src/sageworks/api/monitor.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for this Monitor/endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n super().add_data_capture(capture_percentage)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_baseline","title":"create_baseline(recreate=False)
","text":"Code to create a baseline for monitoring
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| recreate | bool | If True, recreate the baseline even if it already exists | False |
Notes
This will create/write three files to the baseline_dir:
- baseline.csv
- constraints.json
- statistics.json
Source code in src/sageworks/api/monitor.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n super().create_baseline(recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Sets up the monitoring schedule for the model endpoint.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schedule | str | The schedule for the monitoring job (hourly or daily, defaults to hourly). | 'hourly' |
| recreate | bool | If True, recreate the monitoring schedule even if it already exists. | False |
Source code in src/sageworks/api/monitor.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n super().create_monitoring_schedule(schedule, recreate)\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.details","title":"details()
","text":"Monitor Details
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of details about the Monitor |
Source code in src/sageworks/api/monitor.py
def details(self) -> dict:\n \"\"\"Monitor Details\n\n Returns:\n dict: A dictionary of details about the Monitor\n \"\"\"\n return super().details()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
| Type | Description |
| --- | --- |
| Union[DataFrame, None] | pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist) |
Source code in src/sageworks/api/monitor.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n return super().get_baseline()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
| Type | Description |
| --- | --- |
| Union[DataFrame, None] | pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist) |
Source code in src/sageworks/api/monitor.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return super().get_constraints()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture input and output from S3.
Returns:
| Type | Description |
| --- | --- |
| (DataFrame, DataFrame) | DataFrame (input), DataFrame (output): Flattened and processed DataFrames for input and output data. |
Source code in src/sageworks/api/monitor.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture input and output from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n return super().get_latest_data_capture()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns:
| Type | Description |
| --- | --- |
| Union[DataFrame, None] | pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist) |
Source code in src/sageworks/api/monitor.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return super().get_statistics()\n
"},{"location":"api_classes/monitor/#sageworks.api.monitor.Monitor.summary","title":"summary()
","text":"Monitor Summary
Returns:
| Name | Type | Description |
| --- | --- | --- |
| dict | dict | A dictionary of summary information about the Monitor |
Source code in src/sageworks/api/monitor.py
def summary(self) -> dict:\n \"\"\"Monitor Summary\n\n Returns:\n dict: A dictionary of summary information about the Monitor\n \"\"\"\n return super().summary()\n
"},{"location":"api_classes/monitor/#examples","title":"Examples","text":"Initial Setup of the Endpoint Monitor
monitor_setup.py
from sageworks.api.monitor import Monitor\n\n# Create an Endpoint Monitor Class and perform initial Setup\nendpoint_name = \"abalone-regression-end-rt\"\nmon = Monitor(endpoint_name)\n\n# Add data capture to the endpoint\nmon.add_data_capture(capture_percentage=100)\n\n# Create a baseline for monitoring\nmon.create_baseline()\n\n# Set up the monitoring schedule\nmon.create_monitoring_schedule(schedule=\"hourly\")\n
Pulling Information from an Existing Monitor
monitor_usage.py
from sageworks.api.monitor import Monitor\nfrom sageworks.api.endpoint import Endpoint\n\n# Construct a Monitor Class in one of Two Ways\nmon = Endpoint(\"abalone-regression-end-rt\").get_monitor()\nmon = Monitor(\"abalone-regression-end-rt\")\n\n# Check the summary and details of the monitoring class\nmon.summary()\nmon.details()\n\n# Check the baseline outputs (baseline, constraints, statistics)\nbase_df = mon.get_baseline()\nbase_df.head()\n\nconstraints_df = mon.get_constraints()\nconstraints_df.head()\n\nstatistics_df = mon.get_statistics()\nstatistics_df.head()\n\n# Get the latest data capture (inputs and outputs)\ninput_df, output_df = mon.get_latest_data_capture()\ninput_df.head()\noutput_df.head()\n
"},{"location":"api_classes/monitor/#sageworks-ui","title":"SageWorks UI","text":"Running these few lines of code creates and deploys an AWS Endpoint Monitor. The Monitor status and outputs can be viewed in the Sagemaker Console interfaces or in the SageWorks Dashboard UI. SageWorks will use the monitor to track various metrics including Data Quality, Model Bias, etc...
SageWorks Dashboard: EndpointsNot Finding a particular method?
The SageWorks API Classes use the 'Core' Classes Internally, so for an extensive listing of all the methods available please take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/overview/","title":"Overview","text":"Just Getting Started?
You're in the right place, the SageWorks API Classes are the best way to get started with SageWorks!
"},{"location":"api_classes/overview/#welcome-to-the-sageworks-api-classes","title":"Welcome to the SageWorks API Classes","text":"These classes provide high-level APIs for the SageWorks package, they enable your team to build full AWS Machine Learning Pipelines. They handle all the details around updating and managing a complex set of AWS Services. Each class provides an essential component of the overall ML Pipline. Simply combine the classes to build production ready, AWS powered, machine learning pipelines.
from sageworks.api.data_source import DataSource\nfrom sageworks.api.feature_set import FeatureSet\nfrom sageworks.api.model import Model, ModelType\nfrom sageworks.api.endpoint import Endpoint\n\n# Create the abalone_data DataSource\nds = DataSource(\"s3://sageworks-public-data/common/abalone.csv\")\n\n# Now create a FeatureSet\nds.to_features(\"abalone_features\")\n\n# Create the abalone_regression Model\nfs = FeatureSet(\"abalone_features\")\nfs.to_model(\n ModelType.REGRESSOR,\n name=\"abalone-regression\",\n target_column=\"class_number_of_rings\",\n tags=[\"abalone\", \"regression\"],\n description=\"Abalone Regression Model\",\n)\n\n# Create the abalone_regression Endpoint\nmodel = Model(\"abalone-regression\")\nmodel.to_endpoint(name=\"abalone-regression-end\", tags=[\"abalone\", \"regression\"])\n\n# Now we'll run inference on the endpoint\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Get a DataFrame of data (not used to train) and run predictions\nathena_table = fs.view(\"training\").table\ndf = fs.query(f\"SELECT * FROM {athena_table} where training = FALSE\")\nresults = endpoint.predict(df)\nprint(results[[\"class_number_of_rings\", \"prediction\"]])\n
Output
Processing...\n class_number_of_rings prediction\n0 12 10.477794\n1 11 11.11835\n2 14 13.605763\n3 12 11.744759\n4 17 15.55189\n.. ... ...\n826 7 7.981503\n827 11 11.246113\n828 9 9.592911\n829 6 6.129388\n830 8 7.628252\n
Full AWS ML Pipeline Achievement Unlocked!
Bing! You just built and deployed a full AWS Machine Learning Pipeline. You can now use the SageWorks Dashboard web interface to inspect your AWS artifacts. A comprehensive set of Exploratory Data Analysis techniques and Model Performance Metrics are available for your entire team to review, inspect and interact with.
Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/ directory. For a full code listing of any example please visit our SageWorks Examples
Examples
Examples of using the ParameterStore class are listed at the bottom of this page: Examples.
ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore","title":"ParameterStore
","text":"ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.
Common Usage
params = ParameterStore()\n\n# List Parameters\nparams.list()\n\n['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n# Add Key\nparams.upsert(\"key\", \"value\")\nvalue = params.get(\"key\")\n\n# Add any data (lists, dictionaries, etc..)\nmy_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\nparams.upsert(\"my_data\", my_data)\n\n# Retrieve data\nreturn_value = params.get(\"my_data\")\npprint(return_value)\n\n{'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n# Delete parameters\nparams.delete(\"my_data\")\n
Source code in src/sageworks/api/parameter_store.py
class ParameterStore:\n \"\"\"ParameterStore: Manages SageWorks parameters in AWS Systems Manager Parameter Store.\n\n Common Usage:\n ```python\n params = ParameterStore()\n\n # List Parameters\n params.list()\n\n ['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n # Add Key\n params.upsert(\"key\", \"value\")\n value = params.get(\"key\")\n\n # Add any data (lists, dictionaries, etc..)\n my_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\n params.upsert(\"my_data\", my_data)\n\n # Retrieve data\n return_value = params.get(\"my_data\")\n pprint(return_value)\n\n {'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n # Delete parameters\n params.delete(\"my_data\")\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"ParameterStore Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize a SageWorks Session (to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSSession().boto3_session\n\n # Create a Systems Manager (SSM) client for Parameter Store operations\n self.ssm_client = self.boto3_session.client(\"ssm\")\n\n def list(self) -> list:\n \"\"\"List all parameters in the AWS Parameter Store.\n\n Returns:\n list: A list of parameter names.\n \"\"\"\n try:\n # Set up parameters for our search\n params = {\"MaxResults\": 50}\n\n # Initialize the list to collect parameter names\n all_parameters = []\n\n # Make the initial call to describe parameters\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the initial response\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n # Continue to paginate if there's a NextToken\n while \"NextToken\" in response:\n # Update the parameters with the NextToken for subsequent calls\n params[\"NextToken\"] = response[\"NextToken\"]\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the subsequent 
responses\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n except Exception as e:\n self.log.error(f\"Failed to list parameters: {e}\")\n return []\n\n # Return the aggregated list of parameter names\n return all_parameters\n\n def get(self, name: str, warn: bool = True, decrypt: bool = True) -> Union[str, list, dict, None]:\n \"\"\"Retrieve a parameter value from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to retrieve.\n warn (bool): Whether to log a warning if the parameter is not found.\n decrypt (bool): Whether to decrypt secure string parameters.\n\n Returns:\n Union[str, list, dict, None]: The value of the parameter or None if not found.\n \"\"\"\n try:\n # Retrieve the parameter from Parameter Store\n response = self.ssm_client.get_parameter(Name=name, WithDecryption=decrypt)\n value = response[\"Parameter\"][\"Value\"]\n\n # Auto-detect and decompress if needed\n if value.startswith(\"COMPRESSED:\"):\n # Base64 decode and decompress\n self.log.important(f\"Decompressing parameter '{name}'...\")\n compressed_value = base64.b64decode(value[len(\"COMPRESSED:\") :])\n value = zlib.decompress(compressed_value).decode(\"utf-8\")\n\n # Attempt to parse the value back to its original type\n try:\n parsed_value = json.loads(value)\n return parsed_value\n except (json.JSONDecodeError, TypeError):\n # If parsing fails, return the value as is (assumed to be a simple string)\n return value\n\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] == \"ParameterNotFound\":\n if warn:\n self.log.warning(f\"Parameter '{name}' not found\")\n else:\n self.log.error(f\"Failed to get parameter '{name}': {e}\")\n return None\n\n def upsert(self, name: str, value, overwrite: bool = True):\n \"\"\"Insert or update a parameter in the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter.\n value (str | list | dict): The value of the parameter.\n overwrite (bool): Whether to overwrite an 
existing parameter (default: True)\n \"\"\"\n try:\n\n # Anything that's not a string gets converted to JSON\n if not isinstance(value, str):\n value = json.dumps(value)\n\n # Check size and compress if necessary\n if len(value) > 4096:\n self.log.warning(f\"Parameter {name} exceeds 4KB ({len(value)} Bytes) Compressing...\")\n compressed_value = zlib.compress(value.encode(\"utf-8\"), level=9)\n encoded_value = \"COMPRESSED:\" + base64.b64encode(compressed_value).decode(\"utf-8\")\n\n # Report on the size of the compressed value\n compressed_size = len(compressed_value)\n if compressed_size > 4096:\n doc_link = \"https://supercowpowers.github.io/sageworks/api_classes/df_store\"\n self.log.error(f\"Compressed size {compressed_size} bytes, cannot store > 4KB\")\n self.log.error(f\"For larger data use the DFStore() class ({doc_link})\")\n return\n\n # Insert or update the compressed parameter in Parameter Store\n try:\n # Insert or update the compressed parameter in Parameter Store\n self.ssm_client.put_parameter(Name=name, Value=encoded_value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully with compression.\")\n return\n except Exception as e:\n self.log.critical(f\"Failed to add/update compressed parameter '{name}': {e}\")\n raise\n\n # Insert or update the parameter normally if under 4KB\n self.ssm_client.put_parameter(Name=name, Value=value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully.\")\n\n except Exception as e:\n self.log.critical(f\"Failed to add/update parameter '{name}': {e}\")\n raise\n\n def delete(self, name: str):\n \"\"\"Delete a parameter from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to delete.\n \"\"\"\n try:\n # Delete the parameter from Parameter Store\n self.ssm_client.delete_parameter(Name=name)\n self.log.info(f\"Parameter '{name}' deleted successfully.\")\n except Exception as e:\n 
self.log.error(f\"Failed to delete parameter '{name}': {e}\")\n\n def __repr__(self):\n \"\"\"Return a string representation of the ParameterStore object.\"\"\"\n return \"\\n\".join(self.list())\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.__init__","title":"__init__()
","text":"ParameterStore Init Method
Source code in src/sageworks/api/parameter_store.py
def __init__(self):\n \"\"\"ParameterStore Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Initialize a SageWorks Session (to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSSession().boto3_session\n\n # Create a Systems Manager (SSM) client for Parameter Store operations\n self.ssm_client = self.boto3_session.client(\"ssm\")\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.__repr__","title":"__repr__()
","text":"Return a string representation of the ParameterStore object.
Source code in src/sageworks/api/parameter_store.py
def __repr__(self):\n \"\"\"Return a string representation of the ParameterStore object.\"\"\"\n return \"\\n\".join(self.list())\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.delete","title":"delete(name)
","text":"Delete a parameter from the AWS Parameter Store.
Parameters:
name (str, required): The name of the parameter to delete.
Source code in src/sageworks/api/parameter_store.py
def delete(self, name: str):\n \"\"\"Delete a parameter from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to delete.\n \"\"\"\n try:\n # Delete the parameter from Parameter Store\n self.ssm_client.delete_parameter(Name=name)\n self.log.info(f\"Parameter '{name}' deleted successfully.\")\n except Exception as e:\n self.log.error(f\"Failed to delete parameter '{name}': {e}\")\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.get","title":"get(name, warn=True, decrypt=True)
","text":"Retrieve a parameter value from the AWS Parameter Store.
Parameters:
name (str, required): The name of the parameter to retrieve.
warn (bool, default True): Whether to log a warning if the parameter is not found.
decrypt (bool, default True): Whether to decrypt secure string parameters.
Returns:
Union[str, list, dict, None]: The value of the parameter or None if not found.
Source code in src/sageworks/api/parameter_store.py
def get(self, name: str, warn: bool = True, decrypt: bool = True) -> Union[str, list, dict, None]:\n \"\"\"Retrieve a parameter value from the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter to retrieve.\n warn (bool): Whether to log a warning if the parameter is not found.\n decrypt (bool): Whether to decrypt secure string parameters.\n\n Returns:\n Union[str, list, dict, None]: The value of the parameter or None if not found.\n \"\"\"\n try:\n # Retrieve the parameter from Parameter Store\n response = self.ssm_client.get_parameter(Name=name, WithDecryption=decrypt)\n value = response[\"Parameter\"][\"Value\"]\n\n # Auto-detect and decompress if needed\n if value.startswith(\"COMPRESSED:\"):\n # Base64 decode and decompress\n self.log.important(f\"Decompressing parameter '{name}'...\")\n compressed_value = base64.b64decode(value[len(\"COMPRESSED:\") :])\n value = zlib.decompress(compressed_value).decode(\"utf-8\")\n\n # Attempt to parse the value back to its original type\n try:\n parsed_value = json.loads(value)\n return parsed_value\n except (json.JSONDecodeError, TypeError):\n # If parsing fails, return the value as is (assumed to be a simple string)\n return value\n\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] == \"ParameterNotFound\":\n if warn:\n self.log.warning(f\"Parameter '{name}' not found\")\n else:\n self.log.error(f\"Failed to get parameter '{name}': {e}\")\n return None\n
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.list","title":"list()
","text":"List all parameters in the AWS Parameter Store.
Returns:
list: A list of parameter names.
Source code in src/sageworks/api/parameter_store.py
def list(self) -> list:\n \"\"\"List all parameters in the AWS Parameter Store.\n\n Returns:\n list: A list of parameter names.\n \"\"\"\n try:\n # Set up parameters for our search\n params = {\"MaxResults\": 50}\n\n # Initialize the list to collect parameter names\n all_parameters = []\n\n # Make the initial call to describe parameters\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the initial response\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n # Continue to paginate if there's a NextToken\n while \"NextToken\" in response:\n # Update the parameters with the NextToken for subsequent calls\n params[\"NextToken\"] = response[\"NextToken\"]\n response = self.ssm_client.describe_parameters(**params)\n\n # Aggregate the names from the subsequent responses\n all_parameters.extend(param[\"Name\"] for param in response[\"Parameters\"])\n\n except Exception as e:\n self.log.error(f\"Failed to list parameters: {e}\")\n return []\n\n # Return the aggregated list of parameter names\n return all_parameters\n
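The NextToken loop in the listing above can be tried without an AWS account. The sketch below is only an illustration: FakeSSMClient and list_parameter_names are hypothetical names, and the fake client merely imitates the response shape (Parameters, NextToken) that the listing relies on.

```python
class FakeSSMClient:
    """Hypothetical stand-in for the SSM client; serves parameter names in pages."""

    def __init__(self, names, page_size=2):
        self.names = names
        self.page_size = page_size

    def describe_parameters(self, MaxResults=50, NextToken=None):
        # The token is simply the index of the next unread name
        start = int(NextToken) if NextToken else 0
        end = start + self.page_size
        response = {'Parameters': [{'Name': n} for n in self.names[start:end]]}
        if end < len(self.names):
            response['NextToken'] = str(end)
        return response


def list_parameter_names(client):
    """Collect every parameter name by following NextToken until it is absent."""
    params = {'MaxResults': 50}
    names = []
    while True:
        response = client.describe_parameters(**params)
        names.extend(p['Name'] for p in response['Parameters'])
        if 'NextToken' not in response:
            return names
        params['NextToken'] = response['NextToken']


client = FakeSSMClient(['/sageworks/a', '/sageworks/b', '/sageworks/c'])
all_names = list_parameter_names(client)
```

With a page size of 2 the fake client needs two calls to return all three names, which exercises the same continuation path as the real method.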
"},{"location":"api_classes/parameter_store/#sageworks.api.parameter_store.ParameterStore.upsert","title":"upsert(name, value, overwrite=True)
","text":"Insert or update a parameter in the AWS Parameter Store.
Parameters:
name (str, required): The name of the parameter.
value (str | list | dict, required): The value of the parameter.
overwrite (bool, default True): Whether to overwrite an existing parameter.
Source code in src/sageworks/api/parameter_store.py
def upsert(self, name: str, value, overwrite: bool = True):\n \"\"\"Insert or update a parameter in the AWS Parameter Store.\n\n Args:\n name (str): The name of the parameter.\n value (str | list | dict): The value of the parameter.\n overwrite (bool): Whether to overwrite an existing parameter (default: True)\n \"\"\"\n try:\n\n # Anything that's not a string gets converted to JSON\n if not isinstance(value, str):\n value = json.dumps(value)\n\n # Check size and compress if necessary\n if len(value) > 4096:\n self.log.warning(f\"Parameter {name} exceeds 4KB ({len(value)} Bytes) Compressing...\")\n compressed_value = zlib.compress(value.encode(\"utf-8\"), level=9)\n encoded_value = \"COMPRESSED:\" + base64.b64encode(compressed_value).decode(\"utf-8\")\n\n # Report on the size of the compressed value\n compressed_size = len(compressed_value)\n if compressed_size > 4096:\n doc_link = \"https://supercowpowers.github.io/sageworks/api_classes/df_store\"\n self.log.error(f\"Compressed size {compressed_size} bytes, cannot store > 4KB\")\n self.log.error(f\"For larger data use the DFStore() class ({doc_link})\")\n return\n\n # Insert or update the compressed parameter in Parameter Store\n try:\n # Insert or update the compressed parameter in Parameter Store\n self.ssm_client.put_parameter(Name=name, Value=encoded_value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully with compression.\")\n return\n except Exception as e:\n self.log.critical(f\"Failed to add/update compressed parameter '{name}': {e}\")\n raise\n\n # Insert or update the parameter normally if under 4KB\n self.ssm_client.put_parameter(Name=name, Value=value, Type=\"String\", Overwrite=overwrite)\n self.log.info(f\"Parameter '{name}' added/updated successfully.\")\n\n except Exception as e:\n self.log.critical(f\"Failed to add/update parameter '{name}': {e}\")\n raise\n
"},{"location":"api_classes/parameter_store/#bypassing-the-4k-limit","title":"Bypassing the 4k Limit","text":"AWS Parameter Storage has a 4k limit on values, the SageWorks class bypasses this limit by detecting large values (strings, data, whatever) and compressing those on the fly. The decompressing is also handled automatically, so for larger data simply use the add()
and get()
methods and it will all just work.
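The compress-on-write and decompress-on-read behavior can be sketched with the standard library alone. This is a minimal illustration, not the SageWorks implementation: encode_value and decode_value are hypothetical helper names, while the COMPRESSED: prefix and the 4096 threshold are taken from the upsert() and get() listings on this page. Because base64 output is roughly 4/3 the size of its input, this sketch checks the length of the final encoded string.

```python
import base64
import json
import zlib

PREFIX = 'COMPRESSED:'  # marker that appears in the upsert()/get() listings
LIMIT = 4096            # size threshold that appears in the upsert() listing


def encode_value(value):
    """Hypothetical helper: serialize a value and compress it if it is large."""
    text = value if isinstance(value, str) else json.dumps(value)
    if len(text) <= LIMIT:
        return text
    packed = zlib.compress(text.encode('utf-8'), level=9)
    encoded = PREFIX + base64.b64encode(packed).decode('utf-8')
    if len(encoded) > LIMIT:
        raise ValueError(f'still {len(encoded)} characters after compression')
    return encoded


def decode_value(stored):
    """Hypothetical helper: undo encode_value, falling back to the raw string."""
    if stored.startswith(PREFIX):
        packed = base64.b64decode(stored[len(PREFIX):])
        stored = zlib.decompress(packed).decode('utf-8')
    try:
        return json.loads(stored)
    except json.JSONDecodeError:
        return stored


# Highly repetitive data: large as JSON text, small once compressed
big = {'values': [1, 2, 3] * 2000}
stored = encode_value(big)
restored = decode_value(stored)
```

Small values pass through unchanged, so callers never need to know whether a given parameter was compressed.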
These examples show how to use the ParameterStore() class to list, add, and get parameters from the AWS Parameter Store service.
SageWorks REPL
If you'd like to experiment with listing, adding, and getting data with the ParameterStore() class, you can spin up the SageWorks REPL, use the class, and test out all the methods. Try it out! SageWorks REPL
from pprint import pprint\n\nfrom sageworks.api.parameter_store import ParameterStore\n\nparams = ParameterStore()\n\n# List Parameters\nparams.list()\n\n['/sageworks/abalone_info',\n '/sageworks/my_data',\n '/sageworks/test',\n '/sageworks/pipelines/my_pipeline']\n\n# Add Key\nparams.upsert(\"key\", \"value\")\nvalue = params.get(\"key\")\n\n# Add any data (lists, dictionaries, etc..)\nmy_data = {\"key\": \"value\", \"number\": 4.2, \"list\": [1,2,3]}\nparams.upsert(\"my_data\", my_data)\n\n# Retrieve data\nreturn_value = params.get(\"my_data\")\npprint(return_value)\n\n{'key': 'value', 'list': [1, 2, 3], 'number': 4.2}\n\n# Delete parameters\nparams.delete(\"my_data\")\n
list() not showing ALL parameters?
If you want access to ALL the parameters in the parameter store, set prefix=None and everything will show up.
params = ParameterStore(prefix=None)\nparams.list()\n<all the keys>\n
"},{"location":"api_classes/pipelines/","title":"Pipelines","text":"Pipeline Examples
Examples of using the Pipeline classes are listed at the bottom of this page: Examples.
Pipelines store sequences of SageWorks transforms. So if you have a nightly ML workflow you can capture that as a Pipeline. Here's an example pipeline:
nightly_sol_pipeline_v1.json
{\n \"data_source\": {\n \"name\": \"nightly_data\",\n \"tags\": [\"solubility\", \"foo\"],\n \"s3_input\": \"s3://blah/blah.csv\"\n },\n \"feature_set\": {\n \"name\": \"nightly_features\",\n \"tags\": [\"blah\", \"blah\"],\n \"input\": \"nightly_data\",\n \"schema\": \"mol_descriptors_v1\"\n },\n \"model\": {\n \"name\": \"nightly_model\",\n \"tags\": [\"blah\", \"blah\"],\n \"features\": [\"col1\", \"col2\"],\n \"target\": \"sol\",\n \"input\": \"nightly_features\"\n },\n \"endpoint\": {\n ...\n} \n
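Since each stage in a pipeline names the stage it reads from, a quick consistency check is easy to sketch. The function below is a hypothetical helper, not part of SageWorks: it assumes the stage keys and the name/input fields shown in the example above, and it only looks at stages that are present.

```python
STAGE_ORDER = ['data_source', 'feature_set', 'model', 'endpoint']


def check_pipeline(pipeline):
    """Hypothetical check: each stage's input should name the previous stage."""
    problems = []
    previous = None
    for stage in STAGE_ORDER:
        if stage not in pipeline:
            continue
        spec = pipeline[stage]
        actual = spec.get('input')
        # The first stage reads from S3, so only later stages are compared
        if previous is not None and actual != previous:
            problems.append(f'{stage} expects input {previous!r}, got {actual!r}')
        previous = spec['name']
    return problems


pipeline = {
    'data_source': {'name': 'nightly_data', 's3_input': 's3://blah/blah.csv'},
    'feature_set': {'name': 'nightly_features', 'input': 'nightly_data'},
    'model': {'name': 'nightly_model', 'input': 'nightly_features'},
}
problems = check_pipeline(pipeline)

# Same pipeline with the model pointed at a feature set that does not exist
broken = dict(pipeline, model={'name': 'nightly_model', 'input': 'wrong_features'})
broken_problems = check_pipeline(broken)
```

An empty result means every stage is wired to its predecessor; a mismatch yields one message per offending stage.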
PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Pipeline: Manages the details around a SageWorks Pipeline, including Execution
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager","title":"PipelineManager
","text":"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.
Common Usage
my_manager = PipelineManager()\nmy_manager.list_pipelines()\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Source code in src/sageworks/api/pipeline_manager.py
class PipelineManager:\n \"\"\"PipelineManager: Manages SageWorks Pipelines, listing, creating, and saving them.\n\n Common Usage:\n ```python\n my_manager = PipelineManager()\n my_manager.list_pipelines()\n abalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n my_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n ```\n \"\"\"\n\n def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto3_session.client(\"s3\")\n\n def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.important(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n\n # Create a new 
Pipeline from an Endpoint\n def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n\n # Publish a Pipeline to SageWorks\n def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n\n def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n\n # Save a Pipeline to a local file\n def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n 
pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n\n def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n\n def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.__init__","title":"__init__()
","text":"Pipeline Init Method
Source code in src/sageworks/api/pipeline_manager.py
def __init__(self):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for Pipelines\n self.bucket = self.sageworks_bucket\n self.prefix = \"pipelines/\"\n self.pipelines_s3_path = f\"s3://{self.sageworks_bucket}/pipelines/\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n\n # Read all the Pipelines from this S3 path\n self.s3_client = self.boto3_session.client(\"s3\")\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.create_from_endpoint","title":"create_from_endpoint(endpoint_name)
","text":"Create a Pipeline from an Endpoint
Parameters:
endpoint_name (str, required): The name of the Endpoint
Returns:
dict: A dictionary of the Pipeline
Source code in src/sageworks/api/pipeline_manager.py
def create_from_endpoint(self, endpoint_name: str) -> dict:\n \"\"\"Create a Pipeline from an Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: A dictionary of the Pipeline\n \"\"\"\n self.log.important(f\"Creating Pipeline from Endpoint: {endpoint_name}...\")\n pipeline = {}\n endpoint = Endpoint(endpoint_name)\n model = Model(endpoint.get_input())\n feature_set = FeatureSet(model.get_input())\n data_source = DataSource(feature_set.get_input())\n s3_source = data_source.get_input()\n for name in [\"data_source\", \"feature_set\", \"model\", \"endpoint\"]:\n artifact = locals()[name]\n pipeline[name] = {\"name\": artifact.uuid, \"tags\": artifact.get_tags(), \"input\": artifact.get_input()}\n if name == \"model\":\n pipeline[name][\"model_type\"] = artifact.model_type.value\n pipeline[name][\"target_column\"] = artifact.target()\n pipeline[name][\"feature_list\"] = artifact.features()\n\n # Return the Pipeline\n return pipeline\n
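create_from_endpoint() works backwards: it starts at the Endpoint and repeatedly asks each artifact for its input until it reaches the original S3 source. The sketch below illustrates that walk with plain Python objects. Artifact and trace_lineage are hypothetical stand-ins, not SageWorks classes, and the artifact names are borrowed from the abalone example for illustration only.

```python
class Artifact:
    """Hypothetical stand-in for an artifact: a name plus the name of its input."""

    def __init__(self, name, input_name):
        self.name = name
        self.input_name = input_name

    def get_input(self):
        return self.input_name


def trace_lineage(start, registry):
    """Follow get_input() links from an artifact back to its original source."""
    chain = [start]
    current = start
    # Stop once the current name is no longer a known artifact (e.g. an S3 path)
    while current in registry:
        current = registry[current].get_input()
        chain.append(current)
    return chain


registry = {
    'abalone-regression-end': Artifact('abalone-regression-end', 'abalone-regression'),
    'abalone-regression': Artifact('abalone-regression', 'abalone_features'),
    'abalone_features': Artifact('abalone_features', 'abalone_data'),
    'abalone_data': Artifact('abalone_data', 's3://sageworks-public-data/common/abalone.csv'),
}
lineage = trace_lineage('abalone-regression-end', registry)
```

The resulting list runs from the endpoint to the S3 path, which is the reverse of the order in which the stages appear in a pipeline file.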
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.delete_pipeline","title":"delete_pipeline(name)
","text":"Delete a Pipeline from S3
Parameters:
name (str, required): The name of the Pipeline to delete
Source code in src/sageworks/api/pipeline_manager.py
def delete_pipeline(self, name: str):\n \"\"\"Delete a Pipeline from S3\n\n Args:\n name (str): The name of the Pipeline to delete\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Deleting {name} from S3: {self.bucket}/{key}...\")\n\n # Delete the pipeline object from S3\n self.s3_client.delete_object(Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.list_pipelines","title":"list_pipelines()
","text":"List all the Pipelines in the S3 Bucket
Returns:
list: A list of Pipeline names and details
Source code in src/sageworks/api/pipeline_manager.py
def list_pipelines(self) -> list:\n \"\"\"List all the Pipelines in the S3 Bucket\n\n Returns:\n list: A list of Pipeline names and details\n \"\"\"\n # List objects using the S3 client\n response = self.s3_client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)\n\n # Check if there are objects\n if \"Contents\" in response:\n # Process the list of dictionaries (we only need the filename, the LastModified, and the Size)\n pipelines = [\n {\n \"name\": pipeline[\"Key\"].split(\"/\")[-1].replace(\".json\", \"\"),\n \"last_modified\": pipeline[\"LastModified\"],\n \"size\": pipeline[\"Size\"],\n }\n for pipeline in response[\"Contents\"]\n ]\n return pipelines\n else:\n self.log.important(f\"No pipelines found at {self.pipelines_s3_path}...\")\n return []\n
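The reshaping step in list_pipelines() can be exercised on a hand-built response. The sketch below is an illustration under assumptions: summarize_pipelines is a hypothetical helper, the response dict only imitates the Contents entries (Key, LastModified, Size) used in the listing, and the filter on the .json suffix is an addition of this sketch to skip a bare folder placeholder key.

```python
from datetime import datetime, timezone


def summarize_pipelines(response):
    """Reduce a list_objects_v2-style response to name, last_modified, and size."""
    return [
        {
            'name': obj['Key'].split('/')[-1].replace('.json', ''),
            'last_modified': obj['LastModified'],
            'size': obj['Size'],
        }
        for obj in response.get('Contents', [])
        if obj['Key'].endswith('.json')  # sketch-only filter, not in the listing
    ]


stamp = datetime(2024, 1, 1, tzinfo=timezone.utc)
fake_response = {
    'Contents': [
        {'Key': 'pipelines/', 'LastModified': stamp, 'Size': 0},
        {'Key': 'pipelines/abalone_pipeline_v1.json', 'LastModified': stamp, 'Size': 512},
    ]
}
summary = summarize_pipelines(fake_response)
```

A single list_objects_v2 call returns at most 1000 keys, so a bucket with more pipelines than that would also need continuation-token handling.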
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.load_pipeline_from_file","title":"load_pipeline_from_file(filepath)
","text":"Load a Pipeline from a local file
Parameters:
filepath (str, required): The path of the Pipeline to load
Returns:
dict: The Pipeline loaded from the file
Source code in src/sageworks/api/pipeline_manager.py
def load_pipeline_from_file(self, filepath: str) -> dict:\n \"\"\"Load a Pipeline from a local file\n\n Args:\n filepath (str): The path of the Pipeline to load\n\n Returns:\n dict: The Pipeline loaded from the file\n \"\"\"\n\n # Load a pipeline as a local JSON file\n with open(filepath, \"r\") as fp:\n pipeline = json.load(fp)\n return pipeline\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline","title":"publish_pipeline(name, pipeline)
","text":"Save a Pipeline to S3
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| name | str | The name of the Pipeline | required |
| pipeline | dict | The Pipeline to save | required |
Source code in src/sageworks/api/pipeline_manager.py
def publish_pipeline(self, name: str, pipeline: dict):\n \"\"\"Save a Pipeline to S3\n\n Args:\n name (str): The name of the Pipeline\n pipeline (dict): The Pipeline to save\n \"\"\"\n key = f\"{self.prefix}{name}.json\"\n self.log.important(f\"Saving {name} to S3: {self.bucket}/{key}...\")\n\n # Save the pipeline as an S3 JSON object\n self.s3_client.put_object(Body=json.dumps(pipeline, indent=4), Bucket=self.bucket, Key=key)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.publish_pipeline_from_file","title":"publish_pipeline_from_file(filepath)
","text":"Publish a Pipeline to SageWorks from a local file
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| filepath | str | The path of the Pipeline to publish | required |
Source code in src/sageworks/api/pipeline_manager.py
def publish_pipeline_from_file(self, filepath: str):\n \"\"\"Publish a Pipeline to SageWorks from a local file\n\n Args:\n filepath (str): The path of the Pipeline to publish\n \"\"\"\n\n # Load a pipeline as a local JSON file\n pipeline = self.load_pipeline_from_file(filepath)\n\n # Get the pipeline name\n pipeline_name = filepath.split(\"/\")[-1].replace(\".json\", \"\")\n\n # Publish the Pipeline\n self.publish_pipeline(pipeline_name, pipeline)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline_manager.PipelineManager.save_pipeline_to_file","title":"save_pipeline_to_file(pipeline, filepath)
","text":"Save a Pipeline to a local file
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | dict | The Pipeline to save | required |
| filepath | str | The path to save the Pipeline | required |
Source code in src/sageworks/api/pipeline_manager.py
def save_pipeline_to_file(self, pipeline: dict, filepath: str):\n \"\"\"Save a Pipeline to a local file\n\n Args:\n pipeline (dict): The Pipeline to save\n filepath (str): The path to save the Pipeline\n \"\"\"\n\n # Sanity check the filepath\n if not filepath.endswith(\".json\"):\n filepath += \".json\"\n\n # Save the pipeline as a local JSON file\n with open(filepath, \"w\") as fp:\n json.dump(pipeline, fp, indent=4)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline","title":"Pipeline
","text":"Pipeline: SageWorks Pipeline API Class
Common Usage
my_pipeline = Pipeline(\"name\")\nmy_pipeline.details()\nmy_pipeline.execute() # Execute entire pipeline\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
Source code in src/sageworks/api/pipeline.py
class Pipeline:\n \"\"\"Pipeline: SageWorks Pipeline API Class\n\n Common Usage:\n ```python\n my_pipeline = Pipeline(\"name\")\n my_pipeline.details()\n my_pipeline.execute() # Execute entire pipeline\n my_pipeline.execute_partial([\"data_source\", \"feature_set\"])\n my_pipeline.execute_partial([\"model\", \"endpoint\"])\n ```\n \"\"\"\n\n def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n self.s3_client = self.boto3_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n\n def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the input for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n 
self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n\n def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n\n def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n\n def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n\n def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n wr.s3.delete_objects(self.s3_path)\n\n def _get_pipeline(self) -> dict:\n \"\"\"Internal: Get the pipeline as a JSON object from the specified S3 bucket and key.\"\"\"\n response = self.s3_client.get_object(Bucket=self.bucket, Key=self.key)\n json_object = json.loads(response[\"Body\"].read())\n return json_object\n\n def 
__repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__init__","title":"__init__(name)
","text":"Pipeline Init Method
Source code in src/sageworks/api/pipeline.py
def __init__(self, name: str):\n \"\"\"Pipeline Init Method\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.name = name\n\n # Grab our SageWorks Bucket from Config\n self.cm = ConfigManager()\n self.sageworks_bucket = self.cm.get_config(\"SAGEWORKS_BUCKET\")\n if self.sageworks_bucket is None:\n self.log = logging.getLogger(\"sageworks\")\n self.log.critical(\"Could not find ENV var for SAGEWORKS_BUCKET!\")\n sys.exit(1)\n\n # Set the S3 Path for this Pipeline\n self.bucket = self.sageworks_bucket\n self.key = f\"pipelines/{self.name}.json\"\n self.s3_path = f\"s3://{self.bucket}/{self.key}\"\n\n # Grab a SageWorks Session (this allows us to assume the SageWorks ExecutionRole)\n self.boto3_session = AWSAccountClamp().boto3_session\n self.s3_client = self.boto3_session.client(\"s3\")\n\n # If this S3 Path exists, load the Pipeline\n if wr.s3.does_object_exist(self.s3_path):\n self.pipeline = self._get_pipeline()\n else:\n self.log.warning(f\"Pipeline {self.name} not found at {self.s3_path}\")\n self.pipeline = None\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.__repr__","title":"__repr__()
","text":"String representation of this pipeline
Returns:
| Name | Type | Description |
|------|------|-------------|
| str | str | String representation of this pipeline |
Source code in src/sageworks/api/pipeline.py
def __repr__(self) -> str:\n \"\"\"String representation of this pipeline\n\n Returns:\n str: String representation of this pipeline\n \"\"\"\n # Class name and details\n class_name = self.__class__.__name__\n pipeline_details = json.dumps(self.pipeline, indent=4)\n return f\"{class_name}({pipeline_details})\"\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.delete","title":"delete()
","text":"Pipeline Deletion
Source code in src/sageworks/api/pipeline.py
def delete(self):\n \"\"\"Pipeline Deletion\"\"\"\n self.log.info(f\"Deleting Pipeline: {self.name}...\")\n wr.s3.delete_objects(self.s3_path)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute","title":"execute()
","text":"Execute the entire Pipeline
Raises:
| Type | Description |
|------|-------------|
| RunTimeException | If the pipeline execution fails in any way |
Source code in src/sageworks/api/pipeline.py
def execute(self):\n \"\"\"Execute the entire Pipeline\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute()\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.execute_partial","title":"execute_partial(subset)
","text":"Execute a partial Pipeline
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| subset | list | A subset of the pipeline to execute | required |
Raises:
| Type | Description |
|------|-------------|
| RunTimeException | If the pipeline execution fails in any way |
Source code in src/sageworks/api/pipeline.py
def execute_partial(self, subset: list):\n \"\"\"Execute a partial Pipeline\n\n Args:\n subset (list): A subset of the pipeline to execute\n\n Raises:\n RunTimeException: If the pipeline execution fails in any way\n \"\"\"\n pipeline_executor = PipelineExecutor(self)\n pipeline_executor.execute_partial(subset)\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.report_settable_fields","title":"report_settable_fields(pipeline={}, path='')
","text":"Recursively finds and prints keys with settable fields in a JSON-like dictionary.
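Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | dict | The pipeline (or sub-pipeline) to process | {} |
| path | str | Current path to the key, used for nested dictionaries | '' |
Source code in src/sageworks/api/pipeline.py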
def report_settable_fields(self, pipeline: dict = {}, path: str = \"\") -> None:\n \"\"\"\n Recursively finds and prints keys with settable fields in a JSON-like dictionary.\n\n Args:\n pipeline (dict): pipeline (or sub pipeline) to process.\n path (str): Current path to the key, used for nested dictionaries.\n \"\"\"\n # Grab the entire pipeline if not provided (first call)\n if not pipeline:\n self.log.important(f\"Checking Pipeline: {self.name}...\")\n pipeline = self.pipeline\n for key, value in pipeline.items():\n if isinstance(value, dict):\n # Recurse into sub-dictionary\n self.report_settable_fields(value, path + key + \" -> \")\n elif isinstance(value, str) and value.startswith(\"<<\") and value.endswith(\">>\"):\n # Check if required or optional\n required = \"[Required]\" if \"required\" in value else \"[Optional]\"\n self.log.important(f\"{required} Path: {path + key}\")\n
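The same recursive walk can be sketched outside the class as a function that returns the settable paths instead of logging them. This is a hedged illustration: the find_settable_fields name and the placeholder strings in the sample pipeline are assumptions; only the '<<...>>' convention and the 'required' substring check come from the source above.

```python
def find_settable_fields(pipeline: dict, path: str = '') -> list:
    '''Collect (path, is_required) pairs for every '<<...>>' placeholder value.'''
    found = []
    for key, value in pipeline.items():
        if isinstance(value, dict):
            # Recurse into the sub-dictionary, extending the path prefix
            found.extend(find_settable_fields(value, path + key + ' -> '))
        elif isinstance(value, str) and value.startswith('<<') and value.endswith('>>'):
            # A placeholder counts as required when it contains the word 'required'
            found.append((path + key, 'required' in value))
    return found


# Hypothetical pipeline with two placeholders (illustrative values only)
example_pipeline = {
    'data_source': {'name': 'abalone_data', 'input': '<<parameter_required>>'},
    'feature_set': {'name': 'abalone_features', 'holdout_ids': '<<parameter_optional>>'},
}
settable = find_settable_fields(example_pipeline)
for field_path, is_required in settable:
    label = '[Required]' if is_required else '[Optional]'
    print(label, field_path)
```

Returning a list rather than logging makes the result easy to check in a test or to feed into a validation step before calling execute().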
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_input","title":"set_input(input, artifact='data_source')
","text":"Set the input for the Pipeline
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| input | Union[str, DataFrame] | The input for the Pipeline | required |
| artifact | str | The artifact to set the input for | 'data_source' |
Source code in src/sageworks/api/pipeline.py
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n
"},{"location":"api_classes/pipelines/#sageworks.api.pipeline.Pipeline.set_training_holdouts","title":"set_training_holdouts(id_column, holdout_ids)
","text":"Set the training holdouts for the Pipeline
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| id_column | str | The column name of the unique identifier | required |
| holdout_ids | list[str] | The list of unique identifiers to hold out | required |
Source code in src/sageworks/api/pipeline.py
def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the training holdouts for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
"},{"location":"api_classes/pipelines/#examples","title":"Examples","text":"Make a Pipeline
Pipelines are just JSON files (see sageworks/examples/pipelines/). You can copy one and adapt it to your objects and use case, or, if you already have a set of SageWorks artifacts, you can 'backtrack' from the Endpoint and have SageWorks create the Pipeline for you.
from pprint import pprint\n\nfrom sageworks.api.pipeline_manager import PipelineManager\n\n# Create a PipelineManager\nmy_manager = PipelineManager()\n\n# List the Pipelines\npprint(my_manager.list_pipelines())\n\n# Create a Pipeline from an Endpoint\nabalone_pipeline = my_manager.create_from_endpoint(\"abalone-regression-end\")\n\n# Publish the Pipeline\nmy_manager.publish_pipeline(\"abalone_pipeline_v1\", abalone_pipeline)\n
Output
Listing Pipelines...\n[{'last_modified': datetime.datetime(2024, 4, 16, 21, 10, 6, tzinfo=tzutc()),\n 'name': 'abalone_pipeline_v1',\n 'size': 445}]\n
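Because Pipelines are plain JSON, the local-file helpers on PipelineManager (save_pipeline_to_file and load_pipeline_from_file) amount to a json round trip. A minimal standalone sketch of that round trip, assuming nothing beyond the standard library (the save_pipeline and load_pipeline names and the sample pipeline are illustrative, not SageWorks APIs):

```python
import json
import os
import tempfile


def save_pipeline(pipeline: dict, filepath: str) -> str:
    '''Write the pipeline as indented JSON, appending .json when it is missing.'''
    if not filepath.endswith('.json'):
        filepath += '.json'
    with open(filepath, 'w') as fp:
        json.dump(pipeline, fp, indent=4)
    return filepath


def load_pipeline(filepath: str) -> dict:
    '''Read a pipeline back from a local JSON file.'''
    with open(filepath, 'r') as fp:
        return json.load(fp)


# Illustrative pipeline content (not a complete SageWorks pipeline)
example_pipeline = {
    'data_source': {'name': 'abalone_data', 'tags': ['abalone_data']},
    'feature_set': {'name': 'abalone_features', 'input': 'abalone_data'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_pipeline(example_pipeline, os.path.join(tmp_dir, 'abalone_pipeline_v1'))
    reloaded = load_pipeline(saved_path)
    # The published name is derived from the file name, minus the extension
    pipeline_name = os.path.basename(saved_path).replace('.json', '')

print(pipeline_name, reloaded == example_pipeline)
```

Keeping a pipeline in a local file like this makes it easy to review or version it before publishing it to S3.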
Pipeline Details
pipeline_details.py
from pprint import pprint\n\nfrom sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\npprint(my_pipeline.details())\n
Output
{\n \"name\": \"abalone_pipeline_v1\",\n \"s3_path\": \"s3://sandbox/pipelines/abalone_pipeline_v1.json\",\n \"pipeline\": {\n \"data_source\": {\n \"name\": \"abalone_data\",\n \"tags\": [\n \"abalone_data\"\n ],\n \"input\": \"/Users/briford/work/sageworks/data/abalone.csv\"\n },\n \"feature_set\": {\n \"name\": \"abalone_features\",\n \"tags\": [\n \"abalone_features\"\n ],\n \"input\": \"abalone_data\"\n },\n \"model\": {\n \"name\": \"abalone-regression\",\n \"tags\": [\n \"abalone\",\n \"regression\"\n ],\n \"input\": \"abalone_features\"\n },\n ...\n }\n}\n
Pipeline Execution
Executing a Pipeline is the main reason for creating one. It gives you a reproducible way to capture, inspect, and run the same ML pipeline on different data (for example, nightly).
pipeline_execution.py
from sageworks.api.pipeline import Pipeline\n\n# Retrieve an existing Pipeline\nmy_pipeline = Pipeline(\"abalone_pipeline_v1\")\n\n# Execute the Pipeline\nmy_pipeline.execute() # Full execution\n\n# Partial executions\nmy_pipeline.execute_partial([\"data_source\", \"feature_set\"])\nmy_pipeline.execute_partial([\"model\", \"endpoint\"])\n
"},{"location":"api_classes/pipelines/#pipelines-advanced","title":"Pipelines Advanced","text":"As part of the flexible architecture sometimes DataSources or FeatureSets can be created with a Pandas DataFrame. To support a DataFrame as input to a pipeline we can call the set_input()
method on the pipeline object. To hold rows out of training, call set_training_holdouts()
with the name of the id column and a list of ids.
def set_input(self, input: Union[str, pd.DataFrame], artifact: str = \"data_source\"):\n \"\"\"Set the input for the Pipeline\n\n Args:\n input (Union[str, pd.DataFrame]): The input for the Pipeline\n artifact (str): The artifact to set the input for (default: \"data_source\")\n \"\"\"\n self.pipeline[artifact][\"input\"] = input\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the training holdouts for the Pipeline\n\n Args:\n id_column (str): The column name of the unique identifier\n holdout_ids (list[str]): The list of unique identifiers to hold out\n \"\"\"\n self.pipeline[\"feature_set\"][\"id_column\"] = id_column\n self.pipeline[\"feature_set\"][\"holdout_ids\"] = holdout_ids\n
Running a pipeline creates and deploys a set of SageWorks Artifacts: a DataSource, FeatureSet, Model, and Endpoint. These artifacts can be viewed in the SageMaker Console/Notebook interfaces or in the SageWorks Dashboard UI.
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all available methods, take a deep dive into: SageWorks Core Classes
"},{"location":"api_classes/views/","title":"Views","text":"View Examples
Examples of using the Views classes to extend the functionality of SageWorks Artifacts are in the Examples section at the bottom of this page.
Views are a powerful way to filter and augment your DataSources and FeatureSets. With Views you can subset columns, subset rows, and even add data to existing SageWorks Artifacts. If you want to compute outliers, run some statistics, or engineer some new features, Views are an easy way to change, modify, and add to DataSources and FeatureSets.
View: Read from a view (training, display, etc) for DataSources and FeatureSets.
"},{"location":"api_classes/views/#sageworks.core.views.view.View","title":"View
","text":"View: Read from a view (training, display, etc) for DataSources and FeatureSets.
Common Usage
# Grab the Display View for a DataSource\ndisplay_view = ds.view(\"display\")\nprint(display_view.columns)\n\n# Pull a DataFrame for the view\ndf = display_view.pull_dataframe()\n\n# Views also work with FeatureSets\ncomp_view = fs.view(\"computation\")\ncomp_df = comp_view.pull_dataframe()\n\n# Query the view with a custom SQL query\nquery = f\"SELECT * FROM {comp_view.table} WHERE age > 30\"\ndf = comp_view.query(query)\n\n# Delete the view\ncomp_view.delete()\n
Source code in src/sageworks/core/views/view.py
class View:\n \"\"\"View: Read from a view (training, display, etc) for DataSources and FeatureSets.\n\n Common Usage:\n ```python\n\n # Grab the Display View for a DataSource\n display_view = ds.view(\"display\")\n print(display_view.columns)\n\n # Pull a DataFrame for the view\n df = display_view.pull_dataframe()\n\n # Views also work with FeatureSets\n comp_view = fs.view(\"computation\")\n comp_df = comp_view.pull_dataframe()\n\n # Query the view with a custom SQL query\n query = f\"SELECT * FROM {comp_view.table} WHERE age > 30\"\n df = comp_view.query(query)\n\n # Delete the view\n comp_view.delete()\n ```\n \"\"\"\n\n # Class attributes\n log = logging.getLogger(\"sageworks\")\n meta = Meta()\n\n def __init__(self, artifact: Union[DataSource, FeatureSet], view_name: str, **kwargs):\n \"\"\"View Constructor: Retrieve a View for the given artifact\n\n Args:\n artifact (Union[DataSource, FeatureSet]): A DataSource or FeatureSet object\n view_name (str): The name of the view to retrieve (e.g. 
\"training\")\n \"\"\"\n\n # Set the view name\n self.view_name = view_name\n\n # Is this a DataSource or a FeatureSet?\n self.is_feature_set = isinstance(artifact, FeatureSetCore)\n self.auto_id_column = artifact.id_column if self.is_feature_set else None\n\n # Get the data_source from the artifact\n self.artifact_name = artifact.uuid\n self.data_source = artifact.data_source if self.is_feature_set else artifact\n self.database = self.data_source.database\n\n # Construct our base_table_name\n self.base_table_name = self.data_source.table\n\n # Check if the view should be auto created\n self.auto_created = False\n if kwargs.get(\"auto_create_view\", True) and not self.exists():\n\n # A direct double check before we auto-create\n if not self.exists(skip_cache=True):\n self.log.important(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist, attempting to auto-create...\"\n )\n self.auto_created = self._auto_create_view()\n\n # Check for failure of the auto-creation\n if not self.auto_created:\n self.log.error(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist and cannot be auto-created...\"\n )\n self.view_name = self.columns = self.column_types = self.source_table = self.base_table_name = None\n return\n\n # Now fill some details about the view\n self.columns, self.column_types, self.source_table, self.join_view = view_details(\n self.table, self.data_source.database, self.data_source.boto3_session\n )\n\n def pull_dataframe(self, limit: int = 50000, head: bool = False) -> Union[pd.DataFrame, None]:\n \"\"\"Pull a DataFrame based on the view type\n\n Args:\n limit (int): The maximum number of rows to pull (default: 50000)\n head (bool): Return just the head of the DataFrame (default: False)\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist\n \"\"\"\n\n # Pull the DataFrame\n if head:\n limit = 5\n pull_query = f'SELECT * FROM \"{self.table}\" LIMIT {limit}'\n df = 
self.data_source.query(pull_query)\n return df\n\n def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the view with a custom SQL query\n\n Args:\n query (str): The SQL query to execute\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist\n \"\"\"\n return self.data_source.query(query)\n\n def column_details(self) -> dict:\n \"\"\"Return a dictionary of the column names and types for this view\n\n Returns:\n dict: A dictionary of the column names and types\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n\n @property\n def table(self) -> str:\n \"\"\"Construct the view table name for the given view type\n\n Returns:\n str: The view table name\n \"\"\"\n if self.view_name is None:\n return None\n if self.view_name == \"base\":\n return self.base_table_name\n return f\"{self.base_table_name}_{self.view_name}\"\n\n def delete(self):\n \"\"\"Delete the database view (and supplemental data) if it exists.\"\"\"\n\n # List any supplemental tables for this data source\n supplemental_tables = list_supplemental_data_tables(self.base_table_name, self.database)\n for table in supplemental_tables:\n if self.view_name in table:\n self.log.important(f\"Deleting Supplemental Table {table}...\")\n delete_table(table, self.database, self.data_source.boto3_session)\n\n # Now drop the view\n self.log.important(f\"Dropping View {self.table}...\")\n drop_view_query = f'DROP VIEW \"{self.table}\"'\n\n # Execute the DROP VIEW query\n try:\n self.data_source.execute_statement(drop_view_query, silence_errors=True)\n except wr.exceptions.QueryFailed as e:\n if \"View not found\" in str(e):\n self.log.info(f\"View {self.table} not found, this is fine...\")\n else:\n raise\n\n # We want to do a small sleep so that AWS has time to catch up\n self.log.info(\"Sleeping for 3 seconds after dropping view to allow AWS to catch up...\")\n time.sleep(3)\n\n def exists(self, skip_cache: bool = False) -> bool:\n \"\"\"Check if 
the view exists in the database\n\n Args:\n skip_cache (bool): Skip the cache and check the database directly (default: False)\n Returns:\n bool: True if the view exists, False otherwise.\n \"\"\"\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # If we're skipping the cache, we need to check the database directly\n if skip_cache:\n return self._check_database()\n\n # Use the meta class to see if the view exists\n views_df = self.meta.views(self.database)\n\n # Check if we have ANY views\n if views_df.empty:\n return False\n\n # Check if the view exists\n return self.table in views_df[\"Name\"].values\n\n def ensure_exists(self):\n \"\"\"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it\"\"\"\n\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # Check the database directly\n if not self._check_database():\n self._auto_create_view()\n\n def _check_database(self) -> bool:\n \"\"\"Internal: Check if the view exists in the database\n\n Returns:\n bool: True if the view exists, False otherwise\n \"\"\"\n # Query to check if the table/view exists\n check_table_query = f\"\"\"\n SELECT table_name\n FROM information_schema.tables\n WHERE table_schema = '{self.database}' AND table_name = '{self.table}'\n \"\"\"\n _df = self.data_source.query(check_table_query)\n return not _df.empty\n\n def _auto_create_view(self) -> bool:\n \"\"\"Internal: Automatically create a view training, display, and computation views\n\n Returns:\n bool: True if the view was created, False otherwise\n\n Raises:\n ValueError: If the view type is not supported\n \"\"\"\n from sageworks.core.views import DisplayView, ComputationView, TrainingView\n\n # First if we're going to auto-create, we need to make sure the data source exists\n if not self.data_source.exists():\n self.log.error(f\"Data Source {self.data_source.uuid} does not exist...\")\n return False\n\n # DisplayView\n if 
self.view_name == \"display\":\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n DisplayView.create(self.data_source)\n return True\n\n # ComputationView\n if self.view_name == \"computation\":\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n ComputationView.create(self.data_source)\n return True\n\n # TrainingView\n if self.view_name == \"training\":\n # We're only going to create training views for FeatureSets\n if self.is_feature_set:\n self.log.important(f\"Auto creating View {self.view_name} for {self.data_source.uuid}...\")\n TrainingView.create(self.data_source, id_column=self.auto_id_column)\n return True\n else:\n self.log.warning(\"Training Views are only supported for FeatureSets...\")\n return False\n\n # If we get here, we don't support auto-creating this view\n self.log.warning(f\"Auto-Create for {self.view_name} not implemented yet...\")\n return False\n\n def __repr__(self):\n \"\"\"Return a string representation of this object\"\"\"\n\n # Set up various details that we want to print out\n auto = \"(Auto-Created)\" if self.auto_created else \"\"\n artifact = \"FeatureSet\" if self.is_feature_set else \"DataSource\"\n\n info = f'View: \"{self.view_name}\" for {artifact}(\"{self.artifact_name}\")\\n'\n info += f\" Database: {self.database}\\n\"\n info += f\" Table: {self.table}{auto}\\n\"\n info += f\" Source Table: {self.source_table}\\n\"\n info += f\" Join View: {self.join_view}\"\n return info\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.table","title":"table: str
property
","text":"Construct the view table name for the given view type
Returns:
| Name | Type | Description |
|------|------|-------------|
| str | str | The view table name |
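The naming rule behind this property can be sketched as a standalone function, following the table property in the View class source above. The view_table_name name and the sample table names are illustrative assumptions:

```python
from typing import Optional


def view_table_name(base_table_name: str, view_name: Optional[str]) -> Optional[str]:
    '''Mirror the naming rule: the base view maps to the base table itself.'''
    if view_name is None:
        # A view that failed auto-creation has no name and therefore no table
        return None
    if view_name == 'base':
        return base_table_name
    return f'{base_table_name}_{view_name}'


# Illustrative table names only
training_table = view_table_name('abalone_features', 'training')
base_table = view_table_name('abalone_features', 'base')
print(training_table, base_table)
```

Since each view is addressed by a predictable table name, a custom SQL query can target it directly, as the Common Usage snippet does with comp_view.table.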
"},{"location":"api_classes/views/#sageworks.core.views.view.View.__init__","title":"__init__(artifact, view_name, **kwargs)
","text":"View Constructor: Retrieve a View for the given artifact
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| artifact | Union[DataSource, FeatureSet] | A DataSource or FeatureSet object | required |
| view_name | str | The name of the view to retrieve (e.g. \"training\") | required |
Source code in src/sageworks/core/views/view.py
def __init__(self, artifact: Union[DataSource, FeatureSet], view_name: str, **kwargs):\n \"\"\"View Constructor: Retrieve a View for the given artifact\n\n Args:\n artifact (Union[DataSource, FeatureSet]): A DataSource or FeatureSet object\n view_name (str): The name of the view to retrieve (e.g. \"training\")\n \"\"\"\n\n # Set the view name\n self.view_name = view_name\n\n # Is this a DataSource or a FeatureSet?\n self.is_feature_set = isinstance(artifact, FeatureSetCore)\n self.auto_id_column = artifact.id_column if self.is_feature_set else None\n\n # Get the data_source from the artifact\n self.artifact_name = artifact.uuid\n self.data_source = artifact.data_source if self.is_feature_set else artifact\n self.database = self.data_source.database\n\n # Construct our base_table_name\n self.base_table_name = self.data_source.table\n\n # Check if the view should be auto created\n self.auto_created = False\n if kwargs.get(\"auto_create_view\", True) and not self.exists():\n\n # A direct double check before we auto-create\n if not self.exists(skip_cache=True):\n self.log.important(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist, attempting to auto-create...\"\n )\n self.auto_created = self._auto_create_view()\n\n # Check for failure of the auto-creation\n if not self.auto_created:\n self.log.error(\n f\"View {self.view_name} for {self.artifact_name} doesn't exist and cannot be auto-created...\"\n )\n self.view_name = self.columns = self.column_types = self.source_table = self.base_table_name = None\n return\n\n # Now fill some details about the view\n self.columns, self.column_types, self.source_table, self.join_view = view_details(\n self.table, self.data_source.database, self.data_source.boto3_session\n )\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.__repr__","title":"__repr__()
","text":"Return a string representation of this object
Source code in src/sageworks/core/views/view.py
def __repr__(self):\n \"\"\"Return a string representation of this object\"\"\"\n\n # Set up various details that we want to print out\n auto = \"(Auto-Created)\" if self.auto_created else \"\"\n artifact = \"FeatureSet\" if self.is_feature_set else \"DataSource\"\n\n info = f'View: \"{self.view_name}\" for {artifact}(\"{self.artifact_name}\")\\n'\n info += f\" Database: {self.database}\\n\"\n info += f\" Table: {self.table}{auto}\\n\"\n info += f\" Source Table: {self.source_table}\\n\"\n info += f\" Join View: {self.join_view}\"\n return info\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.column_details","title":"column_details()
","text":"Return a dictionary of the column names and types for this view
Returns:
| Name | Type | Description |
|------|------|-------------|
| dict | dict | A dictionary of the column names and types |
Source code in src/sageworks/core/views/view.py
def column_details(self) -> dict:\n \"\"\"Return a dictionary of the column names and types for this view\n\n Returns:\n dict: A dictionary of the column names and types\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.delete","title":"delete()
","text":"Delete the database view (and supplemental data) if it exists.
Source code in src/sageworks/core/views/view.py
def delete(self):\n \"\"\"Delete the database view (and supplemental data) if it exists.\"\"\"\n\n # List any supplemental tables for this data source\n supplemental_tables = list_supplemental_data_tables(self.base_table_name, self.database)\n for table in supplemental_tables:\n if self.view_name in table:\n self.log.important(f\"Deleting Supplemental Table {table}...\")\n delete_table(table, self.database, self.data_source.boto3_session)\n\n # Now drop the view\n self.log.important(f\"Dropping View {self.table}...\")\n drop_view_query = f'DROP VIEW \"{self.table}\"'\n\n # Execute the DROP VIEW query\n try:\n self.data_source.execute_statement(drop_view_query, silence_errors=True)\n except wr.exceptions.QueryFailed as e:\n if \"View not found\" in str(e):\n self.log.info(f\"View {self.table} not found, this is fine...\")\n else:\n raise\n\n # We want to do a small sleep so that AWS has time to catch up\n self.log.info(\"Sleeping for 3 seconds after dropping view to allow AWS to catch up...\")\n time.sleep(3)\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.ensure_exists","title":"ensure_exists()
","text":"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it
Source code insrc/sageworks/core/views/view.py
def ensure_exists(self):\n \"\"\"Ensure if the view exists by making a query directly to the database. If it doesn't exist, create it\"\"\"\n\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # Check the database directly\n if not self._check_database():\n self._auto_create_view()\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.exists","title":"exists(skip_cache=False)
","text":"Check if the view exists in the database
Parameters:
Name Type Description Defaultskip_cache
bool
Skip the cache and check the database directly (default: False)
False
Returns: bool: True if the view exists, False otherwise.
Source code insrc/sageworks/core/views/view.py
def exists(self, skip_cache: bool = False) -> bool:\n \"\"\"Check if the view exists in the database\n\n Args:\n skip_cache (bool): Skip the cache and check the database directly (default: False)\n Returns:\n bool: True if the view exists, False otherwise.\n \"\"\"\n # The BaseView always exists\n if self.view_name == \"base\":\n return True\n\n # If we're skipping the cache, we need to check the database directly\n if skip_cache:\n return self._check_database()\n\n # Use the meta class to see if the view exists\n views_df = self.meta.views(self.database)\n\n # Check if we have ANY views\n if views_df.empty:\n return False\n\n # Check if the view exists\n return self.table in views_df[\"Name\"].values\n
"},{"location":"api_classes/views/#sageworks.core.views.view.View.pull_dataframe","title":"pull_dataframe(limit=50000, head=False)
","text":"Pull a DataFrame based on the view type
Parameters:
Name Type Description Defaultlimit
int
The maximum number of rows to pull (default: 50000)
50000
head
bool
Return just the head of the DataFrame (default: False)
False
Returns:
Type DescriptionUnion[DataFrame, None]
Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist
Source code insrc/sageworks/core/views/view.py
def pull_dataframe(self, limit: int = 50000, head: bool = False) -> Union[pd.DataFrame, None]:\n \"\"\"Pull a DataFrame based on the view type\n\n Args:\n limit (int): The maximum number of rows to pull (default: 50000)\n head (bool): Return just the head of the DataFrame (default: False)\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the view or None if it doesn't exist\n \"\"\"\n\n # Pull the DataFrame\n if head:\n limit = 5\n pull_query = f'SELECT * FROM \"{self.table}\" LIMIT {limit}'\n df = self.data_source.query(pull_query)\n return df\n
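The way pull_dataframe turns its arguments into an Athena query can be sketched in isolation. The helper name build_pull_query and the table name my_view_table are hypothetical (they are not part of SageWorks); the function only mirrors the string construction in the listing above, where head=True overrides limit with 5.

```python
def build_pull_query(table: str, limit: int = 50000, head: bool = False) -> str:
    '''Build the SELECT statement the way the listing above does (illustrative helper).'''
    if head:
        limit = 5  # head=True always wins over the caller-supplied limit
    return f'SELECT * FROM \"{table}\" LIMIT {limit}'


# Hypothetical table name, used only to show the resulting SQL
print(build_pull_query('my_view_table', limit=100))
print(build_pull_query('my_view_table', limit=100, head=True))
```

Note that limit is ignored whenever head is set, so callers that want a small preview do not need to pass both.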
"},{"location":"api_classes/views/#sageworks.core.views.view.View.query","title":"query(query)
","text":"Query the view with a custom SQL query
Parameters:
Name Type Description Defaultquery
str
The SQL query to execute
requiredReturns:
Type DescriptionUnion[DataFrame, None]
Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist
Source code insrc/sageworks/core/views/view.py
def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the view with a custom SQL query\n\n Args:\n query (str): The SQL query to execute\n\n Returns:\n Union[pd.DataFrame, None]: The DataFrame for the query or None if it doesn't exist\n \"\"\"\n return self.data_source.query(query)\n
"},{"location":"api_classes/views/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Listing Views
views.pyfrom sageworks.api.data_source import DataSource\n\n# List the views available for a DataSource\ntest_data = DataSource('test_data')\ntest_data.views()\n[\"display\", \"training\", \"computation\"]\n
Getting a Particular View
views.pyfrom sageworks.api.feature_set import FeatureSet\n\nfs = FeatureSet('test_features')\n\n# Grab the columns for the display view\ndisplay_view = fs.view(\"display\")\ndisplay_view.columns\n['id', 'name', 'height', 'weight', 'salary', ...]\n\n# Pull the dataframe for this view\ndf = display_view.pull_dataframe()\n id name height weight salary ...\n0 58 Person 58 71.781227 275.088196 162053.140625 \n
View Queries
All SageWorks Views are stored in AWS Athena, so any query that you can make with Athena is accessible through the View Query API.
view_query.pyfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet View\nfs = FeatureSet(\"abalone_features\")\nt_view = fs.view(\"training\")\n\n# Make some queries using the Athena backend\ndf = t_view.query(f\"select * from {t_view.table} where height > .3\")\nprint(df.head())\n\ndf = t_view.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
Classes to construct View
The SageWorks Classes used to construct views are currently in 'Core', so you can check out the documentation for those classes here: SageWorks View Creators
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"aws_setup/aws_access_management/","title":"AWS Access Management","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page gives an overview of how SageWorks sets up roles and policies in a granular way that provides 'least privilege' and also provides a unified framework for AWS access management.
"},{"location":"aws_setup/aws_access_management/#conceptual-slide-deck","title":"Conceptual Slide Deck","text":"SageWorks AWS Access Management
"},{"location":"aws_setup/aws_access_management/#aws-resources","title":"AWS Resources","text":"Follow the steps below to set up and connect using AWS Client VPN.
"},{"location":"aws_setup/aws_client_vpn/#step-1-create-a-client-vpn-endpoint-in-aws","title":"Step 1: Create a Client VPN Endpoint in AWS","text":"10.0.0.0/22
) that doesn\u2019t overlap with your VPC CIDR.0.0.0.0/0
to allow access to all resources in the VPC.Allow access
and specify the group you created or allow all users.AWS Client VPN is a straightforward, secure, and effective solution for connecting your laptop to an AWS VPC. It requires minimal setup and provides all the security controls you need, making it ideal for a single laptop and user.
"},{"location":"aws_setup/aws_setup/","title":"AWS Setup","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"aws_setup/aws_setup/#get-some-information","title":"Get some information","text":"Write these values down, you'll need them as part of this AWS setup.
"},{"location":"aws_setup/aws_setup/#install-aws-cli","title":"Install AWS CLI","text":"AWS CLI Instructions
"},{"location":"aws_setup/aws_setup/#running-the-sso-configuration","title":"Running the SSO Configuration","text":"Note: You only need to do this once! Also this will create a NEW profile, so name the profile something like aws_sso
.
aws configure sso --profile <whatever> (e.g. aws_sso)\nSSO session name (Recommended): sso-session\nSSO start URL []: <the Start URL from info above>\nSSO region []: <the Region from info above>\nSSO registration scopes [sso:account:access]: <just hit return>\n
A browser will open/redirect at this point, and you will then get a list of available accounts, something like the one below. Just pick the correct account.
There are 2 AWS accounts available to you.\n> SCP_Sandbox, briford+sandbox@supercowpowers.com (XXXX40646YYY)\n SCP_Main, briford@supercowpowers.com (XXX576391YYY)\n
Now pick the role that you're going to use
There are 2 roles available to you.\n> DataScientist\n AdministratorAccess\n\nCLI default client Region [None]: <same region as above>\nCLI default output format [None]: json\n
"},{"location":"aws_setup/aws_setup/#setting-up-some-aliases-for-bashzsh","title":"Setting up some aliases for bash/zsh","text":"Edit your favorite ~/.bashrc ~/.zshrc and add these nice aliases/helper
# AWS Aliases\nalias aws_sso='export AWS_PROFILE=aws_sso'\n\n# Default AWS Profile\nexport AWS_PROFILE=aws_sso\n
"},{"location":"aws_setup/aws_setup/#testing-your-new-aws-profile","title":"Testing your new AWS Profile","text":"Make sure your profile is active/set
env | grep AWS\nAWS_PROFILE=<aws_sso or whatever>\n
Now you can list the S3 buckets in the AWS Account aws s3 ls\n
If you get some message like this... The SSO session associated with this profile has\nexpired or is otherwise invalid. To refresh this SSO\nsession run aws sso login with the corresponding\nprofile.\n
This is fine/good, a browser will open up and you can refresh your SSO Token.
After that you should get a listing of the S3 buckets without needing to refresh your token.
aws s3 ls\n\u276f aws s3 ls\n2023-03-20 20:06:53 aws-athena-query-results-XXXYYY-us-west-2\n2023-03-30 13:22:28 sagemaker-studio-XXXYYY-dbgyvq8ruka\n2023-03-24 22:05:55 sagemaker-us-west-2-XXXYYY\n2023-04-30 13:43:29 scp-sageworks-artifacts\n
"},{"location":"aws_setup/aws_setup/#back-to-initial-setup","title":"Back to Initial Setup","text":"If you're doing the initial setup of SageWorks you should now go back and finish that process: Getting Started
"},{"location":"aws_setup/aws_setup/#aws-resources","title":"AWS Resources","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page tries to give helpful guidance when setting up AWS Accounts, Users, and Groups. In general AWS can be a bit tricky to set up the first time. Feel free to use any material in this guide but we're more than happy to help clients get their AWS Setup ready to go for FREE. Below are some guides for setting up a new AWS account for SageWorks and also setting up SSO Users and Groups within AWS.
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-with-aws-organizations-easy","title":"New AWS Account (with AWS Organizations: easy)","text":"Email Trick
AWS will often not allow the same email to be used for different accounts. If you need a 'new' email, add a plus sign '+' and a short tag after the user name of your existing email (e.g. bob.smith+aws@gmail.com). With providers that support plus addressing, such as Gmail, this email is delivered to bob.smith@gmail.com.
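The email trick can be sketched as a tiny helper. The function name plus_alias is hypothetical and only illustrates where the '+' tag goes; whether the alias is actually delivered depends on your mail provider supporting plus addressing.

```python
def plus_alias(email: str, tag: str) -> str:
    '''Return a plus-addressed alias: the tag goes after the user name, before the @.'''
    local, sep, domain = email.rpartition('@')
    if not sep or not local or not domain:
        raise ValueError(f'not a valid email address: {email!r}')
    return f'{local}+{tag}@{domain}'


# One alias per AWS account, all based on the same mailbox
print(plus_alias('bob.smith@gmail.com', 'aws'))
print(plus_alias('bob.smith@gmail.com', 'sandbox'))
```

Using a different tag per account (dev, stage, prod) keeps the account emails unique.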
"},{"location":"aws_setup/aws_tips_and_tricks/#new-aws-account-without-aws-organizations-a-bit-harder","title":"New AWS Account (without AWS Organizations: a bit harder)","text":"AWS SSO (Single Sign-On) is a cloud-based service that allows users to manage access to multiple AWS accounts and business applications using a single set of credentials. It simplifies the authentication process for users and provides centralized management of permissions and access control across various AWS resources. With AWS SSO, users can log in once and access all the applications and accounts they need, streamlining the user experience and increasing productivity. AWS SSO also enables IT administrators to manage access more efficiently by providing a single point of control for managing user access, permissions, and policies, reducing the risk of unauthorized access or security breaches.
"},{"location":"aws_setup/aws_tips_and_tricks/#setting-up-sso-users","title":"Setting up SSO Users","text":"The 'Add User' setup is fairly straight forward but here are some screen shots:
On the first panel you can fill in the user's information.
"},{"location":"aws_setup/aws_tips_and_tricks/#groups","title":"Groups","text":"On the second panel we suggest that you have at LEAST two groups:
This allows you to put most of the users into the DataScientists group that has AWS policies based on their job role. AWS uses 'permission sets' and you assign AWS Policies. This approach makes it easy to give a group of users a set of relevant policies for their tasks.
Our standard setup is to have two permission sets with the following policies:
Add Policy: arn:aws:iam::aws:policy/job-function/DataScientist
IAM Identity Center --> Permission sets --> AdministratorAccess
See: Permission Sets for more details and instructions.
Another benefit of creating groups is that you can include that group in 'Trust Policy (assume_role)' for the SageWorks-ExecutionRole (this gets deployed as part of the SageWorks AWS Stack). This means that the management of what SageWorks can do/see/read/write is completely done through the SageWorks-ExecutionRole.
"},{"location":"aws_setup/aws_tips_and_tricks/#back-to-adding-user","title":"Back to Adding User","text":"Okay now that we have our groups set up we can go back to our original goal of adding a user. So here's the second panel with the groups and now we can hit 'Next'
On the third panel just review the details and hit the 'Add User' button at the bottom. The user will get an email giving them instructions on how to log on to their AWS account.
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-console","title":"AWS Console","text":"Now when the user logs onto the AWS Console they should see something like this:
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-clisso-setup-for-command-linepython-usage","title":"AWS CLI/SSO Setup for Command Line/Python Usage","text":"Please see our AWS Setup
"},{"location":"aws_setup/aws_tips_and_tricks/#aws-resources","title":"AWS Resources","text":"Welcome to the SageWorks AWS Setup Guide. SageWorks is deployed as an AWS Stack following the well architected system practices of AWS.
AWS Setup can be a bit complex
Setting up SageWorks with AWS can be a bit complex, but this only needs to be done ONCE for your entire company. The install uses standard CDK --> AWS Stacks and SageWorks tries to make it straightforward. If you have any trouble at all, feel free to contact us at sageworks@supercowpowers.com or on Discord and we're happy to help you with AWS for FREE.
"},{"location":"aws_setup/core_stack/#two-main-options-when-using-sageworks","title":"Two main options when using SageWorks","text":"Both options are fully supported, but we highly suggest a NEW account as it gives the following benefits:
If your AWS Account already has users and groups set up you can skip this, but here are our recommendations on setting up SSO Users and Groups
"},{"location":"aws_setup/core_stack/#onboarding-sageworks-to-your-aws-account","title":"Onboarding SageWorks to your AWS Account","text":"Pulling down the SageWorks Repo
git clone https://github.com/SuperCowPowers/sageworks.git\n
"},{"location":"aws_setup/core_stack/#sageworks-uses-aws-python-cdk-for-deployments","title":"SageWorks uses AWS Python CDK for Deployments","text":"If you don't have AWS CDK already installed you can do these steps:
Mac
brew install node \nnpm install -g aws-cdk\n
Linux sudo apt install nodejs\nsudo npm install -g aws-cdk\n
For more information on Linux installs see Digital Ocean NodeJS"},{"location":"aws_setup/core_stack/#create-an-s3-bucket-for-sageworks","title":"Create an S3 Bucket for SageWorks","text":"SageWorks pushes and pulls data from AWS and will use this S3 Bucket for storage and processing. You should create a NEW S3 Bucket; we suggest a name like <company_name>-sageworks
Do the initial setup/config here: Getting Started. After you've done that come back to this section. For Stack Deployment additional things need to be added to your config file. The config file will be located in your home directory ~/.sageworks/sageworks_config.json
. Edit this file and add the additional fields for the deployment. Specifically, there are two additional fields to add (both optional):
\"SAGEWORKS_SSO_GROUP\": \"DataScientist\" (or whatever)\n\"SAGEWORKS_ADDITIONAL_BUCKETS\": \"bucket1, bucket2\"\n
These are optional but are set/used by most SageWorks users. AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_core\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
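Before running the deployment commands above, the two optional config fields can be added by hand or with a short script. This is a minimal sketch, assuming the config file is plain JSON as its .json extension suggests. The helper name add_deployment_fields and the EXISTING_KEY placeholder are hypothetical, and the demo writes to a temporary file rather than to ~/.sageworks/sageworks_config.json.

```python
import json
import tempfile
from pathlib import Path


def add_deployment_fields(config_path, sso_group, extra_buckets):
    '''Add the two optional deployment fields to an existing JSON config file.'''
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text())
    config['SAGEWORKS_SSO_GROUP'] = sso_group
    # The docs show the buckets as a single comma-separated string
    config['SAGEWORKS_ADDITIONAL_BUCKETS'] = ', '.join(extra_buckets)
    path.write_text(json.dumps(config, indent=4))
    return config


# Demo against a throwaway file; 'EXISTING_KEY' is a placeholder, not a real SageWorks setting
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'sageworks_config.json'
    demo_path.write_text(json.dumps({'EXISTING_KEY': 'unchanged'}))
    updated = add_deployment_fields(demo_path, 'DataScientist', ['bucket1', 'bucket2'])
    reloaded = json.loads(demo_path.read_text())
    print(reloaded)
```

Existing keys are left untouched, so the script is safe to re-run with different values.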
"},{"location":"aws_setup/core_stack/#aws-account-setup-check","title":"AWS Account Setup Check","text":"After setting up SageWorks config/AWS Account you can run this test/checking script. If the results ends with INFO AWS Account Clamp: AOK!
you're in good shape. If not feel free to contact us on Discord and we'll get it straightened out for you :)
pip install sageworks (if not already installed)\ncd sageworks/aws_setup\npython aws_account_check.py\n<lot of print outs for various checks>\nINFO AWS Account Clamp: AOK!\n
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"aws_setup/dashboard_stack/","title":"Deploy the SageWorks Dashboard Stack","text":"Deploying the Dashboard Stack is reasonably straightforward; it uses the same approach as the Core Stack that you've already deployed.
Please review the Stack Details section to understand all the AWS components that are included and utilized in the SageWorks Dashboard Stack.
"},{"location":"aws_setup/dashboard_stack/#deploying-the-dashboard-stack","title":"Deploying the Dashboard Stack","text":"AWS Stuff
Activate your AWS Account that's used for SageWorks deployment. For this one-time install you should use an Admin Account (or an account that has permissions to create/update AWS Stacks)
cd sageworks/aws_setup/sageworks_dashboard_full\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/dashboard_stack/#stack-details","title":"Stack Details","text":"AWS Questions?
There's quite a bit to unpack when deploying an AWS powered Web Service. We're happy to help walk you through the details and options. Contact us anytime for a free consultation.
AWS Costs
Deploying the SageWorks Dashboard does incur some monthly AWS costs. If you're on a tight budget you can deploy the 'lite' version of the Dashboard Stack.
cd sageworks/aws_setup/sageworks_dashboard_lite\nexport SAGEWORKS_CONFIG=/full/path/to/config.json\npip install -r requirements.txt\ncdk bootstrap\ncdk deploy\n
"},{"location":"aws_setup/domain_cert_setup/","title":"AWS Domain and Certificate Instructions","text":"Need AWS Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
This page tries to give helpful guidance when setting up a new domain and SSL Certificate in your AWS Account.
"},{"location":"aws_setup/domain_cert_setup/#new-domain","title":"New Domain","text":"You'll want the SageWorks Dashboard to have a domain for your company's internal use. Customers will typically use a domain like <company_name>-ml-dashboard.com
but you are free to choose any domain you'd like.
Domains are tied to AWS Accounts
When you create a new domain in AWS Route 53, that domain is tied to that AWS Account. You can do a cross-account setup for domains, but it's a bit trickier. We recommend that each account where SageWorks gets deployed owns the domain for that Dashboard.
"},{"location":"aws_setup/domain_cert_setup/#multiple-aws-accounts","title":"Multiple AWS Accounts","text":"Many customers will have a dev/stage/prod set of AWS accounts. If that's the case, the best practice is to make a domain specific to each account. For instance:
<company_name>-ml-dashboard-dev.com
<company_name>-ml-dashboard-prod.com
. This means that when you go to that Dashboard it's super obvious which environment you're on.
"},{"location":"aws_setup/domain_cert_setup/#register-the-domain","title":"Register the Domain","text":"Open Route 53 Console Route 53 Console
Register your New Domain
Open ACM Console: AWS Certificate Manager (ACM) Console
Request a Certificate:
Add Domain Names:
yourdomain.com
).www.yourdomain.com
).Validation Method:
Add Tags (Optional):
Review and Request:
To complete the domain validation process for your SSL/TLS certificate, you need to add the CNAME records provided by AWS Certificate Manager (ACM) to your Route 53 hosted zone. This step ensures that you own the domain and allows ACM to issue the certificate.
"},{"location":"aws_setup/domain_cert_setup/#finding-cname-record-names-and-values","title":"Finding CNAME Record Names and Values","text":"You can find the CNAME record names and values in the AWS Certificate Manager (ACM) console:
Open ACM Console: AWS Certificate Manager (ACM) Console
Select Your Certificate:
View Domains Section:
Open Route 53 Console: Route 53 Console
Select Your Hosted Zone:
yourdomain.com
).Add the First CNAME Record:
_3e8623442477e9eeec.your-domain.com
).CNAME
._0908c89646d92.sdgjtdhdhz.acm-validations.aws.
) (include the trailing dot).Add the Second CNAME Record:
_75cd9364c643caa.www.your-domain.com
).CNAME
._f72f8cff4fb20f4.sdgjhdhz.acm-validations.aws.
) (include the trailing dot).DNS Propagation and Cert Validation
After adding the CNAME records, these DNS records will propagate through the DNS system and ACM will automatically detect the validation records and validate the domain. This process can take a few minutes or up to an hour.
"},{"location":"aws_setup/domain_cert_setup/#certificate-states","title":"Certificate States","text":"After requesting a certificate, it will go through the following states:
Pending Validation: The initial state after you request a certificate and before you complete the validation process. ACM is waiting for you to prove domain ownership by adding the CNAME records.
Issued: This state indicates that the certificate has been successfully validated and issued. You can now use this certificate with your AWS resources.
Validation Timed Out: If you do not complete the validation process within a specified period (usually 72 hours), the certificate request times out and enters this state.
Revoked: This state indicates that the certificate has been revoked and is no longer valid.
Failed: If the validation process fails for any reason, the certificate enters this state.
Inactive: This state indicates that the certificate is not currently in use.
The certificate status should be Issued. If it is not, please contact the SageWorks Support Team.
"},{"location":"aws_setup/domain_cert_setup/#retrieving-the-certificate-arn","title":"Retrieving the Certificate ARN","text":"Open ACM Console:
Check the Status:
Copy the Certificate ARN:
You now have the ARN for your certificate, which you can use in your AWS resources such as API Gateway, CloudFront, etc.
"},{"location":"aws_setup/domain_cert_setup/#aws-resources","title":"AWS Resources","text":"Now that the core SageWorks AWS Stack has been deployed, let's test out SageWorks by building a full AWS ML Pipeline from start to finish. The script build_ml_pipeline.py
uses the SageWorks API to quickly and easily build an AWS Modeling Pipeline.
Taste the Awesome
The SageWorks \"hello world\" builds a full AWS ML Pipeline. From S3 to deployed model and endpoint. If you have any troubles at all feel free to contact us at sageworks email or on Discord and we're happy to help you for FREE.
This script will take a LONG TIME to run; most of the time is spent waiting on AWS to finalize FeatureGroups, train Models, or deploy Endpoints.
\u276f python build_ml_pipeline.py\n<lot of building ML pipeline outputs>\n
After the script completes you will see that it's built out an AWS ML Pipeline and testing artifacts."},{"location":"aws_setup/full_pipeline/#run-the-sageworks-dashboard-local","title":"Run the SageWorks Dashboard (Local)","text":"Dashboard AWS Stack
Deploying the Dashboard Stack is straight-forward and provides a robust AWS Web Server with Load Balancer, Elastic Container Service, VPC Networks, etc. (see AWS Dashboard Stack)
For testing it's nice to run the Dashboard locally, but for long-term use the SageWorks Dashboard should be deployed as an AWS Stack. The deployed Stack allows everyone in the company to use, view, and interact with the AWS Machine Learning Artifacts created with SageWorks.
cd sageworks/application/aws_dashboard\n./dashboard\n
This will open a browser to http://localhost:8000 SageWorks Dashboard: AWS Pipelines in a Whole New Light!
Success
Congratulations: SageWorks is now deployed to your AWS Account. Deploying the AWS Stack only needs to be done once. Now that this is complete your developers can simply pip install sageworks
and start using the API.
If you ran into any issues with this procedure please contact us via Discord or email sageworks@supercowpowers.com and the SCP team will provide free setup and support for new SageWorks users.
"},{"location":"blogs_research/","title":"SageWorks Blogs","text":"Just Getting Started?
The SageWorks Blogs are a great way to see what's possible with SageWorks. Also, if you're ready to jump in, the API Classes will give you details on the SageWorks ML Pipeline Classes.
"},{"location":"blogs_research/#blogs","title":"Blogs","text":"Examples
All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
SageWorks EDS
The SageWorks toolkit has a set of plots that show EDA results. It also has a flexible plugin architecture to expand, enhance, or even replace the current set of Dashboard web components.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created that data is run through a full set of EDA techniques:
One of the latest EDA techniques we've added is a concept called High Target Gradients.
\\[G_{ij} = \\frac{|y_i - y_j|}{d(x_i, x_j)}\\]
where \\(d(x_i, x_j)\\) is the distance between \\(x_i\\) and \\(x_j\\) in the feature space. This equation gives you the rate of change of the target value with respect to the change in features, similar to a slope in a two-dimensional space.
\\[G_{i}^{max} = \\max_{j \\neq i} G_{ij}\\]
This gives you a scalar value for each point in your training data that represents the maximum rate of change of the target value in its local neighborhood.
Usage: You can use \\(G_{i}^{max}\\) to identify and filter areas in the feature space that have high target gradients, which may indicate potential issues with data quality or feature representation.
Visualization: Plotting the distribution of \\(G_{i}^{max}\\) values or visualizing them in the context of the feature space can help you identify regions or specific points that warrant further investigation.
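The two formulas above can be sketched in a few lines of plain Python. This is a brute-force illustration, not the SageWorks implementation: the function name max_target_gradients is hypothetical, Euclidean distance is an assumption for d, and the maximum is taken over all other points rather than over a restricted neighborhood.

```python
import math


def max_target_gradients(X, y):
    '''For each point i, return the max over j != i of |y_i - y_j| / d(x_i, x_j).'''
    n = len(X)
    result = []
    for i in range(n):
        best = 0.0
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(X[i], X[j])  # Euclidean distance (an assumption)
            diff = abs(y[i] - y[j])
            if dist == 0.0:
                # Identical features: infinite gradient if the targets differ, else zero
                grad = math.inf if diff > 0 else 0.0
            else:
                grad = diff / dist
            best = max(best, grad)
        result.append(best)
    return result


# Three toy points in a 2-D feature space with their target values
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
y = [1.0, 3.0, 2.0]
print(max_target_gradients(X, y))  # -> [2.0, 2.0, 0.5]
```

The double loop is quadratic in the number of points, so for larger datasets one option is to limit the inner loop to each point's nearest neighbors.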
Amazon SageMaker Model Monitor currently provides the following types of monitoring:
Overview and Definition Residual analysis involves examining the differences between observed and predicted values, known as residuals, to assess the performance of a predictive model. It is a critical step in model evaluation as it helps identify patterns of errors, diagnose potential problems, and improve model performance. By understanding where and why a model's predictions deviate from actual values, we can make informed adjustments to the model or the data to enhance accuracy and robustness.
Sparse Data Regions The observation is in a part of feature space with little or no nearby training observations, leading to poor generalization in these regions and resulting in high prediction errors.
Noisy/Inconsistent Data and Preprocessing Issues The observation is in a part of feature space where the training data is noisy, incorrect, or has high variance in the target variable. Additionally, missing values or incorrect data transformations can introduce errors, leading to unreliable predictions and high residuals.
Feature Resolution The current feature set may not fully resolve the compounds, leading to \u2018collisions\u2019 where different compounds are assigned identical features. Such unresolved features can result in different compounds exhibiting the same features, causing high residuals due to unaccounted structural or chemical nuances.
Activity Cliffs Structurally similar compounds exhibit significantly different activities, making accurate prediction challenging due to steep changes in activity with minor structural modifications.
Feature Engineering Issues Irrelevant or redundant features and poor feature scaling can negatively impact the model's performance and accuracy, resulting in higher residuals.
Model Overfitting or Underfitting Overfitting occurs when the model is too complex and captures noise, while underfitting happens when the model is too simple and misses underlying patterns, both leading to inaccurate predictions.
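A first pass at residual analysis is to compute the residuals and flag the observations whose residual stands out from the rest. The sketch below is one simple way to do that, assuming a z-score cutoff on the residuals; the function name flag_high_residuals and the threshold of 2.0 are illustrative choices, not something prescribed above.

```python
import statistics


def flag_high_residuals(observed, predicted, z_threshold=2.0):
    '''Return (residuals, indices whose residual z-score exceeds the threshold).'''
    residuals = [o - p for o, p in zip(observed, predicted)]
    spread = statistics.pstdev(residuals)
    if spread == 0.0:
        return residuals, []  # all residuals identical, nothing stands out
    mean = statistics.fmean(residuals)
    flagged = [i for i, r in enumerate(residuals) if abs(r - mean) / spread > z_threshold]
    return residuals, flagged


# Toy data: the last prediction is far from the observed value
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0]
residuals, flagged = flag_high_residuals(observed, predicted)
print(flagged)  # -> [7]
```

The flagged observations are the ones to inspect against the causes listed above, such as sparse regions, noisy data, or feature collisions.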
"},{"location":"cached/cached_data_source/","title":"CachedDataSource","text":"Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedDataSource: Caches the method results for SageWorks DataSources
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource","title":"CachedDataSource
","text":" Bases: CachedArtifactMixin
, AthenaSource
CachedDataSource: Caches the method results for SageWorks DataSources
Note: Cached method values may lag underlying DataSource changes.
Common Usagemy_data = CachedDataSource(name)\nmy_data.details()\nmy_data.health_check()\nmy_data.sageworks_meta()\n
Source code in src/sageworks/cached/cached_data_source.py
class CachedDataSource(CachedArtifactMixin, AthenaSource):\n \"\"\"CachedDataSource: Caches the method results for SageWorks DataSources\n\n Note: Cached method values may lag underlying DataSource changes.\n\n Common Usage:\n ```python\n my_data = CachedDataSource(name)\n my_data.details()\n my_data.health_check()\n my_data.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedDataSource Initialization\"\"\"\n AthenaSource.__init__(self, data_uuid=data_uuid, database=database, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Health Check.\n\n Returns:\n dict: A dictionary of health check details for the DataSource\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this DataSource.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
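The idea of caching method results, and why cached values may lag behind the underlying data, can be illustrated with a small stand-alone decorator. This is a hypothetical sketch and not the CachedArtifactMixin.cache_result implementation, whose internals are not shown here; the SlowSource class is a stand-in for an artifact with an expensive details() call.

```python
import functools


def cache_result(method):
    '''Cache a method's return value per instance and per argument set (illustrative only).'''
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        cache = self.__dict__.setdefault('_result_cache', {})
        # Arguments must be hashable to be usable as part of the cache key
        key = (method.__name__, args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = method(self, *args, **kwargs)
        return cache[key]
    return wrapper


class SlowSource:
    '''Stand-in for an artifact whose details() call is expensive.'''

    def __init__(self):
        self.calls = 0

    @cache_result
    def details(self):
        self.calls += 1
        return {'rows': 100, 'call': self.calls}


src = SlowSource()
first = src.details()
second = src.details()
print(first is second, src.calls)  # -> True 1
```

The second call returns the stored dictionary without running the method again, which is also why a cached value can be stale if the underlying artifact has changed in the meantime.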
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.__init__","title":"__init__(data_uuid, database='sageworks')
","text":"CachedDataSource Initialization
Source code in src/sageworks/cached/cached_data_source.py
def __init__(self, data_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedDataSource Initialization\"\"\"\n AthenaSource.__init__(self, data_uuid=data_uuid, database=database, use_cached_meta=True)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.details","title":"details(**kwargs)
","text":"Retrieve the DataSource Details.
Returns:
dict: A dictionary of details about the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.health_check","title":"health_check(**kwargs)
","text":"Retrieve the DataSource Health Check.
Returns:
dict: A dictionary of health check details for the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Health Check.\n\n Returns:\n dict: A dictionary of health check details for the DataSource\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the SageWorks Metadata for this DataSource.
Returns:
Union[dict, None]: Dictionary of SageWorks metadata for this Artifact
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this DataSource.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.smart_sample","title":"smart_sample()
","text":"Retrieve the Smart Sample for this DataSource.
Returns:
pd.DataFrame: The Smart Sample DataFrame
Source code in src/sageworks/cached/cached_data_source.py
def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this DataSource.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_data_source/#sageworks.cached.cached_data_source.CachedDataSource.summary","title":"summary(**kwargs)
","text":"Retrieve the DataSource Details.
Returns:
dict: A dictionary of details about the DataSource
Source code in src/sageworks/cached/cached_data_source.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the DataSource Details.\n\n Returns:\n dict: A dictionary of details about the DataSource\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_data_source/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pull DataSource Details
from sageworks.cached.cached_data_source import CachedDataSource\n\n# Grab a DataSource\nds = CachedDataSource(\"abalone_data\")\n\n# Show the details\nds.details()\n\n> ds.details()\n\n{'uuid': 'abalone_data',\n 'health_tags': [],\n 'aws_arn': 'arn:aws:glue:x:table/sageworks/abalone_data',\n 'size': 0.070272,\n 'created': '2024-11-09T20:42:34.000Z',\n 'modified': '2024-11-10T19:57:52.000Z',\n 'input': 's3://sageworks-public-data/common/aBaLone.CSV',\n 'sageworks_health_tags': '',\n 'sageworks_correlations': {'length': {'diameter': 0.9868115846024996,\n
"},{"location":"cached/cached_endpoint/","title":"CachedEndpoint","text":"Model Examples
Examples of using the CachedEndpoint class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually, but the SageWorks Model Class makes it a breeze!
CachedEndpoint: Caches the method results for SageWorks Endpoints
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint","title":"CachedEndpoint
","text":" Bases: CachedArtifactMixin
, EndpointCore
CachedEndpoint: Caches the method results for SageWorks Endpoints
Note: Cached method values may lag underlying Endpoint changes.
Common Usage
my_endpoint = CachedEndpoint(name)\nmy_endpoint.details()\nmy_endpoint.health_check()\nmy_endpoint.sageworks_meta()\n
Source code in src/sageworks/cached/cached_endpoint.py
class CachedEndpoint(CachedArtifactMixin, EndpointCore):\n \"\"\"CachedEndpoint: Caches the method results for SageWorks Endpoints\n\n Note: Cached method values may lag underlying Endpoint changes.\n\n Common Usage:\n ```python\n my_endpoint = CachedEndpoint(name)\n my_endpoint.details()\n my_endpoint.health_check()\n my_endpoint.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid: str):\n \"\"\"CachedEndpoint Initialization\"\"\"\n EndpointCore.__init__(self, endpoint_uuid=endpoint_uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedEndpoint\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this Endpoint.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def endpoint_metrics(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Metrics\n\n Returns:\n str: The Endpoint Metrics\n \"\"\"\n return super().endpoint_metrics()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.__init__","title":"__init__(endpoint_uuid)
","text":"CachedEndpoint Initialization
Source code in src/sageworks/cached/cached_endpoint.py
def __init__(self, endpoint_uuid: str):\n \"\"\"CachedEndpoint Initialization\"\"\"\n EndpointCore.__init__(self, endpoint_uuid=endpoint_uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.details","title":"details(**kwargs)
","text":"Retrieve the CachedEndpoint Details.
Returns:
dict: A dictionary of details about the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.endpoint_metrics","title":"endpoint_metrics()
","text":"Retrieve the Endpoint Metrics
Returns:
Union[str, None]: The Endpoint Metrics
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef endpoint_metrics(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Metrics\n\n Returns:\n str: The Endpoint Metrics\n \"\"\"\n return super().endpoint_metrics()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.health_check","title":"health_check(**kwargs)
","text":"Retrieve the CachedEndpoint Health Check.
Returns:
dict: A dictionary of health check details for the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedEndpoint\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the SageWorks Metadata for this Endpoint.
Returns:
Union[dict, None]: Dictionary of SageWorks metadata for this Artifact
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this Endpoint.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_endpoint/#sageworks.cached.cached_endpoint.CachedEndpoint.summary","title":"summary(**kwargs)
","text":"Retrieve the CachedEndpoint Details.
Returns:
dict: A dictionary of details about the CachedEndpoint
Source code in src/sageworks/cached/cached_endpoint.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedEndpoint Details.\n\n Returns:\n dict: A dictionary of details about the CachedEndpoint\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_endpoint/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Get Endpoint Details
from sageworks.cached.cached_endpoint import CachedEndpoint\n\n# Grab an Endpoint\nend = CachedEndpoint(\"abalone-regression\")\n\n# Get the Details\nend.details()\n\n{'uuid': 'abalone-regression-end',\n 'health_tags': [],\n 'status': 'InService',\n 'instance': 'Serverless (2GB/5)',\n 'instance_count': '-',\n 'variant': 'AllTraffic',\n 'model_name': 'abalone-regression',\n 'model_type': 'regressor',\n 'model_metrics': RMSE R2 MAPE MedAE NumRows\n 1.64 2.246 0.502 16.393 1.209 834,\n 'confusion_matrix': None,\n 'predictions': class_number_of_rings prediction id\n 0 16 10.516158 7\n 1 9 9.031365 8\n 2 10 9.264600 17\n 3 7 8.578638 18\n 4 12 10.492446 27\n .. ... ... ...\n 829 11 11.915862 4148\n 830 8 8.210898 4157\n 831 8 7.693689 4158\n 832 9 7.542521 4167\n 833 8 9.060015 4168\n
"},{"location":"cached/cached_feature_set/","title":"CachedFeatureSet","text":"Model Examples
Examples of using the CachedFeatureSet class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually, but the SageWorks Model Class makes it a breeze!
CachedFeatureSet: Caches the method results for SageWorks FeatureSets
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet","title":"CachedFeatureSet
","text":" Bases: CachedArtifactMixin
, FeatureSetCore
CachedFeatureSet: Caches the method results for SageWorks FeatureSets
Note: Cached method values may lag underlying FeatureSet changes.
Common Usage
my_features = CachedFeatureSet(name)\nmy_features.details()\nmy_features.health_check()\nmy_features.sageworks_meta()\n
Source code in src/sageworks/cached/cached_feature_set.py
class CachedFeatureSet(CachedArtifactMixin, FeatureSetCore):\n \"\"\"CachedFeatureSet: Caches the method results for SageWorks FeatureSets\n\n Note: Cached method values may lag underlying FeatureSet changes.\n\n Common Usage:\n ```python\n my_features = CachedFeatureSet(name)\n my_features.details()\n my_features.health_check()\n my_features.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedFeatureSet Initialization\"\"\"\n FeatureSetCore.__init__(self, feature_set_uuid=feature_set_uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Health Check.\n\n Returns:\n dict: A dictionary of health check details for the FeatureSet\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this FeatureSet.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this FeatureSet.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.__init__","title":"__init__(feature_set_uuid, database='sageworks')
","text":"CachedFeatureSet Initialization
Source code in src/sageworks/cached/cached_feature_set.py
def __init__(self, feature_set_uuid: str, database: str = \"sageworks\"):\n \"\"\"CachedFeatureSet Initialization\"\"\"\n FeatureSetCore.__init__(self, feature_set_uuid=feature_set_uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.details","title":"details(**kwargs)
","text":"Retrieve the FeatureSet Details.
Returns:
dict: A dictionary of details about the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.health_check","title":"health_check(**kwargs)
","text":"Retrieve the FeatureSet Health Check.
Returns:
dict: A dictionary of health check details for the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Health Check.\n\n Returns:\n dict: A dictionary of health check details for the FeatureSet\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the SageWorks Metadata for this FeatureSet.
Returns:
Union[dict, None]: Dictionary of SageWorks metadata for this Artifact
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Retrieve the SageWorks Metadata for this FeatureSet.\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.smart_sample","title":"smart_sample()
","text":"Retrieve the Smart Sample for this FeatureSet.
Returns:
pd.DataFrame: The Smart Sample DataFrame
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Retrieve the Smart Sample for this FeatureSet.\n\n Returns:\n pd.DataFrame: The Smart Sample DataFrame\n \"\"\"\n return super().smart_sample(recompute=False)\n
"},{"location":"cached/cached_feature_set/#sageworks.cached.cached_feature_set.CachedFeatureSet.summary","title":"summary(**kwargs)
","text":"Retrieve the FeatureSet Details.
Returns:
dict: A dictionary of details about the FeatureSet
Source code in src/sageworks/cached/cached_feature_set.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the FeatureSet Details.\n\n Returns:\n dict: A dictionary of details about the FeatureSet\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_feature_set/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example please visit our SageWorks Examples
Pull FeatureSet Details
from sageworks.cached.cached_feature_set import CachedFeatureSet\n\n# Grab a FeatureSet\nfs = CachedFeatureSet(\"abalone_features\")\n\n# Show the details\nfs.details()\n\n> fs.details()\n\n{'uuid': 'abalone_features',\n 'health_tags': [],\n 'aws_arn': 'arn:aws:glue:x:table/sageworks/abalone_data',\n 'size': 0.070272,\n 'created': '2024-11-09T20:42:34.000Z',\n 'modified': '2024-11-10T19:57:52.000Z',\n 'input': 's3://sageworks-public-data/common/aBaLone.CSV',\n 'sageworks_health_tags': '',\n 'sageworks_correlations': {'length': {'diameter': 0.9868115846024996,\n
"},{"location":"cached/cached_meta/","title":"CachedMeta","text":"CachedMeta Examples
Examples of using the CachedMeta class are listed in the Examples section at the bottom of this page.
CachedMeta: A class that provides caching for the Meta() class
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta","title":"CachedMeta
","text":" Bases: CloudMeta
CachedMeta: Singleton class for caching metadata functionality.
Common Usage
from sageworks.cached.cached_meta import CachedMeta\nmeta = CachedMeta()\n\n# Get the AWS Account Info\nmeta.account()\nmeta.config()\n\n# These are 'list' methods\nmeta.etl_jobs()\nmeta.data_sources()\nmeta.feature_sets(details=True)  # or details=False\nmeta.models(details=True)  # or details=False\nmeta.endpoints()\nmeta.views()\n\n# These are 'describe' methods\nmeta.data_source(\"abalone_data\")\nmeta.feature_set(\"abalone_features\")\nmeta.model(\"abalone-regression\")\nmeta.endpoint(\"abalone-endpoint\")\n
Source code in src/sageworks/cached/cached_meta.py
class CachedMeta(CloudMeta):\n \"\"\"CachedMeta: Singleton class for caching metadata functionality.\n\n Common Usage:\n ```python\n from sageworks.cached.cached_meta import CachedMeta\n meta = CachedMeta()\n\n # Get the AWS Account Info\n meta.account()\n meta.config()\n\n # These are 'list' methods\n meta.etl_jobs()\n meta.data_sources()\n meta.feature_sets(details=True/False)\n meta.models(details=True/False)\n meta.endpoints()\n meta.views()\n\n # These are 'describe' methods\n meta.data_source(\"abalone_data\")\n meta.feature_set(\"abalone_features\")\n meta.model(\"abalone-regression\")\n meta.endpoint(\"abalone-endpoint\")\n ```\n \"\"\"\n\n _instance = None # Class attribute to hold the singleton instance\n\n def __new__(cls, *args, **kwargs):\n if cls._instance is None:\n cls._instance = super(CachedMeta, cls).__new__(cls)\n return cls._instance\n\n def __init__(self):\n \"\"\"CachedMeta Initialization\"\"\"\n if hasattr(self, \"_initialized\") and self._initialized:\n return # Prevent reinitialization\n\n self.log = logging.getLogger(\"sageworks\")\n self.log.important(\"Initializing CachedMeta...\")\n super().__init__()\n\n # Create both our Meta Cache and Fresh Cache (tracks if data is stale)\n self.meta_cache = SageWorksCache(prefix=\"meta\")\n self.fresh_cache = SageWorksCache(prefix=\"meta_fresh\", expire=90) # 90-second expiration\n\n # Create a ThreadPoolExecutor for refreshing stale data\n self.thread_pool = ThreadPoolExecutor(max_workers=5)\n\n # Mark the instance as initialized\n self._initialized = True\n\n def check(self):\n \"\"\"Check if our underlying caches are working\"\"\"\n return self.meta_cache.check()\n\n def list_meta_cache(self):\n \"\"\"List the current Meta Cache\"\"\"\n return self.meta_cache.list_keys()\n\n def clear_meta_cache(self):\n \"\"\"Clear the current Meta Cache\"\"\"\n self.meta_cache.clear()\n\n @cache_result\n def account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account 
Info\n \"\"\"\n return super().account()\n\n @cache_result\n def config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n\n @cache_result\n def incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n\n @cache_result\n def etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n\n @cache_result\n def data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n\n @cache_result\n def views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n\n @cache_result\n def feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n\n @cache_result\n def models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. 
Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n\n @cache_result\n def endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n\n @cache_result\n def glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n\n @cache_result\n def data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(data_source_name=data_source_name, database=database)\n\n @cache_result\n def feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_set_name=feature_set_name)\n\n @cache_result\n def model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_name=model_name)\n\n @cache_result\n def endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n 
Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n\n def _refresh_data_in_background(self, cache_key, method, *args, **kwargs):\n \"\"\"Background task to refresh AWS metadata.\"\"\"\n result = method(self, *args, **kwargs)\n self.meta_cache.set(cache_key, result)\n self.log.debug(f\"Updated Metadata for {cache_key}\")\n\n @staticmethod\n def _flatten_redis_key(method, *args, **kwargs):\n \"\"\"Flatten the args and kwargs into a single string\"\"\"\n arg_str = \"_\".join(str(arg) for arg in args)\n kwarg_str = \"_\".join(f\"{k}_{v}\" for k, v in sorted(kwargs.items()))\n return f\"{method.__name__}_{arg_str}_{kwarg_str}\".replace(\" \", \"\").replace(\"'\", \"\")\n\n def __del__(self):\n \"\"\"Destructor to shut down the thread pool gracefully.\"\"\"\n self.close()\n\n def close(self):\n \"\"\"Explicitly close the thread pool, if needed.\"\"\"\n if self.thread_pool:\n self.log.important(\"Shutting down the ThreadPoolExecutor...\")\n try:\n self.thread_pool.shutdown(wait=True) # Gracefully shutdown\n except RuntimeError as e:\n self.log.error(f\"Error during thread pool shutdown: {e}\")\n finally:\n self.thread_pool = None\n\n def __repr__(self):\n return f\"CachedMeta()\\n\\t{repr(self.meta_cache)}\\n\\t{super().__repr__()}\"\n
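To make the cache keys built by `_flatten_redis_key` concrete, here is a small sketch that copies its string logic into a standalone function and applies it to two placeholder functions. The placeholder functions and the variable names are illustrative only, and how the real `cache_result` decorator passes arguments into the key builder is not shown in this section, so treat the call shapes as assumptions.

```python
SINGLE_QUOTE = chr(39)  # the single-quote character that gets stripped from keys


def flatten_redis_key(method, *args, **kwargs):
    # Same string logic as CachedMeta._flatten_redis_key in the listing above
    arg_str = '_'.join(str(arg) for arg in args)
    kwarg_str = '_'.join(f'{k}_{v}' for k, v in sorted(kwargs.items()))
    key = f'{method.__name__}_{arg_str}_{kwarg_str}'
    return key.replace(' ', '').replace(SINGLE_QUOTE, '')


# Placeholder functions: only their __name__ matters for the key
def feature_sets(details=False):
    return None


def data_source(data_source_name, database='sageworks'):
    return None


key_a = flatten_redis_key(feature_sets, details=True)
key_b = flatten_redis_key(data_source, 'abalone_data')
key_c = flatten_redis_key(data_source, 'abalone_data', database='sageworks')
```

Here `key_a` is `feature_sets__details_True`, `key_b` is `data_source_abalone_data_`, and `key_c` is `data_source_abalone_data_database_sageworks`. Under this logic, passing a default value explicitly produces a different key than omitting it, so the same logical request can occupy two cache entries.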
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.__del__","title":"__del__()
","text":"Destructor to shut down the thread pool gracefully.
Source code in src/sageworks/cached/cached_meta.py
def __del__(self):\n \"\"\"Destructor to shut down the thread pool gracefully.\"\"\"\n self.close()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.__init__","title":"__init__()
","text":"CachedMeta Initialization
Source code in src/sageworks/cached/cached_meta.py
def __init__(self):\n \"\"\"CachedMeta Initialization\"\"\"\n if hasattr(self, \"_initialized\") and self._initialized:\n return # Prevent reinitialization\n\n self.log = logging.getLogger(\"sageworks\")\n self.log.important(\"Initializing CachedMeta...\")\n super().__init__()\n\n # Create both our Meta Cache and Fresh Cache (tracks if data is stale)\n self.meta_cache = SageWorksCache(prefix=\"meta\")\n self.fresh_cache = SageWorksCache(prefix=\"meta_fresh\", expire=90) # 90-second expiration\n\n # Create a ThreadPoolExecutor for refreshing stale data\n self.thread_pool = ThreadPoolExecutor(max_workers=5)\n\n # Mark the instance as initialized\n self._initialized = True\n
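The initializer sets up a long-lived meta cache, a short-lived freshness cache, and a thread pool for refreshing stale data. The `cache_result` decorator that ties them together is not shown here, so the following is only a sketch of one way such a layout can work: return the last known value immediately and refresh stale entries in the background. `TinyCachedMeta`, `slow_fetch`, and the in-memory dictionaries are hypothetical stand-ins, not the SageWorks implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TinyCachedMeta:
    # Hypothetical miniature of the two-cache layout (assumption, not SageWorks code)

    def __init__(self, fetch, fresh_seconds=90.0):
        self.fetch = fetch            # slow call, e.g. a cloud API request
        self.fresh_seconds = fresh_seconds
        self.meta_cache = {}          # key -> last known value, never expires
        self.fresh_until = {}         # key -> time after which the value is stale
        self.thread_pool = ThreadPoolExecutor(max_workers=5)

    def _refresh(self, key):
        self.meta_cache[key] = self.fetch(key)

    def get(self, key):
        now = time.monotonic()
        if key not in self.meta_cache:
            # Nothing cached yet: block once, without also queuing a refresh
            self.fresh_until[key] = now + self.fresh_seconds
            self._refresh(key)
            return self.meta_cache[key]
        value = self.meta_cache[key]  # read before any refresh can overwrite it
        if self.fresh_until.get(key, now) <= now:
            # Stale: mark fresh first so only one refresh is queued
            self.fresh_until[key] = now + self.fresh_seconds
            self.thread_pool.submit(self._refresh, key)
        return value


calls = []


def slow_fetch(key):
    calls.append(key)
    return f'{key} v{len(calls)}'


meta = TinyCachedMeta(slow_fetch, fresh_seconds=0.0)  # everything is stale at once
first = meta.get('models')    # blocks on the initial fetch
second = meta.get('models')   # returns the old value and queues a refresh
meta.thread_pool.shutdown(wait=True)
third = meta.meta_cache['models']
```

With `fresh_seconds=0.0` the second call sees a stale entry, so it returns the first value while the refresh runs; once the pool has drained, the cache holds the second value.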
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.account","title":"account()
","text":"Cloud Platform Account Info
Returns:
dict: Cloud Platform Account Info
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef account(self) -> dict:\n \"\"\"Cloud Platform Account Info\n\n Returns:\n dict: Cloud Platform Account Info\n \"\"\"\n return super().account()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.check","title":"check()
","text":"Check if our underlying caches are working
Source code in src/sageworks/cached/cached_meta.py
def check(self):\n \"\"\"Check if our underlying caches are working\"\"\"\n return self.meta_cache.check()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.clear_meta_cache","title":"clear_meta_cache()
","text":"Clear the current Meta Cache
Source code in src/sageworks/cached/cached_meta.py
def clear_meta_cache(self):\n \"\"\"Clear the current Meta Cache\"\"\"\n self.meta_cache.clear()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.close","title":"close()
","text":"Explicitly close the thread pool, if needed.
Source code in src/sageworks/cached/cached_meta.py
def close(self):\n \"\"\"Explicitly close the thread pool, if needed.\"\"\"\n if self.thread_pool:\n self.log.important(\"Shutting down the ThreadPoolExecutor...\")\n try:\n self.thread_pool.shutdown(wait=True) # Gracefully shutdown\n except RuntimeError as e:\n self.log.error(f\"Error during thread pool shutdown: {e}\")\n finally:\n self.thread_pool = None\n
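The `close()` pattern above can be tried in isolation. The sketch below uses a hypothetical `Worker` class that owns a pool in the same way; it shows that queued work finishes before shutdown returns and that a second `close()` is a no-op because the pool reference has been cleared.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    # Hypothetical class that owns a pool, mirroring the close() pattern above
    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=2)

    def close(self):
        if self.thread_pool:
            try:
                self.thread_pool.shutdown(wait=True)  # wait for queued work
            finally:
                self.thread_pool = None  # makes a second close() a no-op


worker = Worker()
future = worker.thread_pool.submit(sum, [1, 2, 3])
worker.close()
worker.close()  # safe: the pool reference is already None
result = future.result()
```

Calling `close()` explicitly is usually the more predictable option, since Python does not guarantee when, or whether, `__del__` runs at interpreter exit.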
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.config","title":"config()
","text":"Return the current SageWorks Configuration
Returns:
dict: The current SageWorks Configuration
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef config(self) -> dict:\n \"\"\"Return the current SageWorks Configuration\n\n Returns:\n dict: The current SageWorks Configuration\n \"\"\"\n return super().config()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.data_source","title":"data_source(data_source_name, database='sageworks')
","text":"Get the details of a specific Data Source
Parameters:
data_source_name (str): The name of the Data Source. Required.
database (str, optional): The Glue database. Defaults to 'sageworks'.
Returns:
Union[dict, None]: The details of the Data Source (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef data_source(self, data_source_name: str, database: str = \"sageworks\") -> Union[dict, None]:\n \"\"\"Get the details of a specific Data Source\n\n Args:\n data_source_name (str): The name of the Data Source\n database (str, optional): The Glue database. Defaults to 'sageworks'.\n\n Returns:\n dict: The details of the Data Source (None if not found)\n \"\"\"\n return super().data_source(data_source_name=data_source_name, database=database)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.data_sources","title":"data_sources()
","text":"Get a summary of the Data Sources deployed in the Cloud Platform
Returns:
pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef data_sources(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Data Sources deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Data Sources deployed in the Cloud Platform\n \"\"\"\n return super().data_sources()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.endpoint","title":"endpoint(endpoint_name)
","text":"Get the details of a specific Endpoint
Parameters:
endpoint_name (str): The name of the Endpoint. Required.
Returns:
Union[dict, None]: The details of the Endpoint (None if not found)
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef endpoint(self, endpoint_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Endpoint\n\n Args:\n endpoint_name (str): The name of the Endpoint\n\n Returns:\n dict: The details of the Endpoint (None if not found)\n \"\"\"\n return super().endpoint(endpoint_name=endpoint_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.endpoints","title":"endpoints()
","text":"Get a summary of the Endpoints deployed in the Cloud Platform
Returns:
pd.DataFrame: A summary of the Endpoints in the Cloud Platform
Source code in src/sageworks/cached/cached_meta.py
@cache_result\ndef endpoints(self) -> pd.DataFrame:\n \"\"\"Get a summary of the Endpoints deployed in the Cloud Platform\n\n Returns:\n pd.DataFrame: A summary of the Endpoints in the Cloud Platform\n \"\"\"\n return super().endpoints()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.etl_jobs","title":"etl_jobs()
","text":"Get summary data about Extract, Transform, Load (ETL) Jobs
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef etl_jobs(self) -> pd.DataFrame:\n \"\"\"Get summary data about Extract, Transform, Load (ETL) Jobs\n\n Returns:\n pd.DataFrame: A summary of the ETL Jobs deployed in the Cloud Platform\n \"\"\"\n return super().etl_jobs()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.feature_set","title":"feature_set(feature_set_name)
","text":"Get the details of a specific Feature Set
Parameters:
Name Type Description Defaultfeature_set_name
str
The name of the Feature Set
requiredReturns:
Name Type Descriptiondict
Union[dict, None]
The details of the Feature Set (None if not found)
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef feature_set(self, feature_set_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Feature Set\n\n Args:\n feature_set_name (str): The name of the Feature Set\n\n Returns:\n dict: The details of the Feature Set (None if not found)\n \"\"\"\n return super().feature_set(feature_set_name=feature_set_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.feature_sets","title":"feature_sets(details=False)
","text":"Get a summary of the Feature Sets deployed in the Cloud Platform
Parameters:
Name Type Description Defaultdetails
bool
Include detailed information. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef feature_sets(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Feature Sets deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Feature Sets deployed in the Cloud Platform\n \"\"\"\n return super().feature_sets(details=details)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.glue_job","title":"glue_job(job_name)
","text":"Get the details of a specific Glue Job
Parameters:
Name Type Description Defaultjob_name
str
The name of the Glue Job
requiredReturns:
Name Type Descriptiondict
Union[dict, None]
The details of the Glue Job (None if not found)
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef glue_job(self, job_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Glue Job\n\n Args:\n job_name (str): The name of the Glue Job\n\n Returns:\n dict: The details of the Glue Job (None if not found)\n \"\"\"\n return super().glue_job(job_name=job_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.incoming_data","title":"incoming_data()
","text":"Get summary data about data in the incoming raw data
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the incoming raw data
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef incoming_data(self) -> pd.DataFrame:\n \"\"\"Get summary data about data in the incoming raw data\n\n Returns:\n pd.DataFrame: A summary of the incoming raw data\n \"\"\"\n return super().incoming_data()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.list_meta_cache","title":"list_meta_cache()
","text":"List the current Meta Cache
Source code insrc/sageworks/cached/cached_meta.py
def list_meta_cache(self):\n \"\"\"List the current Meta Cache\"\"\"\n return self.meta_cache.list_keys()\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.model","title":"model(model_name)
","text":"Get the details of a specific Model
Parameters:
Name Type Description Defaultmodel_name
str
The name of the Model
requiredReturns:
Name Type Descriptiondict
Union[dict, None]
The details of the Model (None if not found)
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef model(self, model_name: str) -> Union[dict, None]:\n \"\"\"Get the details of a specific Model\n\n Args:\n model_name (str): The name of the Model\n\n Returns:\n dict: The details of the Model (None if not found)\n \"\"\"\n return super().model(model_name=model_name)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.models","title":"models(details=False)
","text":"Get a summary of the Models deployed in the Cloud Platform
Parameters:
Name Type Description Defaultdetails
bool
Include detailed information. Defaults to False.
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of the Models deployed in the Cloud Platform
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef models(self, details: bool = False) -> pd.DataFrame:\n \"\"\"Get a summary of the Models deployed in the Cloud Platform\n\n Args:\n details (bool, optional): Include detailed information. Defaults to False.\n\n Returns:\n pd.DataFrame: A summary of the Models deployed in the Cloud Platform\n \"\"\"\n return super().models(details=details)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.CachedMeta.views","title":"views(database='sageworks')
","text":"Get a summary of the all the Views, for the given database, in AWS
Parameters:
Name Type Description Defaultdatabase
str
Glue database. Defaults to 'sageworks'.
'sageworks'
Returns:
Type DescriptionDataFrame
pd.DataFrame: A summary of all the Views, for the given database, in AWS
Source code insrc/sageworks/cached/cached_meta.py
@cache_result\ndef views(self, database: str = \"sageworks\") -> pd.DataFrame:\n \"\"\"Get a summary of the all the Views, for the given database, in AWS\n\n Args:\n database (str, optional): Glue database. Defaults to 'sageworks'.\n\n Returns:\n pd.DataFrame: A summary of all the Views, for the given database, in AWS\n \"\"\"\n return super().views(database=database)\n
"},{"location":"cached/cached_meta/#sageworks.cached.cached_meta.cache_result","title":"cache_result(method)
","text":"Decorator to cache method results in meta_cache
Source code insrc/sageworks/cached/cached_meta.py
def cache_result(method):\n \"\"\"Decorator to cache method results in meta_cache\"\"\"\n\n @wraps(method)\n def wrapper(self, *args, **kwargs):\n # Create a unique cache key based on the method name and arguments\n cache_key = CachedMeta._flatten_redis_key(method, *args, **kwargs)\n\n # Check for fresh data, spawn thread to refresh if stale\n if SageWorksCache.refresh_enabled and self.fresh_cache.get(cache_key) is None:\n self.log.debug(f\"Async: Metadata for {cache_key} refresh thread started...\")\n self.fresh_cache.set(cache_key, True) # Mark as refreshed\n\n # Spawn a thread to refresh data without blocking\n self.thread_pool.submit(self._refresh_data_in_background, cache_key, method, *args, **kwargs)\n\n # Return data (fresh or stale) if available\n cached_value = self.meta_cache.get(cache_key)\n if cached_value is not None:\n return cached_value\n\n # Fall back to calling the method if no cached data found\n self.log.important(f\"Blocking: Getting Metadata for {cache_key}\")\n result = method(self, *args, **kwargs)\n self.meta_cache.set(cache_key, result)\n return result\n\n return wrapper\n
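The pattern in the decorator above (serve whatever is cached, refresh stale entries in a worker thread, and block only when nothing is cached) can be sketched without Redis or AWS. This is a minimal illustration, not the SageWorks implementation: `cache_with_background_refresh`, `SlowCatalog`, and the `values`, `fresh`, `lock`, and `pool` attributes are hypothetical names, and plain in-memory containers stand in for `meta_cache` and `fresh_cache`. Unlike the listing above, this sketch queues a background refresh only when a cached value is already available.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


def cache_with_background_refresh(method):
    """Serve cached values immediately; refresh stale entries in a worker thread."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        # Build a hashable key from the method name and its arguments
        key = (method.__name__, args, tuple(sorted(kwargs.items())))

        def refresh():
            value = method(self, *args, **kwargs)
            with self.lock:
                self.values[key] = value

        with self.lock:
            needs_refresh = key not in self.fresh
            if needs_refresh:
                self.fresh.add(key)  # mark first so only one refresh is queued per key
            has_value = key in self.values
            cached = self.values.get(key)

        if has_value:
            # Return the (possibly stale) value now, refresh without blocking
            if needs_refresh:
                self.pool.submit(refresh)
            return cached

        # Nothing cached yet: block once to compute and store the value
        refresh()
        with self.lock:
            return self.values[key]

    return wrapper


class SlowCatalog:
    """Hypothetical stand-in for a class that wraps slow cloud API calls."""

    def __init__(self):
        self.values = {}    # plays the role of meta_cache
        self.fresh = set()  # plays the role of fresh_cache (no expiry in this sketch)
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.calls = 0

    @cache_with_background_refresh
    def models(self, details=False):
        self.calls += 1
        return [f"model-{self.calls}"]
```

In a real deployment the freshness marker would expire after some interval; here, clearing `fresh` by hand simulates that expiry.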
"},{"location":"cached/cached_meta/#examples","title":"Examples","text":"These example show how to use the CachedMeta()
class to pull lists of artifacts from AWS: DataSources, FeatureSets, Models, Endpoints, and more. If you're building a web interface plugin, the CachedMeta class is a great place to start.
SageWorks REPL
If you'd like to see exactly what data/details you get back from the CachedMeta()
class, you can spin up the SageWorks REPL, use the class and test out all the methods. Try it out! SageWorks REPL
meta = CachedMeta()\nmodel_df = meta.models()\nmodel_df\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\n
List the Models in AWS
from pprint import pprint\nfrom sageworks.cached.cached_meta import CachedMeta\n\n# Create our CachedMeta Class and get a list of our Models\nmeta = CachedMeta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more detailed data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names:\n pprint(meta.model(name))\n
Output
Number of Models: 3\n Model Group Health Owner ... Input Status Description\n0 wine-classification healthy - ... wine_features Completed Wine Classification Model\n1 abalone-regression-full healthy - ... abalone_features Completed Abalone Regression Model\n2 abalone-regression healthy - ... abalone_features Completed Abalone Regression Model\n\n[3 rows x 10 columns]\nwine-classification\nabalone-regression-full\nabalone-regression\n
Getting Model Performance Metrics
from sageworks.cached.cached_meta import CachedMeta\n\n# Create our CachedMeta Class and get a list of our Models\nmeta = CachedMeta()\nmodel_df = meta.models()\n\nprint(f\"Number of Models: {len(model_df)}\")\nprint(model_df)\n\n# Get more detailed data on the Models\nmodel_names = model_df[\"Model Group\"].tolist()\nfor name in model_names[:5]:\n model_details = meta.model(name)\n print(f\"\\n\\nModel: {name}\")\n performance_metrics = model_details[\"sageworks_meta\"][\"sageworks_inference_metrics\"]\n print(f\"\\tPerformance Metrics: {performance_metrics}\")\n
Output
wine-classification\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/wine-classification\n Description: Wine Classification Model\n Tags: wine::classification\n Performance Metrics:\n [{'wine_class': 'TypeA', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 12}, {'wine_class': 'TypeB', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 14}, {'wine_class': 'TypeC', 'precision': 1.0, 'recall': 1.0, 'fscore': 1.0, 'roc_auc': 1.0, 'support': 9}]\n\nabalone-regression\n ARN: arn:aws:sagemaker:us-west-2:507740646243:model-package-group/abalone-regression\n Description: Abalone Regression Model\n Tags: abalone::regression\n Performance Metrics:\n [{'MAE': 1.64, 'RMSE': 2.246, 'R2': 0.502, 'MAPE': 16.393, 'MedAE': 1.209, 'NumRows': 834}]\n
List the Endpoints in AWS
from pprint import pprint\nfrom sageworks.cached.cached_meta import CachedMeta\n\n# Create our CachedMeta Class and get a list of our Endpoints\nmeta = CachedMeta()\nendpoint_df = meta.endpoints()\nprint(f\"Number of Endpoints: {len(endpoint_df)}\")\nprint(endpoint_df)\n\n# Get more detailed data on the Endpoints\nendpoint_names = endpoint_df[\"Name\"].tolist()\nfor name in endpoint_names:\n pprint(meta.endpoint(name))\n
Output
Number of Endpoints: 2\n Name Health Instance Created ... Status Variant Capture Samp(%)\n0 wine-classification-end healthy Serverless (2GB/5) 2024-03-23 23:09 ... InService AllTraffic False -\n1 abalone-regression-end healthy Serverless (2GB/5) 2024-03-23 21:11 ... InService AllTraffic False -\n\n[2 rows x 10 columns]\nwine-classification-end\n<lots of details about endpoints>\n
Not Finding some particular AWS Data?
Some methods of the SageWorks CachedMeta API Class, such as models() and feature_sets(), also accept a details=True
argument, so make sure to check those out.
Model Examples
Examples of using the Model Class are in the Examples section at the bottom of this page. AWS Model setup and deployment are quite complicated to do manually but the SageWorks Model Class makes it a breeze!
CachedModel: Caches the method results for SageWorks Models
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel","title":"CachedModel
","text":" Bases: CachedArtifactMixin
, ModelCore
CachedModel: Caches the method results for SageWorks Models
Note: Cached method values may lag underlying Model changes.
Common Usagemy_model = CachedModel(name)\nmy_model.details()\nmy_model.health_check()\nmy_model.sageworks_meta()\n
Source code in src/sageworks/cached/cached_model.py
class CachedModel(CachedArtifactMixin, ModelCore):\n \"\"\"CachedModel: Caches the method results for SageWorks Models\n\n Note: Cached method values may lag underlying Model changes.\n\n Common Usage:\n ```python\n my_model = CachedModel(name)\n my_model.details()\n my_model.health_check()\n my_model.sageworks_meta()\n ```\n \"\"\"\n\n def __init__(self, uuid: str):\n \"\"\"CachedModel Initialization\"\"\"\n ModelCore.__init__(self, model_uuid=uuid, use_cached_meta=True)\n\n @CachedArtifactMixin.cache_result\n def summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().summary(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().details(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedModel\n \"\"\"\n return super().health_check(**kwargs)\n\n @CachedArtifactMixin.cache_result\n def sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n\n @CachedArtifactMixin.cache_result\n def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Inference Path.\n\n Returns:\n str: The Endpoint Inference Path\n \"\"\"\n return super().get_endpoint_inference_path()\n\n @CachedArtifactMixin.cache_result\n def list_inference_runs(self) -> list[str]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Returns:\n list[str]: List of Inference Runs\n \"\"\"\n return super().list_inference_runs()\n\n @CachedArtifactMixin.cache_result\n def 
get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Metrics (might be None)\n \"\"\"\n return super().get_inference_metrics(capture_uuid=capture_uuid)\n\n @CachedArtifactMixin.cache_result\n def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: training_holdout)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n # Note: This method can generate larger dataframes, so we'll sample if needed\n df = super().get_inference_predictions(capture_uuid=capture_uuid)\n if df is not None and len(df) > 5000:\n self.log.warning(f\"{self.uuid}:{capture_uuid} Sampling Inference Predictions to 5000 rows\")\n return df.sample(5000)\n return df\n\n @CachedArtifactMixin.cache_result\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion matrix for the model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n return super().confusion_matrix(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.__init__","title":"__init__(uuid)
","text":"CachedModel Initialization
Source code insrc/sageworks/cached/cached_model.py
def __init__(self, uuid: str):\n \"\"\"CachedModel Initialization\"\"\"\n ModelCore.__init__(self, model_uuid=uuid, use_cached_meta=True)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion matrix for the model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: latest)
'latest'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Confusion Matrix (might be None)
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion matrix for the model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n return super().confusion_matrix(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.details","title":"details(**kwargs)
","text":"Retrieve the CachedModel Details.
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the CachedModel
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef details(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().details(**kwargs)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Retrieve the Endpoint Inference Path.
Returns:
Name Type Descriptionstr
Union[str, None]
The Endpoint Inference Path
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Retrieve the Endpoint Inference Path.\n\n Returns:\n str: The Endpoint Inference Path\n \"\"\"\n return super().get_endpoint_inference_path()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: latest)
'latest'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Captured Metrics (might be None)
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: latest)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Metrics (might be None)\n \"\"\"\n return super().get_inference_metrics(capture_uuid=capture_uuid)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.get_inference_predictions","title":"get_inference_predictions(capture_uuid='auto_inference')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: auto_inference)
'auto_inference'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Captured Predictions (might be None)
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n # Note: This method can generate larger dataframes, so we'll sample if needed\n df = super().get_inference_predictions(capture_uuid=capture_uuid)\n if df is not None and len(df) > 5000:\n self.log.warning(f\"{self.uuid}:{capture_uuid} Sampling Inference Predictions to 5000 rows\")\n return df.sample(5000)\n return df\n
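The row cap in the method above keeps cached results small by sampling when a result exceeds 5000 rows. The same idea can be sketched with the standard library on a plain list of rows. `cap_rows` and its `seed` parameter are hypothetical and only here for illustration; the original calls `df.sample(5000)` on a pandas DataFrame without a seed.

```python
import random


def cap_rows(rows, max_rows=5000, seed=None):
    """Return rows unchanged if small enough, otherwise a random sample of max_rows."""
    if rows is not None and len(rows) > max_rows:
        # Sample without replacement, mirroring what DataFrame.sample(n) does by default
        return random.Random(seed).sample(rows, max_rows)
    return rows


small = cap_rows([1, 2, 3])                # under the cap: returned as-is
large = cap_rows(list(range(12000)))       # over the cap: 5000 sampled rows
print(len(small), len(large))
```

Because the sample is random, two blocking calls can cache different subsets of the same predictions; passing a fixed seed is one way to make the cached subset reproducible.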
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.health_check","title":"health_check(**kwargs)
","text":"Retrieve the CachedModel Health Check.
Returns:
Name Type Descriptiondict
dict
A dictionary of health check details for the CachedModel
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef health_check(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Health Check.\n\n Returns:\n dict: A dictionary of health check details for the CachedModel\n \"\"\"\n return super().health_check(**kwargs)\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.list_inference_runs","title":"list_inference_runs()
","text":"Retrieve the captured prediction results for this model
Returns:
Type Descriptionlist[str]
list[str]: List of Inference Runs
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef list_inference_runs(self) -> list[str]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Returns:\n list[str]: List of Inference Runs\n \"\"\"\n return super().list_inference_runs()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.sageworks_meta","title":"sageworks_meta()
","text":"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFER, etc).
Returns:
Name Type Descriptionstr
Union[str, None]
The Enumerated Model Type
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef sageworks_meta(self) -> Union[str, None]:\n \"\"\"Retrieve the Enumerated Model Type (REGRESSOR, CLASSIFIER, etc).\n\n Returns:\n str: The Enumerated Model Type\n \"\"\"\n return super().sageworks_meta()\n
"},{"location":"cached/cached_model/#sageworks.cached.cached_model.CachedModel.summary","title":"summary(**kwargs)
","text":"Retrieve the CachedModel Details.
Returns:
Name Type Descriptiondict
dict
A dictionary of details about the CachedModel
Source code insrc/sageworks/cached/cached_model.py
@CachedArtifactMixin.cache_result\ndef summary(self, **kwargs) -> dict:\n \"\"\"Retrieve the CachedModel Details.\n\n Returns:\n dict: A dictionary of details about the CachedModel\n \"\"\"\n return super().summary(**kwargs)\n
"},{"location":"cached/cached_model/#examples","title":"Examples","text":"All of the SageWorks Examples are in the Sageworks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Pull Inference Run
from sageworks.cached.cached_model import CachedModel\n\n# Grab a Model\nmodel = CachedModel(\"abalone-regression\")\n\n# List the inference runs\nmodel.list_inference_runs()\n['auto_inference', 'model_training']\n\n# Grab specific inference results\nmodel.get_inference_predictions(\"auto_inference\")\n class_number_of_rings prediction id\n0 16 10.516158 7\n1 9 9.031365 8\n.. ... ... ...\n831 8 7.693689 4158\n832 9 7.542521 4167\n
"},{"location":"cached/overview/","title":"Caching Overview","text":"Caching is Crazy
Yes, but it's a necessary evil for Web Interfaces. AWS APIs (boto3, SageMaker) often take multiple seconds to respond and will often throttle requests if spammed. So, for quicker responses and less spamming, we use Cached Classes for any Web Interface work.
"},{"location":"cached/overview/#welcome-to-the-sageworks-cached-classes","title":"Welcome to the SageWorks Cached Classes","text":"These classes provide caching for the most used SageWorks classes. They transparently handle all the details around retrieving and caching results from the underlying classes.
Examples
All of the SageWorks Examples are in the SageWorks repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord.
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines. As part of this we're including CloudWatch log forwarding/aggregation for any service using the SageWorks API (Dashboard, Glue, Lambda, Notebook, Laptop, etc).
"},{"location":"cloudwatch/#log-groups-and-streams","title":"Log Groups and Streams","text":"The SageWorks logging setup includes the addition of a CloudWatch 'Handler' that forwards all log messages to the SageWorksLogGroup
Individual Streams
Each process running SageWorks will get a unique individual stream.
Since many jobs run nightly or more often, the stream name also ends with a date and time, for example glue/my_job/2024_08_01_17_15
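The stream naming shown above can be reproduced with a timestamp format string. This is only a sketch: `stream_name` is a hypothetical helper, and the use of UTC is an assumption, since the text does not say which time zone the suffix uses.

```python
from datetime import datetime, timezone


def stream_name(source: str, job: str, when: datetime) -> str:
    """Build a per-process stream name such as glue/my_job/2024_08_01_17_15."""
    # Year, month, day, hour, and minute joined with underscores
    return f"{source}/{job}/{when.strftime('%Y_%m_%d_%H_%M')}"


# Assumption: timestamps are taken in UTC
started = datetime(2024, 8, 1, 17, 15, tzinfo=timezone.utc)
print(stream_name("glue", "my_job", started))  # glue/my_job/2024_08_01_17_15
```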
Logs in Easy Mode
The SageWorks cloud_watch
command line tool gives you access to important logs without the hassle. It automatically displays important events and the context around them.
pip install sageworks\ncloud_watch\n
The cloud_watch
script will automatically show the interesting (WARNING and CRITICAL) messages from any source within the last hour. The script has many options; use --help
to see the options and their descriptions.
cloud_watch --help\n
Here are some example options:
# Show important logs from the last 12 hours (720 minutes)\ncloud_watch --start-time 720 \n\n# Show a particular stream\ncloud_watch --stream glue/my_job \n\n# Show/search for a message substring\ncloud_watch --search SHAP\n\n# Show a given log level (matching and above)\ncloud_watch --log-level WARNING\ncloud_watch --log-level ERROR\ncloud_watch --log-level CRITICAL\n\n# Or show all events\ncloud_watch --log-level ALL\n\n# Combine flags \ncloud_watch --log-level ERROR --search SHAP\ncloud_watch --log-level ERROR --stream Dashboard\n
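The "matching and above" level filter and the substring search can be pictured as two simple predicates applied to each log event. The sketch below is not how the `cloud_watch` tool is implemented; `select_events` and the `(level, message)` tuple layout are hypothetical, and the level ordering is borrowed from Python's `logging` module.

```python
import logging

# Numeric severities from the standard logging module (DEBUG=10 ... CRITICAL=50)
LEVELS = {name: getattr(logging, name) for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")}


def select_events(events, min_level="WARNING", search=None):
    """Keep events at or above min_level whose message contains the search substring."""
    threshold = 0 if min_level == "ALL" else LEVELS[min_level]
    selected = []
    for level, message in events:
        if LEVELS[level] < threshold:
            continue  # below the requested severity
        if search is not None and search not in message:
            continue  # does not contain the substring
        selected.append((level, message))
    return selected


events = [
    ("INFO", "Health Check Passed abalone-regression"),
    ("WARNING", "SHAP values took longer than expected"),
    ("ERROR", "SHAP computation failed"),
    ("CRITICAL", "SageWorks Configuration Incomplete..."),
]
print(select_events(events, min_level="ERROR", search="SHAP"))
```

Combining both arguments corresponds to combining `--log-level` and `--search` on the command line: an event must pass both checks to be shown.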
These options can be combined, so experiment with them to build the perfect log search :)
"},{"location":"cloudwatch/#more-information","title":"More Information","text":"Check out our presentation on SageWorks CloudWatch
"},{"location":"cloudwatch/#access-through-aws-console","title":"Access through AWS Console","text":"Since we're leveraging AWS functionality, you can always use the AWS console to investigate the logs. In the AWS console go to CloudWatch... Log Groups... SageWorksLogGroup
"},{"location":"cloudwatch/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat with us on Discord.
"},{"location":"core_classes/overview/","title":"Core Classes","text":"SageWorks Core Classes
These classes interact with many of the Cloud Platform services and are therefore more complex. They provide additional control and refinement over the AWS ML Pipeline. For most use cases the API Classes should be used.
Welcome to the SageWorks Core Classes
The Core Classes provide low-level APIs for the SageWorks package. These classes interface directly with the AWS SageMaker Pipeline interfaces and have a large number of methods of reasonable complexity.
The API Classes have method pass-through so just call the method on the API Class and voil\u00e0 it works the same.
"},{"location":"core_classes/overview/#artifacts","title":"Artifacts","text":"Transforms are a set of classes that transform one type of Artifact
to another type. For instance DataToFeatureSet
takes a DataSource
artifact and creates a FeatureSet
artifact.
API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on any class that inherits from the Artifact Class and voil\u00e0 it works the same.
The SageWorks Artifact class is a base/abstract class that defines the API implemented by all the child classes (DataSource, FeatureSet, Model, Endpoint).
Artifact: Abstract Base Class for all Artifact classes in SageWorks. Artifacts simply reflect and aggregate one or more AWS Services
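One piece of shared behavior in the Artifact base class is name validation. The rule used by `generate_valid_name` and `is_name_valid` in the source listing on this page can be sketched as standalone functions. This version drops the class, the logging, and the AWS setup, so treat it as an illustration of the rule and not as the class itself.

```python
def generate_valid_name(name: str, delimiter: str = "_", lower_case: bool = True) -> str:
    """Keep letters, digits, '_' and '-', optionally lowercase, then normalize the delimiter."""
    valid_name = "".join(c for c in name if c.isalnum() or c in ["_", "-"])
    if lower_case:
        valid_name = valid_name.lower()
    # Replace both separators with the chosen delimiter
    return valid_name.replace("_", delimiter).replace("-", delimiter)


def is_name_valid(name: str, delimiter: str = "_", lower_case: bool = True) -> bool:
    """A name is valid when normalizing it changes nothing."""
    return name == generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)


print(generate_valid_name("My Model-v2"))                    # mymodel_v2
print(is_name_valid("abalone-regression", delimiter="-"))    # True
print(is_name_valid("Abalone Regression", delimiter="-"))    # False
```

Note that digits pass the filter as well as letters, since the check uses `isalnum()`.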
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact","title":"Artifact
","text":" Bases: ABC
Artifact: Abstract Base Class for all Artifact classes in SageWorks
Source code insrc/sageworks/core/artifacts/artifact.py
class Artifact(ABC):\n \"\"\"Artifact: Abstract Base Class for all Artifact classes in SageWorks\"\"\"\n\n # Class-level shared resources\n log = logging.getLogger(\"sageworks\")\n\n # Config Manager\n cm = ConfigManager()\n if not cm.config_okay():\n log = logging.getLogger(\"sageworks\")\n log.critical(\"SageWorks Configuration Incomplete...\")\n log.critical(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n\n # AWS Account Clamp\n aws_account_clamp = AWSAccountClamp()\n boto3_session = aws_account_clamp.boto3_session\n sm_session = aws_account_clamp.sagemaker_session()\n sm_client = aws_account_clamp.sagemaker_client()\n aws_region = aws_account_clamp.region\n\n # Setup Bucket Paths\n sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n data_sources_s3_path = f\"s3://{sageworks_bucket}/data-sources\"\n feature_sets_s3_path = f\"s3://{sageworks_bucket}/feature-sets\"\n models_s3_path = f\"s3://{sageworks_bucket}/models\"\n endpoints_s3_path = f\"s3://{sageworks_bucket}/endpoints\"\n\n # Delimiter for storing lists in AWS Tags\n tag_delimiter = \"::\"\n\n # Grab our Dataframe Storage\n df_cache = DFStore(path_prefix=\"/sageworks/dataframe_cache\")\n\n def __init__(self, uuid: str, use_cached_meta: bool = False):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n use_cached_meta (bool): Should we use cached metadata? (default: False)\n \"\"\"\n self.uuid = uuid\n if use_cached_meta:\n self.log.info(f\"Using Cached Metadata for {self.uuid}\")\n self.meta = CachedMeta()\n else:\n self.meta = CloudMeta()\n\n def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? 
(very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity, which is fine\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n\n @classmethod\n def is_name_valid(cls, name: str, delimiter: str = \"_\", lower_case: bool = True) -> bool:\n \"\"\"Check if the name adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n bool: True if the name is valid, False otherwise.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)\n if name != valid_name:\n cls.log.warning(f\"Artifact name: '{name}' is not valid. Convert it to something like: '{valid_name}'\")\n return False\n return True\n\n @staticmethod\n def generate_valid_name(name: str, delimiter: str = \"_\", lower_case: bool = True) -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string.\n\n Args:\n name (str): The name/id string to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? 
(default: True)\n\n Returns:\n str: A generated valid name/id.\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"])\n if lower_case:\n valid_name = valid_name.lower()\n\n # Replace with the chosen delimiter\n return valid_name.replace(\"_\", delimiter).replace(\"-\", delimiter)\n\n @abstractmethod\n def exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n\n def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources and Graphs, those classes need to override this method.\n \"\"\"\n return self.meta.get_aws_tags(self.arn())\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n\n @abstractmethod\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n\n def ready(self) -> bool:\n \"\"\"Is the Artifact ready? 
Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n\n @abstractmethod\n def onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n\n @abstractmethod\n def details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n\n @abstractmethod\n def size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n\n @abstractmethod\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n\n @abstractmethod\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n\n @abstractmethod\n def hash(self) -> str:\n \"\"\"Return the hash of this artifact, useful for content validation\"\"\"\n pass\n\n @abstractmethod\n def arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n\n @abstractmethod\n def aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n\n @abstractmethod\n def delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not 
for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Adding Tags to {self.uuid}:{str(new_meta)[:50]}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n try:\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n except Exception as e:\n self.log.error(f\"Error adding metadata to {aws_arn}: {e}\")\n\n def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n\n def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n\n def set_tags(self, 
tags):\n self.upsert_sageworks_meta({\"sageworks_tags\": self.tag_delimiter.join(tags)})\n\n def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n\n # Syntactic sugar for health tags\n def get_health_tags(self):\n return self.get_tags(tag_type=\"health\")\n\n def set_health_tags(self, tags):\n self.upsert_sageworks_meta({\"sageworks_health_tags\": self.tag_delimiter.join(tags)})\n\n def add_health_tag(self, tag):\n self.add_tag(tag, tag_type=\"health\")\n\n def remove_health_tag(self, tag):\n self.remove_sageworks_tag(tag, tag_type=\"health\")\n\n # Owner of this artifact\n def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n\n def set_owner(self, owner: str):\n \"\"\"Set the owner of this 
artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n\n def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n\n def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n\n def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n\n def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n # FIXME: Revisit AWS URL check\n # if \"unknown\" in self.aws_url():\n # health_issues.append(\"aws_url_unknown\")\n return health_issues\n\n def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. 
If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n\n def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n\n # If the artifact does not exist, return a message\n if not self.exists():\n return f\"{self.__class__.__name__}: {self.uuid} does not exist\"\n\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if 
tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__init__","title":"__init__(uuid, use_cached_meta=False)
","text":"Initialize the Artifact Base Class
Parameters:
uuid (str): The UUID of this artifact (required)
use_cached_meta (bool): Should we use cached metadata? (default: False)
Source code in src/sageworks/core/artifacts/artifact.py
def __init__(self, uuid: str, use_cached_meta: bool = False):\n \"\"\"Initialize the Artifact Base Class\n\n Args:\n uuid (str): The UUID of this artifact\n use_cached_meta (bool): Should we use cached metadata? (default: False)\n \"\"\"\n self.uuid = uuid\n if use_cached_meta:\n self.log.info(f\"Using Cached Metadata for {self.uuid}\")\n self.meta = CachedMeta()\n else:\n self.meta = CloudMeta()\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__post_init__","title":"__post_init__()
","text":"Artifact Post Initialization
Source code in src/sageworks/core/artifacts/artifact.py
def __post_init__(self):\n \"\"\"Artifact Post Initialization\"\"\"\n\n # Do I exist? (very metaphysical)\n if not self.exists():\n self.log.debug(f\"Artifact {self.uuid} does not exist\")\n return\n\n # Conduct a Health Check on this Artifact\n health_issues = self.health_check()\n if health_issues:\n if \"needs_onboard\" in health_issues:\n self.log.important(f\"Artifact {self.uuid} needs to be onboarded\")\n elif health_issues == [\"no_activity\"]:\n self.log.debug(f\"Artifact {self.uuid} has no activity, which is fine\")\n else:\n self.log.warning(f\"Health Check Failed {self.uuid}: {health_issues}\")\n for issue in health_issues:\n self.add_health_tag(issue)\n else:\n self.log.info(f\"Health Check Passed {self.uuid}\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.__repr__","title":"__repr__()
","text":"String representation of this artifact
Returns:
str: String representation of this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def __repr__(self) -> str:\n \"\"\"String representation of this artifact\n\n Returns:\n str: String representation of this artifact\n \"\"\"\n\n # If the artifact does not exist, return a message\n if not self.exists():\n return f\"{self.__class__.__name__}: {self.uuid} does not exist\"\n\n summary_dict = self.summary()\n display_keys = [\n \"aws_arn\",\n \"health_tags\",\n \"size\",\n \"created\",\n \"modified\",\n \"input\",\n \"sageworks_status\",\n \"sageworks_tags\",\n ]\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items() if key in display_keys]\n summary_str = f\"{self.__class__.__name__}: {self.uuid}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.add_tag","title":"add_tag(tag, tag_type='user')
","text":"Add a tag for this artifact, ensuring no duplicates and maintaining order. Args: tag (str): Tag to add for this artifact tag_type (str): Type of tag to add (user or health)
Source code in src/sageworks/core/artifacts/artifact.py
def add_tag(self, tag, tag_type=\"user\"):\n \"\"\"Add a tag for this artifact, ensuring no duplicates and maintaining order.\n Args:\n tag (str): Tag to add for this artifact\n tag_type (str): Type of tag to add (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag not in current_tags:\n current_tags.append(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n else:\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
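
The no-duplicates, order-preserving behaviour can be tried without touching AWS. The following is a minimal sketch that applies the same logic to a plain list; the function name add_tag_to_list and the sample tags are made up for illustration, and only the '::' delimiter is taken from the class.

```python
TAG_DELIMITER = '::'  # same delimiter the Artifact class uses


def add_tag_to_list(current_tags: list, tag: str) -> str:
    # Hypothetical helper: append only if the tag is new, so order is preserved
    if tag not in current_tags:
        current_tags.append(tag)
    # This joined string is what would be written back as metadata
    return TAG_DELIMITER.join(current_tags)


tags = ['abalone', 'regression']
print(add_tag_to_list(tags, 'public'))   # abalone::regression::public
print(add_tag_to_list(tags, 'abalone'))  # abalone::regression::public
```

Because all tags are stored as one delimited string, a tag that itself contains '::' would come back as two separate tags when the string is split again.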
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.arn","title":"arn()
abstractmethod
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef arn(self):\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_meta","title":"aws_meta()
abstractmethod
","text":"Get the full AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_meta(self) -> dict:\n \"\"\"Get the full AWS metadata for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.aws_url","title":"aws_url()
abstractmethod
","text":"AWS console/web interface for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef aws_url(self):\n \"\"\"AWS console/web interface for this artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.created","title":"created()
abstractmethod
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete","title":"delete()
abstractmethod
","text":"Delete this artifact including all related AWS objects
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef delete(self):\n \"\"\"Delete this artifact including all related AWS objects\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.delete_metadata","title":"delete_metadata(key_to_delete)
","text":"Delete specific metadata from this artifact Args: key_to_delete (str): Metadata key to delete
Source code in src/sageworks/core/artifacts/artifact.py
def delete_metadata(self, key_to_delete: str):\n \"\"\"Delete specific metadata from this artifact\n Args:\n key_to_delete (str): Metadata key to delete\n \"\"\"\n\n aws_arn = self.arn()\n self.log.important(f\"Deleting Metadata {key_to_delete} for Artifact: {aws_arn}...\")\n\n # First, fetch all the existing tags\n response = self.sm_session.list_tags(aws_arn)\n existing_tags = response.get(\"Tags\", [])\n\n # Convert existing AWS tags to a dictionary for easy manipulation\n existing_tags_dict = {item[\"Key\"]: item[\"Value\"] for item in existing_tags}\n\n # Identify tags to delete\n tag_list_to_delete = []\n for key in existing_tags_dict.keys():\n if key == key_to_delete or key.startswith(f\"{key_to_delete}_chunk_\"):\n tag_list_to_delete.append(key)\n\n # Delete the identified tags\n if tag_list_to_delete:\n self.sm_client.delete_tags(ResourceArn=aws_arn, TagKeys=tag_list_to_delete)\n else:\n self.log.info(f\"No Metadata found: {key_to_delete}...\")\n
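
The key-matching rule in delete_metadata (exact key, or the key followed by _chunk_) can be checked in isolation. This is a small sketch of just that filter; the function name matching_keys and the sample key names are assumptions for illustration, not actual SageWorks metadata keys.

```python
def matching_keys(existing_keys: list, key_to_delete: str) -> list:
    # Hypothetical helper mirroring the selection loop in delete_metadata
    chunk_prefix = f'{key_to_delete}_chunk_'
    return [k for k in existing_keys if k == key_to_delete or k.startswith(chunk_prefix)]


keys = [
    'sageworks_status',
    'example_details_chunk_1',
    'example_details_chunk_2',
    'example_details_extra',
]
print(matching_keys(keys, 'example_details'))
# ['example_details_chunk_1', 'example_details_chunk_2']
print(matching_keys(keys, 'sageworks_status'))
# ['sageworks_status']
```

A key such as example_details_extra is left alone because it matches neither the exact name nor the _chunk_ prefix.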
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.details","title":"details()
abstractmethod
","text":"Additional Details about this Artifact
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef details(self) -> dict:\n \"\"\"Additional Details about this Artifact\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.exists","title":"exists()
abstractmethod
","text":"Does the Artifact exist? Can we connect to it?
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef exists(self) -> bool:\n \"\"\"Does the Artifact exist? Can we connect to it?\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Artifact when it's ready Returns: list[str]: List of expected metadata keys
Source code in src/sageworks/core/artifacts/artifact.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Artifact when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n\n # If an artifact has additional expected metadata override this method\n return [\"sageworks_status\"]\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.generate_valid_name","title":"generate_valid_name(name, delimiter='_', lower_case=True)
staticmethod
","text":"Only allow letters and the specified delimiter, also lowercase the string.
Parameters:
name (str): The name/id string to check (required)
delimiter (str): The delimiter to use in the name/id string (default: \"_\")
lower_case (bool): Should the name be lowercased? (default: True)
Returns:
str: A generated valid name/id.
Source code in src/sageworks/core/artifacts/artifact.py
@staticmethod\ndef generate_valid_name(name: str, delimiter: str = \"_\", lower_case: bool = True) -> str:\n \"\"\"Only allow letters and the specified delimiter, also lowercase the string.\n\n Args:\n name (str): The name/id string to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n str: A generated valid name/id.\n \"\"\"\n valid_name = \"\".join(c for c in name if c.isalnum() or c in [\"_\", \"-\"])\n if lower_case:\n valid_name = valid_name.lower()\n\n # Replace with the chosen delimiter\n return valid_name.replace(\"_\", delimiter).replace(\"-\", delimiter)\n
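
Since this is a static method with no AWS dependencies, its logic can be run on its own. The sketch below copies the body into a module-level function so it works without a SageWorks install; the input strings are made up. Note that although the docstring mentions only letters, the isalnum() filter keeps digits as well.

```python
def generate_valid_name(name: str, delimiter: str = '_', lower_case: bool = True) -> str:
    # Keep alphanumeric characters plus '_' and '-'; drop everything else
    valid_name = ''.join(c for c in name if c.isalnum() or c in ['_', '-'])
    if lower_case:
        valid_name = valid_name.lower()
    # Normalize both separators to the chosen delimiter
    return valid_name.replace('_', delimiter).replace('-', delimiter)


print(generate_valid_name('My Model-v2'))                        # mymodel_v2
print(generate_valid_name('abalone_regression', delimiter='-'))  # abalone-regression
print(generate_valid_name('Test#1', lower_case=False))           # Test1
```

Spaces and punctuation are removed rather than replaced, so words separated only by spaces are run together.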
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_input","title":"get_input()
","text":"Get the input data for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_input(self) -> str:\n \"\"\"Get the input data for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_input\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_owner","title":"get_owner()
","text":"Get the owner of this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_owner(self) -> str:\n \"\"\"Get the owner of this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_owner\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_status","title":"get_status()
","text":"Get the status for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_status(self) -> str:\n \"\"\"Get the status for this artifact\"\"\"\n return self.sageworks_meta().get(\"sageworks_status\", \"unknown\")\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.get_tags","title":"get_tags(tag_type='user')
","text":"Get the tags for this artifact Args: tag_type (str): Type of tags to return (user or health) Returns: list[str]: List of tags for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def get_tags(self, tag_type=\"user\") -> list:\n \"\"\"Get the tags for this artifact\n Args:\n tag_type (str): Type of tags to return (user or health)\n Returns:\n list[str]: List of tags for this artifact\n \"\"\"\n if tag_type == \"user\":\n user_tags = self.sageworks_meta().get(\"sageworks_tags\")\n return user_tags.split(self.tag_delimiter) if user_tags else []\n\n # Grab our health tags\n health_tags = self.sageworks_meta().get(\"sageworks_health_tags\")\n\n # If we don't have health tags, create the storage and return an empty list\n if health_tags is None:\n self.log.important(f\"{self.uuid} creating sageworks_health_tags storage...\")\n self.upsert_sageworks_meta({\"sageworks_health_tags\": \"\"})\n return []\n\n # Otherwise, return the health tags\n return health_tags.split(self.tag_delimiter) if health_tags else []\n
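
The guard on the split (returning an empty list when the stored value is missing or empty) matters because splitting an empty string in Python yields a list with one empty element. A minimal sketch with a plain dict standing in for the metadata; the function name tags_from_meta and the sample values are assumptions for illustration.

```python
TAG_DELIMITER = '::'  # same delimiter the Artifact class uses


def tags_from_meta(meta: dict, key: str = 'sageworks_tags') -> list:
    # Hypothetical helper: read a delimited tag string out of a metadata dict
    raw = meta.get(key)
    return raw.split(TAG_DELIMITER) if raw else []


print(tags_from_meta({'sageworks_tags': 'abalone::regression'}))  # ['abalone', 'regression']
print(tags_from_meta({'sageworks_tags': ''}))                     # []
print(tags_from_meta({}))                                         # []
print(''.split(TAG_DELIMITER))                                    # [''] (what the guard avoids)
```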
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.hash","title":"hash()
abstractmethod
","text":"Return the hash of this artifact, useful for content validation
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef hash(self) -> str:\n \"\"\"Return the hash of this artifact, useful for content validation\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.health_check","title":"health_check()
","text":"Perform a health check on this artifact Returns: list[str]: List of health issues
Source code in src/sageworks/core/artifacts/artifact.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this artifact\n Returns:\n list[str]: List of health issues\n \"\"\"\n health_issues = []\n if not self.ready():\n return [\"needs_onboard\"]\n # FIXME: Revisit AWS URL check\n # if \"unknown\" in self.aws_url():\n # health_issues.append(\"aws_url_unknown\")\n return health_issues\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.is_name_valid","title":"is_name_valid(name, delimiter='_', lower_case=True)
classmethod
","text":"Check if the name adheres to the naming conventions for this Artifact.
Parameters:
name (str): The name/id to check (required)
delimiter (str): The delimiter to use in the name/id string (default: \"_\")
lower_case (bool): Should the name be lowercased? (default: True)
Returns:
bool: True if the name is valid, False otherwise.
Source code in src/sageworks/core/artifacts/artifact.py
@classmethod\ndef is_name_valid(cls, name: str, delimiter: str = \"_\", lower_case: bool = True) -> bool:\n \"\"\"Check if the name adheres to the naming conventions for this Artifact.\n\n Args:\n name (str): The name/id to check.\n delimiter (str): The delimiter to use in the name/id string (default: \"_\")\n lower_case (bool): Should the name be lowercased? (default: True)\n\n Returns:\n bool: True if the name is valid, False otherwise.\n \"\"\"\n valid_name = cls.generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)\n if name != valid_name:\n cls.log.warning(f\"Artifact name: '{name}' is not valid. Convert it to something like: '{valid_name}'\")\n return False\n return True\n
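
In other words, a name is valid exactly when normalizing it changes nothing. The sketch below shows that round-trip check as standalone functions, assuming the same normalization rules as generate_valid_name; logging is left out and the sample names are made up.

```python
def generate_valid_name(name, delimiter='_', lower_case=True):
    # Same normalization as Artifact.generate_valid_name, copied for self-containment
    cleaned = ''.join(c for c in name if c.isalnum() or c in ['_', '-'])
    if lower_case:
        cleaned = cleaned.lower()
    return cleaned.replace('_', delimiter).replace('-', delimiter)


def is_name_valid(name, delimiter='_', lower_case=True):
    # Valid means the name is already in its normalized form
    return name == generate_valid_name(name, delimiter=delimiter, lower_case=lower_case)


print(is_name_valid('abalone_regression'))                    # True
print(is_name_valid('abalone-regression'))                    # False
print(is_name_valid('abalone-regression', delimiter='-'))     # True
print(is_name_valid('Abalone_Regression', lower_case=False))  # True
```

The delimiter argument therefore decides which of the two separators counts as valid; the other one always fails the check.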
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.modified","title":"modified()
abstractmethod
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.onboard","title":"onboard()
abstractmethod
","text":"Onboard this Artifact into SageWorks Returns: bool: True if the Artifact was successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef onboard(self) -> bool:\n \"\"\"Onboard this Artifact into SageWorks\n Returns:\n bool: True if the Artifact was successfully onboarded, False otherwise\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.ready","title":"ready()
","text":"Is the Artifact ready? Is initial setup complete and expected metadata populated?
Source code in src/sageworks/core/artifacts/artifact.py
def ready(self) -> bool:\n \"\"\"Is the Artifact ready? Is initial setup complete and expected metadata populated?\"\"\"\n\n # If anything goes wrong, assume the artifact is not ready\n try:\n # Check for the expected metadata\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n ready = set(existing_meta.keys()).issuperset(expected_meta)\n if ready:\n return True\n else:\n self.log.info(\"Artifact is not ready!\")\n return False\n except Exception as e:\n self.log.error(f\"Artifact malformed: {e}\")\n return False\n
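
The readiness test is a set-superset check wrapped in a catch-all. A minimal sketch of that check on plain dicts follows; the function name has_expected_meta and the sample metadata are assumptions for illustration.

```python
def has_expected_meta(existing_meta, expected_keys) -> bool:
    # Hypothetical helper: mirror ready() by treating any failure as not ready
    try:
        return set(existing_meta.keys()).issuperset(expected_keys)
    except Exception:
        return False


expected = ['sageworks_status']
print(has_expected_meta({'sageworks_status': 'ready', 'sageworks_owner': 'me'}, expected))  # True
print(has_expected_meta({'sageworks_owner': 'me'}, expected))                               # False
print(has_expected_meta(None, expected))                                                    # False
```

The last call shows why the try/except is there: sageworks_meta() is annotated as possibly returning None, and calling keys() on None raises an AttributeError that ready() reports as a malformed artifact.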
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.refresh_meta","title":"refresh_meta()
abstractmethod
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_meta","title":"remove_sageworks_meta(key_to_remove)
","text":"Remove SageWorks specific metadata from this Artifact Args: key_to_remove (str): The metadata key to remove Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code in src/sageworks/core/artifacts/artifact.py
def remove_sageworks_meta(self, key_to_remove: str):\n \"\"\"Remove SageWorks specific metadata from this Artifact\n Args:\n key_to_remove (str): The metadata key to remove\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n aws_arn = self.arn()\n # Sanity check\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n self.log.info(f\"Removing SageWorks Metadata {key_to_remove} for Artifact: {aws_arn}...\")\n sagemaker_delete_tag(aws_arn, self.sm_session, key_to_remove)\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.remove_sageworks_tag","title":"remove_sageworks_tag(tag, tag_type='user')
","text":"Remove a tag from this artifact if it exists. Args: tag (str): Tag to remove from this artifact tag_type (str): Type of tag to remove (user or health)
Source code in src/sageworks/core/artifacts/artifact.py
def remove_sageworks_tag(self, tag, tag_type=\"user\"):\n \"\"\"Remove a tag from this artifact if it exists.\n Args:\n tag (str): Tag to remove from this artifact\n tag_type (str): Type of tag to remove (user or health)\n \"\"\"\n current_tags = self.get_tags(tag_type) if tag_type == \"user\" else self.get_health_tags()\n if tag in current_tags:\n current_tags.remove(tag)\n combined_tags = self.tag_delimiter.join(current_tags)\n if tag_type == \"user\":\n self.upsert_sageworks_meta({\"sageworks_tags\": combined_tags})\n elif tag_type == \"health\":\n self.upsert_sageworks_meta({\"sageworks_health_tags\": combined_tags})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Returns:
Union[dict, None]: Dictionary of SageWorks metadata for this Artifact
Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources and Graphs; those classes need to override this method.
Source code in src/sageworks/core/artifacts/artifact.py
def sageworks_meta(self) -> Union[dict, None]:\n \"\"\"Get the SageWorks specific metadata for this Artifact\n\n Returns:\n Union[dict, None]: Dictionary of SageWorks metadata for this Artifact\n\n Note: This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources and Graphs, those classes need to override this method.\n \"\"\"\n return self.meta.get_aws_tags(self.arn())\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_input","title":"set_input(input_data)
","text":"Set the input data for this artifact
Parameters:
input_data (str): Name of input data for this artifact (required)
Note: This breaks the official provenance of the artifact, so use with caution.
Source code in src/sageworks/core/artifacts/artifact.py
def set_input(self, input_data: str):\n \"\"\"Set the input data for this artifact\n\n Args:\n input_data (str): Name of input data for this artifact\n Note:\n This breaks the official provenance of the artifact, so use with caution.\n \"\"\"\n self.log.important(f\"{self.uuid}: Setting input to {input_data}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input_data})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_owner","title":"set_owner(owner)
","text":"Set the owner of this artifact
Parameters:
owner (str): Owner to set for this artifact (required)
Source code in src/sageworks/core/artifacts/artifact.py
def set_owner(self, owner: str):\n \"\"\"Set the owner of this artifact\n\n Args:\n owner (str): Owner to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_owner\": owner})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.set_status","title":"set_status(status)
","text":"Set the status for this artifact Args: status (str): Status to set for this artifact
Source code in src/sageworks/core/artifacts/artifact.py
def set_status(self, status: str):\n \"\"\"Set the status for this artifact\n Args:\n status (str): Status to set for this artifact\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_status\": status})\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.size","title":"size()
abstractmethod
","text":"Return the size of this artifact in MegaBytes
Source code in src/sageworks/core/artifacts/artifact.py
@abstractmethod\ndef size(self) -> float:\n \"\"\"Return the size of this artifact in MegaBytes\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.summary","title":"summary()
","text":"This is generic summary information for all Artifacts. If you want to get more detailed information, call the details() method which is implemented by the specific Artifact class
Source code in src/sageworks/core/artifacts/artifact.py
def summary(self) -> dict:\n \"\"\"This is generic summary information for all Artifacts. If you\n want to get more detailed information, call the details() method\n which is implemented by the specific Artifact class\"\"\"\n basic = {\n \"uuid\": self.uuid,\n \"health_tags\": self.get_health_tags(),\n \"aws_arn\": self.arn(),\n \"size\": self.size(),\n \"created\": self.created(),\n \"modified\": self.modified(),\n \"input\": self.get_input(),\n }\n # Combine the sageworks metadata with the basic metadata\n return {**basic, **self.sageworks_meta()}\n
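The last line of summary() merges two dicts by unpacking, so any key present in both takes its value from the SageWorks metadata. A minimal sketch of that behavior, using made-up keys and values (including a deliberately colliding key) rather than real SageWorks output:

```python
# Illustrative values only; a real summary is built from AWS metadata
basic = {'uuid': 'my_artifact', 'size': 1.5, 'input': 'unknown'}
sageworks_meta = {'sageworks_owner': 'test_user', 'input': 'overridden'}

# Dicts unpacked later take precedence when keys collide
summary = {**basic, **sageworks_meta}
print(summary)
```

Keys that appear in only one of the two dicts pass through unchanged.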
"},{"location":"core_classes/artifacts/artifact/#sageworks.core.artifacts.artifact.Artifact.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact Args: new_meta (dict): Dictionary of NEW metadata to add Note: This functionality will work for FeatureSets, Models, and Endpoints but not for DataSources. The DataSource class overrides this method.
Source code in src/sageworks/core/artifacts/artifact.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n Args:\n new_meta (dict): Dictionary of NEW metadata to add\n Note:\n This functionality will work for FeatureSets, Models, and Endpoints\n but not for DataSources. The DataSource class overrides this method.\n \"\"\"\n # Sanity check\n aws_arn = self.arn()\n if aws_arn is None:\n self.log.error(f\"ARN is None for {self.uuid}!\")\n return\n\n # Add the new metadata to the existing metadata\n self.log.info(f\"Adding Tags to {self.uuid}:{str(new_meta)[:50]}...\")\n aws_tags = dict_to_aws_tags(new_meta)\n try:\n self.sm_client.add_tags(ResourceArn=aws_arn, Tags=aws_tags)\n except Exception as e:\n self.log.error(f\"Error adding metadata to {aws_arn}: {e}\")\n
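The SageMaker add_tags call expects tags as a list of Key/Value pairs. The sketch below shows one way a metadata dict could be turned into that shape. Note that to_tag_list is a hypothetical helper written for illustration; it is not the SageWorks dict_to_aws_tags implementation, which may encode or split values differently.

```python
import json

def to_tag_list(meta: dict) -> list:
    # Hypothetical helper: NOT the SageWorks dict_to_aws_tags implementation
    tags = []
    for key, value in meta.items():
        # Tag values must be strings, so JSON-encode anything else
        if not isinstance(value, str):
            value = json.dumps(value)
        tags.append({'Key': key, 'Value': value})
    return tags

tags = to_tag_list({'sageworks_owner': 'test_user', 'sageworks_tags': ['a', 'b']})
print(tags)
```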
"},{"location":"core_classes/artifacts/athena_source/","title":"AthenaSource","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the DataSource API Class and voil\u00e0 it works the same.
AthenaSource: SageWorks Data Source accessible through Athena
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource","title":"AthenaSource
","text":" Bases: DataSourceAbstract
AthenaSource: SageWorks Data Source accessible through Athena
Common Usage:
my_data = AthenaSource(data_uuid, database=\"sageworks\")\nmy_data.summary()\nmy_data.details()\ndf = my_data.query(f\"select * from {data_uuid} limit 5\")\n
Source code in src/sageworks/core/artifacts/athena_source.py
class AthenaSource(DataSourceAbstract):\n \"\"\"AthenaSource: SageWorks Data Source accessible through Athena\n\n Common Usage:\n ```python\n my_data = AthenaSource(data_uuid, database=\"sageworks\")\n my_data.summary()\n my_data.details()\n df = my_data.query(f\"select * from {data_uuid} limit 5\")\n ```\n \"\"\"\n\n def __init__(self, data_uuid, database=\"sageworks\", **kwargs):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.is_name_valid(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database, **kwargs)\n\n # Grab our metadata from the Meta class\n self.log.info(f\"Retrieving metadata for: {self.uuid}...\")\n self.data_source_meta = self.meta.data_source(data_uuid, database=database)\n if self.data_source_meta is None:\n self.log.error(f\"Unable to find {database}:{self.table} in Glue Catalogs...\")\n return\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {database}.{self.table}\")\n\n def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n self.data_source_meta = self.meta.data_source(self.uuid, database=self.database)\n\n def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # Are we able to pull AWS Metadata for this table_name?\"\"\"\n # Do we have a valid data_source_meta?\n if getattr(self, \"data_source_meta\", None) is None:\n self.log.debug(f\"AthenaSource {self.table} not found in SageWorks Metadata...\")\n return False\n return True\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = 
f\"arn:aws:glue:{region}:{account_id}:table/{self.database}/{self.table}\"\n return arn\n\n def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n if self.data_source_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.table}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Get the SageWorks Metadata from the 'Parameters' section of the DataSource Metadata\n params = self.data_source_meta.get(\"Parameters\", {})\n return {key: decode_value(value) for key, value in params.items() if \"sageworks\" in key}\n\n def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n self.log.important(f\"Upserting SageWorks Metadata {self.uuid}:{str(new_meta)[:50]}...\")\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.table}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == 
\"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n else:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n except Exception as e:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto3_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n\n def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.data_source_meta\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.data_source_meta[\"CreateTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.data_source_meta[\"UpdateTime\"]\n\n def hash(self) -> str:\n \"\"\"Get the hash for the set of Parquet files used for this Artifact\"\"\"\n s3_uri = self.s3_storage_location()\n return compute_parquet_hash(s3_uri, self.boto3_session)\n\n def table_hash(self) -> str:\n \"\"\"Get the table hash for this AthenaSource\"\"\"\n s3_scratch = f\"s3://{self.sageworks_bucket}/temp/athena_output\"\n return compute_athena_table_hash(self.database, self.table, self.boto3_session, s3_scratch)\n\n def num_rows(self) -> int:\n 
\"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(f'select count(*) AS sageworks_count from \"{self.database}\".\"{self.table}\"')\n return count_df[\"sageworks_count\"][0] if count_df is not None else 0\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.columns)\n\n @property\n def columns(self) -> list[str]:\n \"\"\"Return the column names for this Athena Table\"\"\"\n return [item[\"Name\"] for item in self.data_source_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n @property\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the internal AthenaSource\"\"\"\n return [item[\"Type\"] for item in self.data_source_meta[\"StorageDescriptor\"][\"Columns\"]]\n\n def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n\n # Call internal class _query method\n return self.database_query(self.database, query)\n\n @classmethod\n def database_query(cls, database: str, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Specify the Database and Query the Athena Service\n\n Args:\n database (str): The Athena Database to query\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n cls.log.debug(f\"Executing Query: {query}...\")\n try:\n df = wr.athena.read_sql_query(\n sql=query,\n database=database,\n ctas_approach=False,\n boto3_session=cls.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n cls.log.debug(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n except wr.exceptions.QueryFailed as e:\n cls.log.critical(f\"Failed to execute query: {e}\")\n return None\n\n def execute_statement(self, query: str, silence_errors: bool = False):\n 
\"\"\"Execute a non-returning SQL statement in Athena with retries.\n\n Args:\n query (str): The query to run against the AthenaSource\n silence_errors (bool): Silence errors (default: False)\n \"\"\"\n attempt = 0\n max_retries = 3\n retry_delay = 10\n while attempt < max_retries:\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.database,\n boto3_session=self.boto3_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto3_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n break # If successful, exit the retry loop\n except wr.exceptions.QueryFailed as e:\n if \"AlreadyExistsException\" in str(e):\n self.log.warning(f\"Table already exists: {e} \\nIgnoring...\")\n break # No need to retry for this error\n elif \"ConcurrentModificationException\" in str(e):\n self.log.warning(f\"Concurrent modification detected: {e}\\nRetrying...\")\n attempt += 1\n if attempt < max_retries:\n time.sleep(retry_delay)\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement after {max_retries} attempts: {e}\")\n raise\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement: {e}\")\n raise\n\n def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.data_source_meta[\"StorageDescriptor\"][\"Location\"]\n\n def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f'select count(*) as sageworks_count from \"{self.table}\"'\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.database,\n ctas_approach=False,\n boto3_session=self.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: 
{scanned_bytes})\")\n\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict and not recompute:\n return stat_dict\n\n # Call the SQL function to compute descriptive stats\n stat_dict = sql.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n\n @cache_dataframe(\"sample\")\n def sample(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sql.sample_rows(self)\n\n @cache_dataframe(\"outliers\")\n def outliers(self, scale: float = 1.5, use_stddev=False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = sql.outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n\n @cache_dataframe(\"smart_sample\")\n def smart_sample(self, 
recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Args:\n recompute (bool): Recompute the smart sample (default: False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n\n # Compute/recompute the smart sample\n self.log.important(f\"Computing Smart Sample {self.uuid}...\")\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n\n # Return the smart_sample data\n return all_rows\n\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = sql.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): 
Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = sql.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = sql.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the 
details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n self.log.info(f\"Computing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL\n query = f'select * from \"{self.database}.{self.table}\" limit 10'\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.database, boto3_session=self.boto3_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Return the details data\n return details\n\n def delete(self):\n \"\"\"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an AthenaSource that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the AthenaSource\n AthenaSource.managed_delete(self.uuid, database=self.database)\n\n @classmethod\n def managed_delete(cls, data_source_name: str, database: str = \"sageworks\"):\n \"\"\"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects\n\n Args:\n data_source_name (str): Name of DataSource (AthenaSource)\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n table = data_source_name # The table name is the same as the data_source_name\n\n 
# Check if the Glue Catalog Table exists\n if not wr.catalog.does_table_exist(database, table, boto3_session=cls.boto3_session):\n cls.log.info(f\"DataSource {table} not found in database {database}.\")\n return\n\n # Delete any views associated with this AthenaSource\n cls.delete_views(table, database)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make an AWS Query to get the S3 storage location\n s3_path = wr.catalog.get_table_location(database, table, boto3_session=cls.boto3_session)\n\n # Delete Data Catalog Table\n cls.log.info(f\"Deleting DataCatalog Table: {database}.{table}...\")\n wr.catalog.delete_table_if_exists(database, table, boto3_session=cls.boto3_session)\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n cls.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=cls.boto3_session)\n except Exception as e:\n cls.log.error(f\"Failure when trying to delete {data_source_name}: {e}\")\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(data_source_name)\n\n @classmethod\n def delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
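The execute_statement method in the class source above retries only when it sees a ConcurrentModificationException and re-raises everything else. The loop shape can be sketched in isolation as follows. This is a simplified illustration: run_with_retries and flaky are hypothetical names, RuntimeError stands in for the awswrangler QueryFailed exception, and the AlreadyExistsException branch is left out.

```python
import time

def run_with_retries(action, max_retries: int = 3, retry_delay: float = 0.0):
    # Hypothetical helper mirroring the loop shape in execute_statement
    attempt = 0
    while attempt < max_retries:
        try:
            return action()
        except RuntimeError as e:
            if 'ConcurrentModificationException' not in str(e):
                raise  # Non-retryable errors propagate immediately
            attempt += 1
            if attempt >= max_retries:
                raise  # Out of attempts
            time.sleep(retry_delay)

calls = {'count': 0}

def flaky():
    # Fails twice with a retryable error, then succeeds
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('ConcurrentModificationException')
    return 'ok'

result = run_with_retries(flaky)
print(result, calls['count'])
```

The delay defaults to zero here so the sketch runs instantly; the method above waits 10 seconds between attempts.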
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_types","title":"column_types: list[str]
property
","text":"Return the column types of the internal AthenaSource
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.columns","title":"columns: list[str]
property
","text":"Return the column names for this Athena Table
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.__init__","title":"__init__(data_uuid, database='sageworks', **kwargs)
","text":"AthenaSource Initialization
Parameters:
data_uuid (str): Name of Athena Table (required)
database (str): Athena Database Name (default: 'sageworks')
Source code in src/sageworks/core/artifacts/athena_source.py
def __init__(self, data_uuid, database=\"sageworks\", **kwargs):\n \"\"\"AthenaSource Initialization\n\n Args:\n data_uuid (str): Name of Athena Table\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n # Ensure the data_uuid is a valid name/id\n self.is_name_valid(data_uuid)\n\n # Call superclass init\n super().__init__(data_uuid, database, **kwargs)\n\n # Grab our metadata from the Meta class\n self.log.info(f\"Retrieving metadata for: {self.uuid}...\")\n self.data_source_meta = self.meta.data_source(data_uuid, database=database)\n if self.data_source_meta is None:\n self.log.error(f\"Unable to find {database}:{self.table} in Glue Catalogs...\")\n return\n\n # Call superclass post init\n super().__post_init__()\n\n # All done\n self.log.debug(f\"AthenaSource Initialized: {database}.{self.table}\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n # Grab our SageWorks Role Manager, get our AWS account id, and region for ARN creation\n account_id = self.aws_account_clamp.account_id\n region = self.aws_account_clamp.region\n arn = f\"arn:aws:glue:{region}:{account_id}:table/{self.database}/{self.table}\"\n return arn\n
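The ARN built above follows the Glue table layout: region, account id, then table/database/table-name. A standalone sketch of the same string construction, where glue_table_arn is a hypothetical helper and the region, account id, and table name are placeholders:

```python
def glue_table_arn(region: str, account_id: str, database: str, table: str) -> str:
    # Same layout as the f-string in arn() above
    return f'arn:aws:glue:{region}:{account_id}:table/{database}/{table}'

# Placeholder region, account id, and table name
arn = glue_table_arn('us-west-2', '123456789012', 'sageworks', 'my_table')
print(arn)
```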
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.athena_test_query","title":"athena_test_query()
","text":"Validate that Athena Queries are working
Source code in src/sageworks/core/artifacts/athena_source.py
def athena_test_query(self):\n \"\"\"Validate that Athena Queries are working\"\"\"\n query = f'select count(*) as sageworks_count from \"{self.table}\"'\n df = wr.athena.read_sql_query(\n sql=query,\n database=self.database,\n ctas_approach=False,\n boto3_session=self.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n self.log.info(f\"Athena TEST Query successful (scanned bytes: {scanned_bytes})\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_meta","title":"aws_meta()
","text":"Get the FULL AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def aws_meta(self) -> dict:\n \"\"\"Get the FULL AWS metadata for this artifact\"\"\"\n return self.data_source_meta\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code in src/sageworks/core/artifacts/athena_source.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n sageworks_details = self.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in a DataSource
Parameters:
recompute (bool): Recompute the column stats (default: False)
Returns:
dict[dict]: A dictionary of stats for each column in this format: {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}, 'correlations': {...}}, ...}
NB: String columns will NOT have num_zeros, descriptive_stats, or correlation data
Source code in src/sageworks/core/artifacts/athena_source.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n\n Args:\n recompute (bool): Recompute the column stats (default: False)\n\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros, descriptive_stats or correlation data\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100,\n 'descriptive_stats': {...}, 'correlations': {...}},\n ...}\n \"\"\"\n\n # First check if we have already computed the column stats\n columns_stats_dict = self.sageworks_meta().get(\"sageworks_column_stats\")\n if columns_stats_dict and not recompute:\n return columns_stats_dict\n\n # Call the SQL function to compute column stats\n column_stats_dict = sql.column_stats(self, recompute=recompute)\n\n # Push the column stats data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_column_stats\": column_stats_dict})\n\n # Return the column stats data\n return column_stats_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.correlations","title":"correlations(recompute=False)
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
recompute (bool): Recompute the correlations (default: False)
Returns:
dict[dict]: A dictionary of correlations for each column in this format: {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/core/artifacts/athena_source.py
def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the correlations (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n\n # First check if we have already computed the correlations\n correlations_dict = self.sageworks_meta().get(\"sageworks_correlations\")\n if correlations_dict and not recompute:\n return correlations_dict\n\n # Call the SQL function to compute correlations\n correlations_dict = sql.correlations(self)\n\n # Push the correlation data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_correlations\": correlations_dict})\n\n # Return the correlation data\n return correlations_dict\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/athena_source.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.data_source_meta[\"CreateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.database_query","title":"database_query(database, query)
classmethod
","text":"Specify the Database and Query the Athena Service
Parameters:
database (str): The Athena Database to query (required)
query (str): The query to run against the AthenaSource (required)
Returns:
Union[DataFrame, None]: The results of the query, or None if the query failed
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef database_query(cls, database: str, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Specify the Database and Query the Athena Service\n\n Args:\n database (str): The Athena Database to query\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n cls.log.debug(f\"Executing Query: {query}...\")\n try:\n df = wr.athena.read_sql_query(\n sql=query,\n database=database,\n ctas_approach=False,\n boto3_session=cls.boto3_session,\n )\n scanned_bytes = df.query_metadata[\"Statistics\"][\"DataScannedInBytes\"]\n if scanned_bytes > 0:\n cls.log.debug(f\"Athena Query successful (scanned bytes: {scanned_bytes})\")\n return df\n except wr.exceptions.QueryFailed as e:\n cls.log.critical(f\"Failed to execute query: {e}\")\n return None\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete","title":"delete()
","text":"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects
Source code in src/sageworks/core/artifacts/athena_source.py
def delete(self):\n \"\"\"Instance Method: Delete the AWS Data Catalog Table and S3 Storage Objects\"\"\"\n\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an AthenaSource that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the AthenaSource\n AthenaSource.managed_delete(self.uuid, database=self.database)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.delete_views","title":"delete_views(table, database)
classmethod
","text":"Delete any views associated with this FeatureSet
Parameters:
Name | Type | Description | Default
table | str | Name of Athena Table | required
database | str | Athena Database Name | required
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the descriptive stats (default: False) | False
Returns:
Name | Type | Description
dict | dict | A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code in src/sageworks/core/artifacts/athena_source.py
def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the descriptive stats\n stat_dict = self.sageworks_meta().get(\"sageworks_descriptive_stats\")\n if stat_dict and not recompute:\n return stat_dict\n\n # Call the SQL function to compute descriptive stats\n stat_dict = sql.descriptive_stats(self)\n\n # Push the descriptive stat data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_descriptive_stats\": stat_dict})\n\n # Return the descriptive stats\n return stat_dict\n
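The cache-or-recompute pattern used by descriptive_stats (and by correlations and value_counts) can be sketched without AWS. This is a minimal illustration only: cached_stat, meta_store and fake_compute are hypothetical names, and a plain dict stands in for the Glue table parameters that SageWorks actually reads and writes.

```python
from typing import Callable


def cached_stat(meta: dict, key: str, compute: Callable[[], dict], recompute: bool = False) -> dict:
    # Return the stored value unless it is missing/empty or a recompute is forced
    cached = meta.get(key)
    if cached and not recompute:
        return cached

    # Compute the value and store it for the next caller
    value = compute()
    meta[key] = value
    return value


# Hypothetical in-memory stand-in for the DataSource metadata
meta_store = {}
calls = []


def fake_compute() -> dict:
    # Record each invocation so we can see when the cache was bypassed
    calls.append(1)
    return {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}}


first = cached_stat(meta_store, 'sageworks_descriptive_stats', fake_compute)
second = cached_stat(meta_store, 'sageworks_descriptive_stats', fake_compute)
third = cached_stat(meta_store, 'sageworks_descriptive_stats', fake_compute, recompute=True)
```

Because the check is a truthiness test, an empty cached dictionary is treated the same as a missing one and triggers a fresh computation.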
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.details","title":"details(recompute=False)
","text":"Additional Details about this AthenaSource Artifact
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the details (default: False) | False
Returns:
Name | Type | Description
dict | dict | A dictionary of details about this AthenaSource
Source code in src/sageworks/core/artifacts/athena_source.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this AthenaSource Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this AthenaSource\n \"\"\"\n self.log.info(f\"Computing DataSource Details ({self.uuid})...\")\n\n # Get the details from the base class\n details = super().details()\n\n # Compute additional details\n details[\"s3_storage_location\"] = self.s3_storage_location()\n details[\"storage_type\"] = \"athena\"\n\n # Compute our AWS URL (database and table are quoted as separate identifiers)\n query = f'select * from \"{self.database}\".\"{self.table}\" limit 10'\n query_exec_id = wr.athena.start_query_execution(\n sql=query, database=self.database, boto3_session=self.boto3_session\n )\n base_url = \"https://console.aws.amazon.com/athena/home\"\n details[\"aws_url\"] = f\"{base_url}?region={self.aws_region}#query/history/{query_exec_id}\"\n\n # Push the aws_url data into our DataSource Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"sageworks_details\": {\"aws_url\": details[\"aws_url\"]}})\n\n # Convert any datetime fields to ISO-8601 strings\n details = convert_all_to_iso8601(details)\n\n # Add the column stats\n details[\"column_stats\"] = self.column_stats()\n\n # Return the details data\n return details\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.execute_statement","title":"execute_statement(query, silence_errors=False)
","text":"Execute a non-returning SQL statement in Athena with retries.
Parameters:
Name | Type | Description | Default
query | str | The query to run against the AthenaSource | required
silence_errors | bool | Silence errors (default: False) | False
Source code in src/sageworks/core/artifacts/athena_source.py
def execute_statement(self, query: str, silence_errors: bool = False):\n \"\"\"Execute a non-returning SQL statement in Athena with retries.\n\n Args:\n query (str): The query to run against the AthenaSource\n silence_errors (bool): Silence errors (default: False)\n \"\"\"\n attempt = 0\n max_retries = 3\n retry_delay = 10\n while attempt < max_retries:\n try:\n # Start the query execution\n query_execution_id = wr.athena.start_query_execution(\n sql=query,\n database=self.database,\n boto3_session=self.boto3_session,\n )\n self.log.debug(f\"QueryExecutionId: {query_execution_id}\")\n\n # Wait for the query to complete\n wr.athena.wait_query(query_execution_id=query_execution_id, boto3_session=self.boto3_session)\n self.log.debug(f\"Statement executed successfully: {query_execution_id}\")\n break # If successful, exit the retry loop\n except wr.exceptions.QueryFailed as e:\n if \"AlreadyExistsException\" in str(e):\n self.log.warning(f\"Table already exists: {e} \\nIgnoring...\")\n break # No need to retry for this error\n elif \"ConcurrentModificationException\" in str(e):\n self.log.warning(f\"Concurrent modification detected: {e}\\nRetrying...\")\n attempt += 1\n if attempt < max_retries:\n time.sleep(retry_delay)\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement after {max_retries} attempts: {e}\")\n raise\n else:\n if not silence_errors:\n self.log.critical(f\"Failed to execute statement: {e}\")\n raise\n
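The retry policy above (stop on an already-exists error, retry on a concurrent modification, re-raise anything else) can be exercised locally with a stand-in exception. This is a rough sketch under stated assumptions: QueryFailed here is a hypothetical local class, not wr.exceptions.QueryFailed, and run_with_retries and flaky_statement are illustrative names.

```python
import time


class QueryFailed(Exception):
    '''Hypothetical stand-in for the Athena query failure exception.'''


def run_with_retries(run_statement, max_retries: int = 3, retry_delay: float = 0.0) -> int:
    # Returns the number of attempts used; re-raises when retries are exhausted
    attempt = 0
    while attempt < max_retries:
        try:
            run_statement()
            return attempt + 1
        except QueryFailed as e:
            if 'AlreadyExistsException' in str(e):
                return attempt + 1  # Benign: nothing to retry
            if 'ConcurrentModificationException' not in str(e):
                raise  # Unknown failure: do not retry
            attempt += 1
            if attempt >= max_retries:
                raise
            time.sleep(retry_delay)
    return attempt


# Simulated statement: two concurrent-modification failures, then success
outcomes = ['ConcurrentModificationException', 'ConcurrentModificationException', None]


def flaky_statement():
    error = outcomes.pop(0)
    if error:
        raise QueryFailed(error)


attempts_used = run_with_retries(flaky_statement)
```

The delay is set to zero here so the example runs instantly; the source method waits 10 seconds between attempts.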
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.exists","title":"exists()
","text":"Validation Checks for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def exists(self) -> bool:\n \"\"\"Validation Checks for this Data Source\"\"\"\n\n # Are we able to pull AWS Metadata for this table_name?\n # Do we have a valid data_source_meta?\n if getattr(self, \"data_source_meta\", None) is None:\n self.log.debug(f\"AthenaSource {self.table} not found in SageWorks Metadata...\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.hash","title":"hash()
","text":"Get the hash for the set of Parquet files used for this Artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def hash(self) -> str:\n \"\"\"Get the hash for the set of Parquet files used for this Artifact\"\"\"\n s3_uri = self.s3_storage_location()\n return compute_parquet_hash(s3_uri, self.boto3_session)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.managed_delete","title":"managed_delete(data_source_name, database='sageworks')
classmethod
","text":"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects
Parameters:
Name | Type | Description | Default
data_source_name | str | Name of DataSource (AthenaSource) | required
database | str | Athena Database Name (default: sageworks) | 'sageworks'
Source code in src/sageworks/core/artifacts/athena_source.py
@classmethod\ndef managed_delete(cls, data_source_name: str, database: str = \"sageworks\"):\n \"\"\"Class Method: Delete the AWS Data Catalog Table and S3 Storage Objects\n\n Args:\n data_source_name (str): Name of DataSource (AthenaSource)\n database (str): Athena Database Name (default: sageworks)\n \"\"\"\n table = data_source_name # The table name is the same as the data_source_name\n\n # Check if the Glue Catalog Table exists\n if not wr.catalog.does_table_exist(database, table, boto3_session=cls.boto3_session):\n cls.log.info(f\"DataSource {table} not found in database {database}.\")\n return\n\n # Delete any views associated with this AthenaSource\n cls.delete_views(table, database)\n\n # Delete S3 Storage Objects (if they exist)\n try:\n # Make an AWS Query to get the S3 storage location\n s3_path = wr.catalog.get_table_location(database, table, boto3_session=cls.boto3_session)\n\n # Delete Data Catalog Table\n cls.log.info(f\"Deleting DataCatalog Table: {database}.{table}...\")\n wr.catalog.delete_table_if_exists(database, table, boto3_session=cls.boto3_session)\n\n # Make sure we add the trailing slash\n s3_path = s3_path if s3_path.endswith(\"/\") else f\"{s3_path}/\"\n cls.log.info(f\"Deleting S3 Storage Objects: {s3_path}...\")\n wr.s3.delete_objects(s3_path, boto3_session=cls.boto3_session)\n except Exception as e:\n cls.log.error(f\"Failure when trying to delete {data_source_name}: {e}\")\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(data_source_name)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/athena_source.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.data_source_meta[\"UpdateTime\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_columns","title":"num_columns()
","text":"Return the number of columns for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n return len(self.columns)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.num_rows","title":"num_rows()
","text":"Return the number of rows for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n count_df = self.query(f'select count(*) AS sageworks_count from \"{self.database}\".\"{self.table}\"')\n return count_df[\"sageworks_count\"][0] if count_df is not None else 0\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.outliers","title":"outliers(scale=1.5, use_stddev=False)
","text":"Compute outliers for all the numeric columns in a DataSource
Parameters:
Name | Type | Description | Default
scale | float | The scale to use for the IQR (default: 1.5) | 1.5
use_stddev | bool | Use Standard Deviation instead of IQR (default: False) | False
Returns:
Type | Description
DataFrame | pd.DataFrame: A DataFrame of outliers from this DataSource
Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma). The scale parameter can be adjusted to change the IQR multiplier.
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"outliers\")\ndef outliers(self, scale: float = 1.5, use_stddev=False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n use_stddev (bool): Use Standard Deviation instead of IQR (default: False)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Compute outliers using the SQL Outliers class\n sql_outliers = sql.outliers.Outliers()\n return sql_outliers.compute_outliers(self, scale=scale, use_stddev=use_stddev)\n
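To make the relationship between the IQR scale and a sigma threshold concrete, here is a small standard-library sketch. It does not reproduce the SQL Outliers class; iqr_fences and sigma_equivalent are hypothetical helpers, and the sigma figure assumes normally distributed data. Under that assumption a scale of 1.5 works out to roughly 2.7 sigma (a little above the rounded 2.5 quoted in the notes) and 1.7 to roughly 3 sigma.

```python
from statistics import NormalDist, quantiles


def iqr_fences(values: list, scale: float = 1.5) -> tuple:
    # Quartiles via the standard library (default 'exclusive' method)
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr


def sigma_equivalent(scale: float) -> float:
    # For a standard normal, Q3 = inv_cdf(0.75) and IQR = 2 * Q3,
    # so the upper fence sits at Q3 + scale * IQR standard deviations
    q3 = NormalDist().inv_cdf(0.75)
    return q3 + scale * (2 * q3)


data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
low, high = iqr_fences(data)
flagged = [v for v in data if v < low or v > high]
```

Raising scale widens the fences and flags fewer rows; lowering it does the opposite.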
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.query","title":"query(query)
","text":"Query the AthenaSource
Parameters:
Name | Type | Description | Default
query | str | The query to run against the AthenaSource | required
Returns:
Type | Description
Union[DataFrame, None] | pd.DataFrame: The results of the query
Source code in src/sageworks/core/artifacts/athena_source.py
def query(self, query: str) -> Union[pd.DataFrame, None]:\n \"\"\"Query the AthenaSource\n\n Args:\n query (str): The query to run against the AthenaSource\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n\n # Delegate to the database_query class method using this source's database\n return self.database_query(self.database, query)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.refresh_meta","title":"refresh_meta()
","text":"Refresh our internal AWS Broker catalog metadata
Source code in src/sageworks/core/artifacts/athena_source.py
def refresh_meta(self):\n \"\"\"Refresh our internal AWS Broker catalog metadata\"\"\"\n self.data_source_meta = self.meta.data_source(self.uuid, database=self.database)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.s3_storage_location","title":"s3_storage_location()
","text":"Get the S3 Storage Location for this Data Source
Source code in src/sageworks/core/artifacts/athena_source.py
def s3_storage_location(self) -> str:\n \"\"\"Get the S3 Storage Location for this Data Source\"\"\"\n return self.data_source_meta[\"StorageDescriptor\"][\"Location\"]\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sageworks_meta","title":"sageworks_meta()
","text":"Get the SageWorks specific metadata for this Artifact
Source code in src/sageworks/core/artifacts/athena_source.py
def sageworks_meta(self) -> dict:\n \"\"\"Get the SageWorks specific metadata for this Artifact\"\"\"\n\n # Sanity Check if we have invalid AWS Metadata\n if self.data_source_meta is None:\n if not self.exists():\n self.log.error(f\"DataSource {self.uuid} doesn't appear to exist...\")\n else:\n self.log.critical(f\"Unable to get AWS Metadata for {self.table}\")\n self.log.critical(\"Malformed Artifact! Delete this Artifact and recreate it!\")\n return {}\n\n # Get the SageWorks Metadata from the 'Parameters' section of the DataSource Metadata\n params = self.data_source_meta.get(\"Parameters\", {})\n return {key: decode_value(value) for key, value in params.items() if \"sageworks\" in key}\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.sample","title":"sample()
","text":"Pull a sample of rows from the DataSource
Returns:
Type | Description
DataFrame | pd.DataFrame: A sample DataFrame for an Athena DataSource
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"sample\")\ndef sample(self) -> pd.DataFrame:\n \"\"\"Pull a sample of rows from the DataSource\n\n Returns:\n pd.DataFrame: A sample DataFrame for an Athena DataSource\n \"\"\"\n\n # Call the SQL function to pull a sample of the rows\n return sql.sample_rows(self)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/athena_source.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n size_in_bytes = sum(wr.s3.size_objects(self.s3_storage_location(), boto3_session=self.boto3_session).values())\n size_in_mb = size_in_bytes / 1_000_000\n return size_in_mb\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.smart_sample","title":"smart_sample(recompute=False)
","text":"Get a smart sample dataframe for this DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the smart sample (default: False) | False
Returns:
Type | Description
DataFrame | pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/athena_source.py
@cache_dataframe(\"smart_sample\")\ndef smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a smart sample dataframe for this DataSource\n\n Args:\n recompute (bool): Recompute the smart sample (default: False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n\n # Compute/recompute the smart sample\n self.log.important(f\"Computing Smart Sample {self.uuid}...\")\n\n # Outliers DataFrame\n outlier_rows = self.outliers()\n\n # Sample DataFrame\n sample_rows = self.sample()\n sample_rows[\"outlier_group\"] = \"sample\"\n\n # Combine the sample rows with the outlier rows\n all_rows = pd.concat([outlier_rows, sample_rows]).reset_index(drop=True)\n\n # Drop duplicates\n all_except_outlier_group = [col for col in all_rows.columns if col != \"outlier_group\"]\n all_rows = all_rows.drop_duplicates(subset=all_except_outlier_group, ignore_index=True)\n\n # Return the smart_sample data\n return all_rows\n
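The combine-then-deduplicate step can be illustrated without pandas. The sketch below is an assumption-laden analogue, not the library code: combine_rows is a hypothetical helper that works on lists of dicts, and it mimics concatenating outliers first and then dropping duplicates on every column except outlier_group, keeping the first occurrence.

```python
def combine_rows(outlier_rows: list, sample_rows: list) -> list:
    # Tag the plain sample rows, mirroring the 'outlier_group' column
    tagged = [dict(row, outlier_group='sample') for row in sample_rows]

    # Outliers come first, so a row present in both keeps its outlier label
    seen = set()
    combined = []
    for row in outlier_rows + tagged:
        # Identity of a row ignores the outlier_group label
        key = tuple(sorted((k, v) for k, v in row.items() if k != 'outlier_group'))
        if key not in seen:
            seen.add(key)
            combined.append(row)
    return combined


outliers = [{'id': 1, 'x': 99.0, 'outlier_group': 'x_high'}]
samples = [{'id': 1, 'x': 99.0}, {'id': 2, 'x': 1.0}]
result = combine_rows(outliers, samples)
```

Ordering matters: because outlier rows are listed before sample rows, a row that appears in both sets is reported once with its outlier label rather than as a plain sample.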
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.table_hash","title":"table_hash()
","text":"Get the table hash for this AthenaSource
Source code in src/sageworks/core/artifacts/athena_source.py
def table_hash(self) -> str:\n \"\"\"Get the table hash for this AthenaSource\"\"\"\n s3_scratch = f\"s3://{self.sageworks_bucket}/temp/athena_output\"\n return compute_athena_table_hash(self.database, self.table, self.boto3_session, s3_scratch)\n
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.upsert_sageworks_meta","title":"upsert_sageworks_meta(new_meta)
","text":"Add SageWorks specific metadata to this Artifact
Parameters:
Name | Type | Description | Default
new_meta | dict | Dictionary of new metadata to add | required
Source code in src/sageworks/core/artifacts/athena_source.py
def upsert_sageworks_meta(self, new_meta: dict):\n \"\"\"Add SageWorks specific metadata to this Artifact\n\n Args:\n new_meta (dict): Dictionary of new metadata to add\n \"\"\"\n self.log.important(f\"Upserting SageWorks Metadata {self.uuid}:{str(new_meta)[:50]}...\")\n\n # Give a warning message for keys that don't start with sageworks_\n for key in new_meta.keys():\n if not key.startswith(\"sageworks_\"):\n self.log.warning(\"Append 'sageworks_' to key names to avoid overwriting AWS meta data\")\n\n # Now convert any non-string values to JSON strings\n for key, value in new_meta.items():\n if not isinstance(value, str):\n new_meta[key] = json.dumps(value, cls=CustomEncoder)\n\n # Store our updated metadata\n try:\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n except botocore.exceptions.ClientError as e:\n error_code = e.response[\"Error\"][\"Code\"]\n if error_code == \"InvalidInputException\":\n self.log.error(f\"Unable to upsert metadata for {self.table}\")\n self.log.error(\"Probably because the metadata is too large\")\n self.log.error(new_meta)\n elif error_code == \"ConcurrentModificationException\":\n self.log.warning(\"ConcurrentModificationException... trying again...\")\n time.sleep(5)\n wr.catalog.upsert_table_parameters(\n parameters=new_meta,\n database=self.database,\n table=self.table,\n boto3_session=self.boto3_session,\n )\n else:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n except Exception as e:\n self.log.critical(f\"Failed to upsert metadata: {e}\")\n self.log.critical(f\"{self.uuid} is Malformed! Delete this Artifact and recreate it!\")\n
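Since table parameters are stored as strings, non-string values are serialized to JSON on the way in and decoded on the way out (see sageworks_meta). The round trip can be sketched as follows. encode_meta and decode_meta are hypothetical helpers: the real code uses a CustomEncoder and a decode_value function whose behavior is not shown here, so plain json calls are assumed instead.

```python
import json


def encode_meta(new_meta: dict) -> dict:
    # Build a new dict so the dictionary passed in by the caller is left unchanged
    return {k: v if isinstance(v, str) else json.dumps(v) for k, v in new_meta.items()}


def decode_meta(params: dict) -> dict:
    # Keep only SageWorks keys and parse JSON where possible
    decoded = {}
    for key, value in params.items():
        if 'sageworks' not in key:
            continue
        try:
            decoded[key] = json.loads(value)
        except json.JSONDecodeError:
            decoded[key] = value  # Plain string, not JSON
    return decoded


stored = encode_meta({'sageworks_value_counts': {'col1': {'a': 42}}, 'sageworks_owner': 'team-a'})
stored['classification'] = 'parquet'  # A non-SageWorks parameter that should be ignored
roundtrip = decode_meta(stored)
```

One difference from the method above: this sketch returns a new dictionary, whereas upsert_sageworks_meta rewrites the values of the dictionary it is given.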
"},{"location":"core_classes/artifacts/athena_source/#sageworks.core.artifacts.athena_source.AthenaSource.value_counts","title":"value_counts(recompute=False)
","text":"Compute 'value_counts' for all the string columns in a DataSource
Parameters:
Name | Type | Description | Default
recompute | bool | Recompute the value counts (default: False) | False
Returns:
Name | Type | Description
dict | dict | A dictionary of value counts for each column in the form {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...}, 'col2': ...}
Source code in src/sageworks/core/artifacts/athena_source.py
def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n\n Args:\n recompute (bool): Recompute the value counts (default: False)\n\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': 42, 'value_2': 16, 'value_3': 9,...},\n 'col2': ...}\n \"\"\"\n\n # First check if we have already computed the value counts\n value_counts_dict = self.sageworks_meta().get(\"sageworks_value_counts\")\n if value_counts_dict and not recompute:\n return value_counts_dict\n\n # Call the SQL function to compute value_counts\n value_count_dict = sql.value_counts(self)\n\n # Push the value_count data into our DataSource Metadata\n self.upsert_sageworks_meta({\"sageworks_value_counts\": value_count_dict})\n\n # Return the value_count data\n return value_count_dict\n
"},{"location":"core_classes/artifacts/data_source_abstract/","title":"DataSource Abstract","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through, so just call the method on the DataSource API Class and voil\u00e0, it works the same.
The DataSource Abstract class is a base/abstract class that defines the API implemented by all the child classes (currently just AthenaSource, but later RDSSource, FutureThing).
DataSourceAbstract: Abstract Base Class for all data sources (S3: CSV, JSONL, Parquet, RDS, etc)
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract","title":"DataSourceAbstract
","text":" Bases: Artifact
src/sageworks/core/artifacts/data_source_abstract.py
class DataSourceAbstract(Artifact):\n def __init__(self, data_uuid: str, database: str = \"sageworks\", **kwargs):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, **kwargs)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n\n def __post_init__(self):\n # Call superclass post_init\n super().__post_init__()\n\n @deprecated(version=\"0.9\")\n def get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n @property\n def database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n\n @property\n def table(self) -> str:\n \"\"\"Get the base table name for this Data Source\"\"\"\n return self._table_name\n\n @abstractmethod\n def num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n\n @abstractmethod\n def num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n\n @property\n @abstractmethod\n def columns(self) -> list[str]:\n \"\"\"Return the column names for this Data Source\"\"\"\n pass\n\n @property\n @abstractmethod\n def column_types(self) -> list[str]:\n \"\"\"Return the column types for this Data Source\"\"\"\n pass\n\n def column_details(self) -> dict:\n \"\"\"Return the column details for this Data Source\n\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n\n def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self)\n\n def view(self, view_name: str) -> \"View\":\n \"\"\"Return a DataFrame for a specific view\n Args:\n view_name (str): The name of the view 
to return\n Returns:\n pd.DataFrame: A DataFrame for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n\n def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n\n def set_computation_columns(self, computation_columns: list[str], recompute_stats: bool = True):\n \"\"\"Set the computation columns for this Data Source\n\n Args:\n computation_columns (list[str]): The computation columns for this Data Source\n recompute_stats (bool): Recomputes all the stats for this Data Source (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n if recompute_stats:\n self.recompute_stats()\n\n def _create_display_view(self):\n \"\"\"Internal: Create the Display View for this DataSource\"\"\"\n from sageworks.core.views import View\n\n View(self, \"display\")\n\n @abstractmethod\n def query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n\n @abstractmethod\n def execute_statement(self, query: str):\n \"\"\"Execute an SQL statement that doesn't return a 
result\n Args:\n query(str): The SQL statement to execute\n \"\"\"\n pass\n\n @abstractmethod\n def sample(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n\n @abstractmethod\n def descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def outliers(self, scale: float = 1.5) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n\n @abstractmethod\n def smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n\n @abstractmethod\n def value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n\n @abstractmethod\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n 
dict(dict): A dictionary of stats for each column in this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n\n @abstractmethod\n def correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the correlations (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n pass\n\n def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"column_details\"] = self.column_details()\n return details\n\n def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n # FIXME: Revisit this\n # \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n\n def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # If we don't have a smart_sample we're probably not ready\n if not self.df_cache.check(f\"{self.uuid}/smart_sample\"):\n self.log.warning(f\"DataSource {self.uuid} not ready...\")\n return False\n\n # Okay so we have sample, outliers, and smart_sample so we are ready\n return True\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source 
(make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Make sure our display view actually exists\n self.view(\"display\").ensure_exists()\n\n # Recompute the stats\n self.recompute_stats()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the data source\n\n Returns:\n bool: True if the DataSource stats were recomputed successfully\n \"\"\"\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n\n # Make sure our computation view actually exists\n self.view(\"computation\").ensure_exists()\n\n # Compute the sample, column stats, outliers, and smart_sample\n self.df_cache.delete(f\"{self.uuid}/sample\")\n self.sample()\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.df_cache.delete(f\"{self.uuid}/outliers\")\n self.outliers()\n self.df_cache.delete(f\"{self.uuid}/smart_sample\")\n self.smart_sample()\n return True\n
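The split between abstract members and concrete helpers in this class can be seen in a stripped-down form. The classes below are hypothetical illustrations (MiniDataSource and InMemorySource do not exist in SageWorks): a child only supplies columns and column_types, and the inherited column_details zips them together exactly as the method above does.

```python
from abc import ABC, abstractmethod


class MiniDataSource(ABC):
    # Hypothetical, trimmed-down analogue of DataSourceAbstract

    @property
    @abstractmethod
    def columns(self) -> list:
        ...

    @property
    @abstractmethod
    def column_types(self) -> list:
        ...

    def column_details(self) -> dict:
        # Concrete helper built purely on the abstract properties
        return dict(zip(self.columns, self.column_types))


class InMemorySource(MiniDataSource):
    # Hypothetical child class backed by a simple name-to-type mapping

    def __init__(self, schema: dict):
        self._schema = schema

    @property
    def columns(self) -> list:
        return list(self._schema.keys())

    @property
    def column_types(self) -> list:
        return list(self._schema.values())


source = InMemorySource({'id': 'bigint', 'name': 'string'})
details = source.column_details()
```

Instantiating MiniDataSource directly raises a TypeError, which is how the abstract base enforces that every child provides the full interface.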
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_types","title":"column_types: list[str]
abstractmethod
property
","text":"Return the column types for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.columns","title":"columns: list[str]
abstractmethod
property
","text":"Return the column names for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.database","title":"database: str
property
","text":"Get the database for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.table","title":"table: str
property
","text":"Get the base table name for this Data Source
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.__init__","title":"__init__(data_uuid, database='sageworks', **kwargs)
","text":"DataSourceAbstract: Abstract Base Class for all data sources Args: data_uuid(str): The UUID for this Data Source database(str): The database to use for this Data Source (default: sageworks)
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def __init__(self, data_uuid: str, database: str = \"sageworks\", **kwargs):\n \"\"\"DataSourceAbstract: Abstract Base Class for all data sources\n Args:\n data_uuid(str): The UUID for this Data Source\n database(str): The database to use for this Data Source (default: sageworks)\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, **kwargs)\n\n # Set up our instance attributes\n self._database = database\n self._table_name = data_uuid\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_details","title":"column_details()
","text":"Return the column details for this Data Source
Returns:
dict: The column details for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def column_details(self) -> dict:\n \"\"\"Return the column details for this Data Source\n\n Returns:\n dict: The column details for this Data Source\n \"\"\"\n return dict(zip(self.columns, self.column_types))\n
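The pairing done by column_details can be sketched outside the class. This is a minimal illustration only: the standalone function and the sample column names and types are hypothetical, not part of SageWorks.

```python
def column_details(columns, column_types):
    # Pair each column name with its type, mirroring dict(zip(...)) in the method above
    return dict(zip(columns, column_types))


# Hypothetical column names and types, for illustration only
details = column_details(['id', 'name', 'price'], ['bigint', 'string', 'double'])
print(details)

# zip() stops at the shorter input, so a missing type silently drops a column
short = column_details(['id', 'name'], ['bigint'])
print(short)
```

Because zip truncates to the shorter sequence, a subclass whose columns and column_types differ in length would lose entries without raising an error.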
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.column_stats","title":"column_stats(recompute=False)
abstractmethod
","text":"Compute Column Stats for all the columns in a DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column this format NB: String columns will NOT have num_zeros and descriptive stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in a DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.correlations","title":"correlations(recompute=False)
abstractmethod
","text":"Compute Correlations for all the numeric columns in a DataSource
Parameters:
recompute (bool): Recompute the correlations (default: False)
Returns:
dict: A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef correlations(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n\n Args:\n recompute (bool): Recompute the correlations (default: False)\n\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n pass\n
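To make the nested return format concrete, here is a small sketch that builds the same dict-of-dicts shape from plain Python lists. It assumes Pearson correlation and in-memory data; the abstract method does not say which coefficient a concrete DataSource uses, and the helper names below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    # Plain Pearson correlation coefficient for two equal-length sequences
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlations(data):
    # Build {'col1': {'col2': r, ...}, ...}, leaving out each column's self-correlation
    return {
        a: {b: round(pearson(data[a], data[b]), 3) for b in data if b != a}
        for a in data
    }


# Hypothetical numeric columns
data = {'col1': [1, 2, 3, 4], 'col2': [2, 4, 6, 8], 'col3': [4, 3, 2, 1]}
corr = correlations(data)
print(corr)
```

Each column maps to its correlation with every other column, so the result is symmetric: corr['col1']['col2'] equals corr['col2']['col1'].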
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.descriptive_stats","title":"descriptive_stats(recompute=False)
abstractmethod
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: recompute (bool): Recompute the descriptive stats (default: False) Returns: dict(dict): A dictionary of descriptive stats for each column in the form {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4}, 'col2': ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef descriptive_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default: False)\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in the form\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4},\n 'col2': ...}\n \"\"\"\n pass\n
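The five-number summary described in the docstring can be sketched with the standard library. This is an illustration under assumptions: the helper name and input layout are hypothetical, and the quartile interpolation method (inclusive here) is a choice made for the example, not something the abstract method specifies.

```python
from statistics import quantiles


def descriptive_stats(data):
    # Build {'col': {'min', 'q1', 'median', 'q3', 'max'}} for each numeric column
    stats = {}
    for name, values in data.items():
        q1, median, q3 = quantiles(values, n=4, method='inclusive')
        stats[name] = {
            'min': min(values),
            'q1': q1,
            'median': median,
            'q3': q3,
            'max': max(values),
        }
    return stats


# Hypothetical column matching the docstring example values
stats = descriptive_stats({'col1': [0, 1, 2, 3, 4]})
print(stats)
```

Different quartile conventions give slightly different q1 and q3 values on small samples, so results from a SQL engine may not match this sketch exactly.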
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.details","title":"details()
","text":"Additional Details about this DataSourceAbstract Artifact
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def details(self) -> dict:\n \"\"\"Additional Details about this DataSourceAbstract Artifact\"\"\"\n details = self.summary()\n details[\"num_rows\"] = self.num_rows()\n details[\"num_columns\"] = self.num_columns()\n details[\"column_details\"] = self.column_details()\n return details\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.execute_statement","title":"execute_statement(query)
abstractmethod
","text":"Execute an SQL statement that doesn't return a result Args: query(str): The SQL statement to execute
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef execute_statement(self, query: str):\n \"\"\"Execute an SQL statement that doesn't return a result\n Args:\n query(str): The SQL statement to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.expected_meta","title":"expected_meta()
","text":"DataSources have quite a bit of expected Metadata for EDA displays
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def expected_meta(self) -> list[str]:\n \"\"\"DataSources have quite a bit of expected Metadata for EDA displays\"\"\"\n\n # For DataSources, we expect to see the following metadata\n expected_meta = [\n # FIXME: Revisit this\n # \"sageworks_details\",\n \"sageworks_descriptive_stats\",\n \"sageworks_value_counts\",\n \"sageworks_correlations\",\n \"sageworks_column_stats\",\n ]\n return expected_meta\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.get_database","title":"get_database()
","text":"Get the database for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@deprecated(version=\"0.9\")\ndef get_database(self) -> str:\n \"\"\"Get the database for this Data Source\"\"\"\n return self._database\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_columns","title":"num_columns()
abstractmethod
","text":"Return the number of columns for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_columns(self) -> int:\n \"\"\"Return the number of columns for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.num_rows","title":"num_rows()
abstractmethod
","text":"Return the number of rows for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef num_rows(self) -> int:\n \"\"\"Return the number of rows for this Data Source\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the data source (make it ready)
Returns:
bool: True if the DataSource was onboarded successfully
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the data source (make it ready)\n\n Returns:\n bool: True if the DataSource was onboarded successfully\n \"\"\"\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Make sure our display view actually exists\n self.view(\"display\").ensure_exists()\n\n # Recompute the stats\n self.recompute_stats()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.outliers","title":"outliers(scale=1.5)
abstractmethod
","text":"Return a DataFrame of outliers from this DataSource
Parameters:
scale (float): The scale to use for the IQR (default: 1.5)
Returns:
pd.DataFrame: A DataFrame of outliers from this DataSource
Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers. The scale parameter can be adjusted to change the IQR multiplier.
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef outliers(self, scale: float = 1.5) -> pd.DataFrame:\n \"\"\"Return a DataFrame of outliers from this DataSource\n\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n pass\n
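The IQR rule named in the notes can be sketched on a plain list of numbers. This is a hedged illustration: the function name, sample values, and quartile method are assumptions, and a concrete DataSource is expected to return a DataFrame of rows rather than a list of values. For normally distributed data, fences at 1.5 times the IQR sit roughly 2.7 standard deviations from the mean, in the same range as the approximate figure quoted in the docstring.

```python
from statistics import quantiles


def iqr_outliers(values, scale=1.5):
    # Flag values outside [q1 - scale * IQR, q3 + scale * IQR]
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - scale * iqr, q3 + scale * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical column values with one obvious outlier
values = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(values))             # default multiplier of 1.5
print(iqr_outliers(values, scale=100))  # a very wide fence flags nothing
```

Raising the scale widens the fences and returns fewer outliers; lowering it does the opposite.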
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.query","title":"query(query)
abstractmethod
","text":"Query the DataSourceAbstract Args: query(str): The SQL query to execute
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef query(self, query: str) -> pd.DataFrame:\n \"\"\"Query the DataSourceAbstract\n Args:\n query(str): The SQL query to execute\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.ready","title":"ready()
","text":"Is the DataSource ready?
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def ready(self) -> bool:\n \"\"\"Is the DataSource ready?\"\"\"\n\n # Check if the Artifact is ready\n if not super().ready():\n return False\n\n # If we don't have a smart_sample we're probably not ready\n if not self.df_cache.check(f\"{self.uuid}/smart_sample\"):\n self.log.warning(f\"DataSource {self.uuid} not ready...\")\n return False\n\n # Okay so we have sample, outliers, and smart_sample so we are ready\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.recompute_stats","title":"recompute_stats()
","text":"This is a BLOCKING method that will recompute the stats for the data source
Returns:
bool: True if the DataSource stats were recomputed successfully
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the data source\n\n Returns:\n bool: True if the DataSource stats were recomputed successfully\n \"\"\"\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n\n # Make sure our computation view actually exists\n self.view(\"computation\").ensure_exists()\n\n # Compute the sample, column stats, outliers, and smart_sample\n self.df_cache.delete(f\"{self.uuid}/sample\")\n self.sample()\n self.column_stats(recompute=True)\n self.refresh_meta() # Refresh the meta since outliers needs descriptive_stats and value_counts\n self.df_cache.delete(f\"{self.uuid}/outliers\")\n self.outliers()\n self.df_cache.delete(f\"{self.uuid}/smart_sample\")\n self.smart_sample()\n return True\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.sample","title":"sample()
abstractmethod
","text":"Return a sample DataFrame from this DataSourceAbstract
Returns:
pd.DataFrame: A sample DataFrame from this DataSource
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef sample(self) -> pd.DataFrame:\n \"\"\"Return a sample DataFrame from this DataSourceAbstract\n\n Returns:\n pd.DataFrame: A sample DataFrame from this DataSource\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_computation_columns","title":"set_computation_columns(computation_columns, recompute_stats=True)
","text":"Set the computation columns for this Data Source
Parameters:
computation_columns (list[str]): The computation columns for this Data Source (required)
recompute_stats (bool): Recomputes all the stats for this Data Source (default: True)
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_computation_columns(self, computation_columns: list[str], recompute_stats: bool = True):\n \"\"\"Set the computation columns for this Data Source\n\n Args:\n computation_columns (list[str]): The computation columns for this Data Source\n recompute_stats (bool): Recomputes all the stats for this Data Source (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n if recompute_stats:\n self.recompute_stats()\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.set_display_columns","title":"set_display_columns(diplay_columns)
","text":"Set the display columns for this Data Source
Parameters:
diplay_columns (list[str]): The display columns for this Data Source (required)
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n
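The mismatch check at the top of set_display_columns is a simple list comprehension, sketched here in isolation. The function name and sample columns are hypothetical. As in the source above, a mismatch is only logged; the display view is still created.

```python
def find_mismatch(display_columns, computation_columns):
    # Display columns that are absent from the computation view, in their original order
    return [col for col in display_columns if col not in computation_columns]


# Hypothetical column lists
mismatch = find_mismatch(['id', 'name', 'notes'], ['id', 'name', 'price'])
print(mismatch)
```

An empty list means every display column is also available in the computation view.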
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.smart_sample","title":"smart_sample()
abstractmethod
","text":"Get a SMART sample dataframe from this DataSource Returns: pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef smart_sample(self) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this DataSource\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n pass\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.value_counts","title":"value_counts(recompute=False)
abstractmethod
","text":"Compute 'value_counts' for all the string columns in a DataSource Args: recompute (bool): Recompute the value counts (default: False) Returns: dict(dict): A dictionary of value counts for each column in the form {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...}, 'col2': ...}
Source code in src/sageworks/core/artifacts/data_source_abstract.py
@abstractmethod\ndef value_counts(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute 'value_counts' for all the string columns in a DataSource\n Args:\n recompute (bool): Recompute the value counts (default: False)\n Returns:\n dict(dict): A dictionary of value counts for each column in the form\n {'col1': {'value_1': X, 'value_2': Y, 'value_3': Z,...},\n 'col2': ...}\n \"\"\"\n pass\n
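The value_counts return shape can be illustrated with collections.Counter. This is a sketch with hypothetical in-memory data; a concrete DataSource would typically compute these counts with a query against its table.

```python
from collections import Counter


def value_counts(data):
    # Build {'col': {'value': count, ...}} for each string column
    return {name: dict(Counter(values)) for name, values in data.items()}


# Hypothetical string column
counts = value_counts({'color': ['red', 'blue', 'red', 'green', 'red']})
print(counts)
```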
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.view","title":"view(view_name)
","text":"Return a View object for a specific view Args: view_name (str): The name of the view to return Returns: View: The View object for the specified view
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def view(self, view_name: str) -> \"View\":\n \"\"\"Return a View object for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n View: The View object for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n
"},{"location":"core_classes/artifacts/data_source_abstract/#sageworks.core.artifacts.data_source_abstract.DataSourceAbstract.views","title":"views()
","text":"Return the views for this Data Source
Source code in src/sageworks/core/artifacts/data_source_abstract.py
def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self)\n
"},{"location":"core_classes/artifacts/endpoint_core/","title":"EndpointCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Endpoint API Class and voil\u00e0 it works the same.
EndpointCore: SageWorks EndpointCore Class
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore","title":"EndpointCore
","text":" Bases: Artifact
EndpointCore: SageWorks EndpointCore Class
Common Usage:\nmy_endpoint = EndpointCore(endpoint_uuid)\nprediction_df = my_endpoint.predict(test_df)\nmetrics = my_endpoint.regression_metrics(target_column, prediction_df)\nfor metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n
Source code in src/sageworks/core/artifacts/endpoint_core.py
class EndpointCore(Artifact):\n \"\"\"EndpointCore: SageWorks EndpointCore Class\n\n Common Usage:\n ```python\n my_endpoint = EndpointCore(endpoint_uuid)\n prediction_df = my_endpoint.predict(test_df)\n metrics = my_endpoint.regression_metrics(target_column, prediction_df)\n for metric, value in metrics.items():\n print(f\"{metric}: {value:0.3f}\")\n ```\n \"\"\"\n\n def __init__(self, endpoint_uuid, **kwargs):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n self.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid, **kwargs)\n\n # Grab an Cloud Metadata object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n\n # We temporary cache the endpoint metrics\n self.temp_storage = Cache(prefix=\"temp_storage\", expire=300) # 
5 minutes\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n def exists(self) -> bool:\n \"\"\"Does the endpoint exist in the AWS Metadata?\"\"\"\n if self.endpoint_meta is None:\n self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this endpoint\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # Does this endpoint have a config?\n # Note: This is not an authoritative check, so improve later\n if self.endpoint_meta.get(\"ProductionVariants\") is None:\n health_issues.append(\"no_config\")\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n\n def is_serverless(self) -> bool:\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in 
self.endpoint_meta[\"InstanceType\"]\n\n def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n\n def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n\n def hash(self) -> Optional[str]:\n \"\"\"Return the hash for the internal model used by this endpoint\n\n Returns:\n Optional[str]: The hash for the internal model used by this endpoint\n \"\"\"\n from sageworks.utils.endpoint_utils import get_model_data_url # Avoid circular import\n\n model_url = get_model_data_url(self.endpoint_config_name(), self.boto3_session)\n return get_s3_etag(model_url, self.boto3_session)\n\n def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return 
endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n\n def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Return the details\n return details\n\n def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. 
(default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.critical(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n\n def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # Sanity Check that we have a model\n model = ModelCore(self.get_input())\n if not model.exists():\n self.log.error(\"No model found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Now get the FeatureSet and make sure it exists\n fs = FeatureSetCore(model.get_input())\n if not fs.exists():\n self.log.error(\"No FeatureSet found for this endpoint. 
Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Grab the evaluation data from the FeatureSet\n table = fs.view(\"training\").table\n eval_df = fs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n capture_uuid = \"auto_inference\" if capture else None\n return self.inference(eval_df, capture_uuid, id_column=fs.id_column)\n\n def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n \"\"\"Run inference and compute performance metrics with optional capture\n\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n capture_uuid (str, optional): UUID of the inference capture (default=None)\n id_column (str, optional): Name of the ID column (default=None)\n\n Returns:\n pd.DataFrame: DataFrame with the inference results\n\n Note:\n If capture=True inference/performance metrics are written to S3 Endpoint Inference Folder\n \"\"\"\n\n # Run predictions on the evaluation data\n prediction_df = self._predict(eval_df)\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. 
Returning empty DataFrame.\")\n return prediction_df\n\n # Get the target column\n model = ModelCore(self.model_name)\n target_column = model.target()\n\n # Sanity Check that the target column is present\n if target_column and (target_column not in prediction_df.columns):\n self.log.important(f\"Target Column {target_column} not found in prediction_df!\")\n self.log.important(\"In order to compute metrics, the target column must be present!\")\n return prediction_df\n\n # Compute the standard performance metrics for this model\n model_type = model.model_type\n if model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n prediction_df = self.residuals(target_column, prediction_df)\n metrics = self.regression_metrics(target_column, prediction_df)\n elif model_type == ModelType.CLASSIFIER:\n metrics = self.classification_metrics(target_column, prediction_df)\n else:\n # For other model types, we don't compute metrics\n self.log.important(f\"Model Type: {model_type} doesn't have metrics...\")\n metrics = pd.DataFrame()\n\n # Print out the metrics\n if not metrics.empty:\n print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n print(metrics.head())\n\n # Capture the inference results and metrics\n if capture_uuid is not None:\n description = capture_uuid.replace(\"_\", \" \").title()\n self._capture_inference_results(\n capture_uuid, prediction_df, target_column, model_type, metrics, description, id_column\n )\n\n # Return the prediction DataFrame\n return prediction_df\n\n def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n Args:\n eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n Returns:\n pd.DataFrame: The DataFrame with predictions\n\n Note:\n There's no sanity checks or error handling... 
just FAST Inference!\n \"\"\"\n return fast_inference(self.uuid, eval_df, self.sm_session)\n\n def _predict(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Internal: Run prediction on the given observations in the given DataFrame\n Args:\n eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n Returns:\n pd.DataFrame: Return the DataFrame with additional columns, prediction and any _proba columns\n \"\"\"\n\n # Sanity check: Does the DataFrame have 0 rows?\n if eval_df.empty:\n self.log.warning(\"Evaluation DataFrame has 0 rows. No predictions to run.\")\n return pd.DataFrame(columns=eval_df.columns) # Return empty DataFrame with same structure\n\n # Sanity check: Does the Model have Features?\n features = ModelCore(self.model_name).features()\n if not features:\n self.log.warning(\"Model does not have features defined, using all columns in the DataFrame\")\n else:\n # Sanity check: Does the DataFrame have the required features?\n df_columns_lower = set(col.lower() for col in eval_df.columns)\n features_lower = set(feature.lower() for feature in features)\n\n # Check if the features are a subset of the DataFrame columns (case-insensitive)\n if not features_lower.issubset(df_columns_lower):\n missing_features = features_lower - df_columns_lower\n raise ValueError(f\"DataFrame does not contain required features: {missing_features}\")\n\n # Create our Endpoint Predictor Class\n predictor = Predictor(\n self.endpoint_name,\n sagemaker_session=self.sm_session,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n )\n\n # Now split up the dataframe into 500 row chunks, send those chunks to our\n # endpoint (with error handling) and stitch all the chunks back together\n df_list = []\n for index in range(0, len(eval_df), 500):\n self.log.info(\"Processing...\")\n\n # Compute partial DataFrames, add them to a list, and concatenate at the end\n partial_df = self._endpoint_error_handling(predictor, eval_df[index : index + 
500])\n df_list.append(partial_df)\n\n # Concatenate the dataframes\n combined_df = pd.concat(df_list, ignore_index=True)\n\n # Convert data to numeric\n # Note: Since we're using CSV serializers numeric columns often get changed to generic 'object' types\n\n # Hard Conversion\n # Note: We explicitly catch exceptions for columns that cannot be converted to numeric\n converted_df = combined_df.copy()\n for column in combined_df.columns:\n try:\n converted_df[column] = pd.to_numeric(combined_df[column])\n except ValueError:\n # If a ValueError is raised, the column cannot be converted to numeric, so we keep it as is\n pass\n\n # Soft Conversion\n # Convert columns to the best possible dtype that supports the pd.NA missing value.\n converted_df = converted_df.convert_dtypes()\n\n # Report on any rows that failed\n failed_rows = converted_df[converted_df.isna().any(axis=1)]\n if not failed_rows.empty:\n self.log.warning(f\"Rows that failed:\\n{failed_rows}\")\n\n # Convert pd.NA placeholders to pd.NA\n # Note: CSV serialization converts pd.NA to blank strings, so we have to put in placeholders\n converted_df.replace(\"__NA__\", pd.NA, inplace=True)\n\n # Return the Dataframe\n return converted_df\n\n def _endpoint_error_handling(self, predictor, feature_df):\n \"\"\"Internal: Handles errors, retries, and binary search for problematic rows.\"\"\"\n\n # Convert DataFrame into a CSV buffer\n csv_buffer = StringIO()\n feature_df.to_csv(csv_buffer, index=False)\n\n try:\n # Send CSV buffer to the predictor and process results\n results = predictor.predict(csv_buffer.getvalue())\n results_df = pd.DataFrame.from_records(results[1:], columns=results[0])\n self.endpoint_return_columns = results_df.columns.tolist()\n return results_df\n\n except botocore.exceptions.ClientError as err:\n error_code = err.response[\"Error\"][\"Code\"]\n\n if error_code == \"ModelNotReadyException\":\n self.log.error(f\"Error {error_code}: {err.response.get('Message', 'No message')}\")\n 
self.log.error(\"Model not ready. Sleeping and retrying...\")\n time.sleep(60)\n return self._endpoint_error_handling(predictor, feature_df)\n\n elif error_code == \"ModelError\":\n self.log.warning(\"Model error. Bisecting the DataFrame and retrying...\")\n\n # Base case: If there is only one row, we can't binary search further\n if len(feature_df) == 1:\n if not self.endpoint_return_columns:\n raise\n return self._error_df(feature_df, self.endpoint_return_columns)\n\n # Binary search to find the problematic row(s)\n mid_point = len(feature_df) // 2\n first_half = self._endpoint_error_handling(predictor, feature_df.iloc[:mid_point])\n second_half = self._endpoint_error_handling(predictor, feature_df.iloc[mid_point:])\n return pd.concat([first_half, second_half], ignore_index=True)\n\n else:\n # Unknown ClientError, raise the exception\n self.log.critical(f\"Unexpected ClientError: {err}\")\n raise\n\n except Exception as err:\n self.log.critical(f\"Unexpected general error: {err}\")\n raise\n\n def _error_df(self, df, all_columns):\n \"\"\"Internal: Method to construct an Error DataFrame (a Pandas DataFrame with one row of NaNs)\"\"\"\n # Create a new dataframe with all NaNs (np.nan, since the np.NaN alias was removed in NumPy 2.0)\n error_df = pd.DataFrame(dict(zip(all_columns, [[np.nan]] * len(all_columns))))\n # Now set the original values for the incoming dataframe\n for column in df.columns:\n error_df[column] = df[column].values\n return error_df\n\n def _capture_inference_results(\n self,\n capture_uuid: str,\n pred_results_df: pd.DataFrame,\n target_column: str,\n model_type: ModelType,\n metrics: pd.DataFrame,\n description: str,\n id_column: str = None,\n ):\n \"\"\"Internal: Capture the inference results and metrics to S3\n\n Args:\n capture_uuid (str): UUID of the inference capture\n pred_results_df (pd.DataFrame): DataFrame with the prediction results\n target_column (str): Name of the target column\n model_type (ModelType): Type of the model (e.g. 
REGRESSOR, CLASSIFIER)\n metrics (pd.DataFrame): DataFrame with the performance metrics\n description (str): Description of the inference results\n id_column (str, optional): Name of the ID column (default=None)\n \"\"\"\n\n # Compute a dataframe hash (just use the last 8)\n data_hash = joblib.hash(pred_results_df)[:8]\n\n # Metadata for the model inference\n inference_meta = {\n \"name\": capture_uuid,\n \"data_hash\": data_hash,\n \"num_rows\": len(pred_results_df),\n \"description\": description,\n }\n\n # Create the S3 Path for the Inference Capture\n inference_capture_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Write the metadata dictionary and metrics to our S3 Model Inference Folder\n wr.s3.to_json(\n pd.DataFrame([inference_meta]),\n f\"{inference_capture_path}/inference_meta.json\",\n index=False,\n )\n self.log.info(f\"Writing metrics to {inference_capture_path}/inference_metrics.csv\")\n wr.s3.to_csv(metrics, f\"{inference_capture_path}/inference_metrics.csv\", index=False)\n\n # Grab the target column, prediction column, any _proba columns, and the ID column (if present)\n prediction_col = \"prediction\" if \"prediction\" in pred_results_df.columns else \"predictions\"\n output_columns = [target_column, prediction_col]\n\n # Add any _proba columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.endswith(\"_proba\")]\n\n # Add any quantile columns to the output columns\n output_columns += [col for col in pred_results_df.columns if col.startswith(\"q_\") or col.startswith(\"qr_\")]\n\n # Add the ID column\n if id_column and id_column in pred_results_df.columns:\n output_columns.append(id_column)\n\n # Write the predictions to our S3 Model Inference Folder\n self.log.info(f\"Writing predictions to {inference_capture_path}/inference_predictions.csv\")\n subset_df = pred_results_df[output_columns]\n wr.s3.to_csv(subset_df, f\"{inference_capture_path}/inference_predictions.csv\", index=False)\n\n # 
CLASSIFIER: Write the confusion matrix to our S3 Model Inference Folder\n if model_type == ModelType.CLASSIFIER:\n conf_mtx = self.generate_confusion_matrix(target_column, pred_results_df)\n self.log.info(f\"Writing confusion matrix to {inference_capture_path}/inference_cm.csv\")\n # Note: Unlike other dataframes here, we want to write the index (labels) to the CSV\n wr.s3.to_csv(conf_mtx, f\"{inference_capture_path}/inference_cm.csv\", index=True)\n\n # Generate SHAP values for our Prediction Dataframe\n generate_shap_values(self.endpoint_name, model_type.value, pred_results_df, inference_capture_path)\n\n # Now recompute the details for our Model\n self.log.important(f\"Recomputing Details for {self.model_name} to show latest Inference Results...\")\n model = ModelCore(self.model_name)\n model._load_inference_metrics(capture_uuid)\n model.details(recompute=True)\n\n # Recompute the details so that inference model metrics are updated\n self.log.important(f\"Recomputing Details for {self.uuid} to show latest Inference Results...\")\n self.details(recompute=True)\n\n def regression_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Sanity Check the prediction DataFrame\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. 
Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n\n def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check for classification scenario\n if not pd.api.types.is_numeric_dtype(y_true) or not pd.api.types.is_numeric_dtype(y_pred):\n self.log.warning(\"Target and Prediction columns are not numeric. 
Computing 'diffs'...\")\n prediction_df[\"residuals\"] = (y_true != y_pred).astype(int)\n prediction_df[\"residuals_abs\"] = prediction_df[\"residuals\"]\n else:\n # Compute numeric residuals for regression\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n\n return prediction_df\n\n @staticmethod\n def validate_proba_columns(prediction_df: pd.DataFrame, class_labels: list, guessing: bool = False):\n \"\"\"Ensure probability columns are correctly aligned with class labels\n\n Args:\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n class_labels (list): List of class labels\n guessing (bool, optional): Whether we're guessing the class labels. Defaults to False.\n \"\"\"\n proba_columns = [col.replace(\"_proba\", \"\") for col in prediction_df.columns if col.endswith(\"_proba\")]\n\n if sorted(class_labels) != sorted(proba_columns):\n if guessing:\n raise ValueError(f\"_proba columns {proba_columns} != GUESSED class_labels {class_labels}!\")\n else:\n raise ValueError(f\"_proba columns {proba_columns} != class_labels {class_labels}!\")\n\n def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n # Get the class labels from the model\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n self.log.warning(\n \"Class labels not found in the model. 
Guessing class labels from the prediction DataFrame.\"\n )\n class_labels = prediction_df[target_column].unique().tolist()\n self.validate_proba_columns(prediction_df, class_labels, guessing=True)\n else:\n self.validate_proba_columns(prediction_df, class_labels)\n\n # Calculate precision, recall, fscore, and support, handling zero division\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column],\n prediction_df[prediction_col],\n average=None,\n labels=class_labels,\n zero_division=0,\n )\n\n # Identify the probability columns and keep them as a Pandas DataFrame\n proba_columns = [f\"{label}_proba\" for label in class_labels]\n y_score = prediction_df[proba_columns]\n\n # One-hot encode the true labels using all class labels (fit with class_labels)\n encoder = OneHotEncoder(categories=[class_labels], sparse_output=False)\n y_true = encoder.fit_transform(prediction_df[[target_column]])\n\n # Calculate ROC AUC per label and handle exceptions for missing classes\n roc_auc_per_label = []\n for i, label in enumerate(class_labels):\n try:\n roc_auc = roc_auc_score(y_true[:, i], y_score.iloc[:, i])\n except ValueError as e:\n self.log.warning(f\"ROC AUC calculation failed for label {label}.\")\n self.log.warning(f\"{str(e)}\")\n roc_auc = 0.0\n roc_auc_per_label.append(roc_auc)\n\n # Put the scores into a DataFrame\n score_df = pd.DataFrame(\n {\n target_column: class_labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc_per_label,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n\n def generate_confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n 
prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check if our model has class labels, if not we'll use the unique labels in the prediction\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n class_labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix (sklearn confusion_matrix)\n conf_mtx = confusion_matrix(y_true, y_pred, labels=class_labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=class_labels, columns=class_labels)\n conf_mtx_df.index.name = \"labels\"\n\n # Check if our model has class labels. If so make the index and columns ordered\n model_class_labels = ModelCore(self.model_name).class_labels()\n if model_class_labels:\n self.log.important(\"Reordering the confusion matrix based on model class labels...\")\n conf_mtx_df.index = pd.Categorical(conf_mtx_df.index, categories=model_class_labels, ordered=True)\n conf_mtx_df.columns = pd.Categorical(conf_mtx_df.columns, categories=model_class_labels, ordered=True)\n conf_mtx_df = conf_mtx_df.sort_index().sort_index(axis=1)\n return conf_mtx_df\n\n def endpoint_config_name(self) -> str:\n # Grab the Endpoint Config Name from the AWS\n details = self.sm_client.describe_endpoint(EndpointName=self.endpoint_name)\n return details[\"EndpointConfigName\"]\n\n def set_input(self, input: str, force=False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set. 
Defaults to False.\n Note:\n Manual override of the input is not allowed for Endpoints unless force=True\n \"\"\"\n if not force:\n self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete an Endpoint that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Endpoint\n EndpointCore.managed_delete(endpoint_name=self.uuid)\n\n @classmethod\n def managed_delete(cls, endpoint_name: str):\n \"\"\"Delete the Endpoint and associated resources if it exists\"\"\"\n\n # Check if the endpoint exists\n try:\n endpoint_info = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Endpoint {endpoint_name} not found!\")\n return\n raise # Re-raise unexpected errors\n\n # Delete underlying models (Endpoints store/use models internally)\n cls.delete_endpoint_models(endpoint_name)\n\n # Get Endpoint Config Name and delete if exists\n endpoint_config_name = endpoint_info[\"EndpointConfigName\"]\n try:\n cls.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n cls.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n except ClientError:\n cls.log.info(f\"Endpoint Config {endpoint_config_name} not found...\")\n\n # Delete any monitoring schedules associated with the endpoint\n monitoring_schedules = cls.sm_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n \"MonitoringScheduleSummaries\"\n ]\n for schedule in 
monitoring_schedules:\n cls.log.info(f\"Deleting Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n cls.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n # Delete related S3 artifacts (inference, data capture, monitoring)\n endpoint_inference_path = cls.endpoints_s3_path + \"/inference/\" + endpoint_name\n endpoint_data_capture_path = cls.endpoints_s3_path + \"/data_capture/\" + endpoint_name\n endpoint_monitoring_path = cls.endpoints_s3_path + \"/monitoring/\" + endpoint_name\n for s3_path in [endpoint_inference_path, endpoint_data_capture_path, endpoint_monitoring_path]:\n s3_path = f\"{s3_path.rstrip('/')}/\"\n objects = wr.s3.list_objects(s3_path, boto3_session=cls.boto3_session)\n if objects:\n cls.log.info(f\"Deleting S3 Objects at {s3_path}...\")\n wr.s3.delete_objects(objects, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(endpoint_name)\n\n # Delete the endpoint\n time.sleep(2) # Allow AWS to catch up\n try:\n cls.log.info(f\"Deleting Endpoint {endpoint_name}...\")\n cls.sm_client.delete_endpoint(EndpointName=endpoint_name)\n except ClientError as e:\n cls.log.error(\"Error deleting endpoint.\")\n raise e\n\n time.sleep(5) # Final sleep for AWS to fully register deletions\n\n @classmethod\n def delete_endpoint_models(cls, endpoint_name: str):\n \"\"\"Delete the underlying Model for an Endpoint\n\n Args:\n endpoint_name (str): The name of the endpoint to delete\n \"\"\"\n\n # Grab the Endpoint Config Name from AWS\n endpoint_config_name = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointConfigName\"]\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = cls.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n cls.log.info(f\"Endpoint Config 
{endpoint_config_name} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n cls.log.info(f\"Deleting Internal Model {model_name}...\")\n try:\n cls.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n cls.log.warning(f\"Model {model_name} is still in use...\")\n else:\n cls.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.__init__","title":"__init__(endpoint_uuid, **kwargs)
","text":"EndpointCore Initialization
Parameters:
Name Type Description Defaultendpoint_uuid
str
Name of Endpoint in SageWorks
required Source code insrc/sageworks/core/artifacts/endpoint_core.py
def __init__(self, endpoint_uuid, **kwargs):\n \"\"\"EndpointCore Initialization\n\n Args:\n endpoint_uuid (str): Name of Endpoint in SageWorks\n \"\"\"\n\n # Make sure the endpoint_uuid is a valid name\n self.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(endpoint_uuid, **kwargs)\n\n # Grab a Cloud Metadata object and pull information for Endpoints\n self.endpoint_name = endpoint_uuid\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n\n # Sanity check that we found the endpoint\n if self.endpoint_meta is None:\n self.log.important(f\"Could not find endpoint {self.uuid} within current visibility scope\")\n return\n\n # Sanity check the Endpoint state\n if self.endpoint_meta[\"EndpointStatus\"] == \"Failed\":\n self.log.critical(f\"Endpoint {self.uuid} is in a failed state\")\n reason = self.endpoint_meta[\"FailureReason\"]\n self.log.critical(f\"Failure Reason: {reason}\")\n self.log.critical(\"Please delete this endpoint and re-deploy...\")\n\n # Set the Inference, Capture, and Monitoring S3 Paths\n self.endpoint_inference_path = self.endpoints_s3_path + \"/inference/\" + self.uuid\n self.endpoint_data_capture_path = self.endpoints_s3_path + \"/data_capture/\" + self.uuid\n self.endpoint_monitoring_path = self.endpoints_s3_path + \"/monitoring/\" + self.uuid\n\n # Set the Model Name\n self.model_name = self.get_input()\n\n # This is for endpoint error handling later\n self.endpoint_return_columns = None\n\n # We temporarily cache the endpoint metrics\n self.temp_storage = Cache(prefix=\"temp_storage\", expire=300) # 5 minutes\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"EndpointCore Initialized: {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.add_data_capture","title":"add_data_capture()
","text":"Add data capture to the endpoint
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def add_data_capture(self):\n \"\"\"Add data capture to the endpoint\"\"\"\n self.get_monitor().add_data_capture()\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.endpoint_meta[\"EndpointArn\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.auto_inference","title":"auto_inference(capture=False)
","text":"Run inference on the endpoint using FeatureSet data
Parameters:
Name Type Description Defaultcapture
bool
Capture the inference results and metrics (default=False)
False
Source code in src/sageworks/core/artifacts/endpoint_core.py
def auto_inference(self, capture: bool = False) -> pd.DataFrame:\n \"\"\"Run inference on the endpoint using FeatureSet data\n\n Args:\n capture (bool, optional): Capture the inference results and metrics (default=False)\n \"\"\"\n\n # Sanity Check that we have a model\n model = ModelCore(self.get_input())\n if not model.exists():\n self.log.error(\"No model found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Now get the FeatureSet and make sure it exists\n fs = FeatureSetCore(model.get_input())\n if not fs.exists():\n self.log.error(\"No FeatureSet found for this endpoint. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Grab the evaluation data from the FeatureSet\n table = fs.view(\"training\").table\n eval_df = fs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n capture_uuid = \"auto_inference\" if capture else None\n return self.inference(eval_df, capture_uuid, id_column=fs.id_column)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.endpoint_meta\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this data source
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this data source\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.classification_metrics","title":"classification_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint
Parameters:
Name Type Description Defaulttarget_column
str
Name of the target column
requiredprediction_df
DataFrame
DataFrame with the prediction results
requiredReturns:
Type DescriptionDataFrame
pd.DataFrame: DataFrame with the performance metrics
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def classification_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n # Get the class labels from the model\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n self.log.warning(\n \"Class labels not found in the model. Guessing class labels from the prediction DataFrame.\"\n )\n class_labels = prediction_df[target_column].unique().tolist()\n self.validate_proba_columns(prediction_df, class_labels, guessing=True)\n else:\n self.validate_proba_columns(prediction_df, class_labels)\n\n # Calculate precision, recall, fscore, and support, handling zero division\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n scores = precision_recall_fscore_support(\n prediction_df[target_column],\n prediction_df[prediction_col],\n average=None,\n labels=class_labels,\n zero_division=0,\n )\n\n # Identify the probability columns and keep them as a Pandas DataFrame\n proba_columns = [f\"{label}_proba\" for label in class_labels]\n y_score = prediction_df[proba_columns]\n\n # One-hot encode the true labels using all class labels (fit with class_labels)\n encoder = OneHotEncoder(categories=[class_labels], sparse_output=False)\n y_true = encoder.fit_transform(prediction_df[[target_column]])\n\n # Calculate ROC AUC per label and handle exceptions for missing classes\n roc_auc_per_label = []\n for i, label in enumerate(class_labels):\n try:\n roc_auc = roc_auc_score(y_true[:, i], y_score.iloc[:, i])\n except ValueError as e:\n self.log.warning(f\"ROC AUC calculation failed for label {label}.\")\n self.log.warning(f\"{str(e)}\")\n roc_auc = 0.0\n roc_auc_per_label.append(roc_auc)\n\n # Put the scores into a DataFrame\n 
score_df = pd.DataFrame(\n {\n target_column: class_labels,\n \"precision\": scores[0],\n \"recall\": scores[1],\n \"fscore\": scores[2],\n \"roc_auc\": roc_auc_per_label,\n \"support\": scores[3],\n }\n )\n\n # Sort the target labels\n score_df = score_df.sort_values(by=[target_column], ascending=True)\n return score_df\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.endpoint_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete","title":"delete()
","text":"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint
Source code insrc/sageworks/core/artifacts/endpoint_core.py
def delete(self):\n \"\"\"Delete an existing Endpoint: Underlying Models, Configuration, and Endpoint\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete an Endpoint that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Endpoint\n EndpointCore.managed_delete(endpoint_name=self.uuid)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.delete_endpoint_models","title":"delete_endpoint_models(endpoint_name)
classmethod
","text":"Delete the underlying Model for an Endpoint
Parameters:
Name Type Description Defaultendpoint_name
str
The name of the endpoint to delete
required Source code insrc/sageworks/core/artifacts/endpoint_core.py
@classmethod\ndef delete_endpoint_models(cls, endpoint_name: str):\n \"\"\"Delete the underlying Model for an Endpoint\n\n Args:\n endpoint_name (str): The name of the endpoint to delete\n \"\"\"\n\n # Grab the Endpoint Config Name from AWS\n endpoint_config_name = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointConfigName\"]\n\n # Retrieve the Model Names from the Endpoint Config\n try:\n endpoint_config = cls.sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n except botocore.exceptions.ClientError:\n cls.log.info(f\"Endpoint Config {endpoint_config_name} doesn't exist...\")\n return\n model_names = [variant[\"ModelName\"] for variant in endpoint_config[\"ProductionVariants\"]]\n for model_name in model_names:\n cls.log.info(f\"Deleting Internal Model {model_name}...\")\n try:\n cls.sm_client.delete_model(ModelName=model_name)\n except botocore.exceptions.ClientError as error:\n error_code = error.response[\"Error\"][\"Code\"]\n error_message = error.response[\"Error\"][\"Message\"]\n if error_code == \"ResourceInUse\":\n cls.log.warning(f\"Model {model_name} is still in use...\")\n else:\n cls.log.warning(f\"Error: {error_code} - {error_message}\")\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.details","title":"details(recompute=False)
","text":"Additional Details about this Endpoint Args: recompute (bool): Recompute the details (default: False) Returns: dict(dict): A dictionary of details about this Endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def details(self, recompute: bool = False) -> dict:\n \"\"\"Additional Details about this Endpoint\n Args:\n recompute (bool): Recompute the details (default: False)\n Returns:\n dict(dict): A dictionary of details about this Endpoint\n \"\"\"\n\n # Fill in all the details about this Endpoint\n details = self.summary()\n\n # Get details from our AWS Metadata\n details[\"status\"] = self.endpoint_meta[\"EndpointStatus\"]\n details[\"instance\"] = self.endpoint_meta[\"InstanceType\"]\n try:\n details[\"instance_count\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"CurrentInstanceCount\"] or \"-\"\n except KeyError:\n details[\"instance_count\"] = \"-\"\n if \"ProductionVariants\" in self.endpoint_meta:\n details[\"variant\"] = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n else:\n details[\"variant\"] = \"-\"\n\n # Add endpoint metrics from CloudWatch\n details[\"endpoint_metrics\"] = self.endpoint_metrics()\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.endpoint_metrics","title":"endpoint_metrics()
","text":"Return the metrics for this endpoint
Returns:
Type Description
Union[DataFrame, None] pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)
Source code in src/sageworks/core/artifacts/endpoint_core.py
def endpoint_metrics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Return the metrics for this endpoint\n\n Returns:\n pd.DataFrame: DataFrame with the metrics for this endpoint (or None if no metrics)\n \"\"\"\n\n # Do we have it cached?\n metrics_key = f\"endpoint:{self.uuid}:endpoint_metrics\"\n endpoint_metrics = self.temp_storage.get(metrics_key)\n if endpoint_metrics is not None:\n return endpoint_metrics\n\n # We don't have it cached so let's get it from CloudWatch\n if \"ProductionVariants\" not in self.endpoint_meta:\n return None\n self.log.important(\"Updating endpoint metrics...\")\n variant = self.endpoint_meta[\"ProductionVariants\"][0][\"VariantName\"]\n endpoint_metrics = EndpointMetrics().get_metrics(self.uuid, variant=variant)\n self.temp_storage.set(metrics_key, endpoint_metrics)\n return endpoint_metrics\n
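The caching pattern above (check temporary storage, fall back to the expensive lookup, then store the result) can be sketched in isolation. This is only an illustration under stated assumptions: `TempStorage`, `cached_metrics`, and the `fetch_metrics` callable are hypothetical stand-ins, not SageWorks classes or functions.

```python
from typing import Any, Callable, Optional


class TempStorage:
    '''Hypothetical in-memory stand-in for the endpoint's temp_storage.'''

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


def cached_metrics(storage: TempStorage, uuid: str, fetch_metrics: Callable[[str], Any]) -> Any:
    '''Return cached metrics for uuid, fetching and caching them on a miss.'''
    metrics_key = f'endpoint:{uuid}:endpoint_metrics'
    metrics = storage.get(metrics_key)
    if metrics is not None:
        return metrics  # Cache hit: skip the expensive call
    metrics = fetch_metrics(uuid)  # Cache miss: do the expensive lookup
    storage.set(metrics_key, metrics)
    return metrics
```

As in the method above, a lookup that returns None is stored as None, so the next call is treated as a miss and fetches again.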
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/endpoint_core.py
def exists(self) -> bool:\n    \"\"\"Does the Endpoint exist in the AWS Metadata?\"\"\"\n    if self.endpoint_meta is None:\n        self.log.debug(f\"Endpoint {self.endpoint_name} not found in AWS Metadata\")\n        return False\n    return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.fast_inference","title":"fast_inference(eval_df)
","text":"Run inference on the Endpoint using the provided DataFrame
Parameters:
Name Type Description Default
eval_df DataFrame The DataFrame to run predictions on required
Returns:
Type Description
DataFrame pd.DataFrame: The DataFrame with predictions
Note: There are no sanity checks or error handling... just FAST Inference!
Source code in src/sageworks/core/artifacts/endpoint_core.py
def fast_inference(self, eval_df: pd.DataFrame) -> pd.DataFrame:\n    \"\"\"Run inference on the Endpoint using the provided DataFrame\n\n    Args:\n        eval_df (pd.DataFrame): The DataFrame to run predictions on\n\n    Returns:\n        pd.DataFrame: The DataFrame with predictions\n\n    Note:\n        There are no sanity checks or error handling... just FAST Inference!\n    \"\"\"\n    return fast_inference(self.uuid, eval_df, self.sm_session)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.generate_confusion_matrix","title":"generate_confusion_matrix(target_column, prediction_df)
","text":"Compute the confusion matrix for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the confusion matrix
Source code in src/sageworks/core/artifacts/endpoint_core.py
def generate_confusion_matrix(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the confusion matrix for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the confusion matrix\n \"\"\"\n\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check if our model has class labels, if not we'll use the unique labels in the prediction\n class_labels = ModelCore(self.model_name).class_labels()\n if class_labels is None:\n class_labels = sorted(list(set(y_true) | set(y_pred)))\n\n # Compute the confusion matrix (sklearn confusion_matrix)\n conf_mtx = confusion_matrix(y_true, y_pred, labels=class_labels)\n\n # Create a DataFrame\n conf_mtx_df = pd.DataFrame(conf_mtx, index=class_labels, columns=class_labels)\n conf_mtx_df.index.name = \"labels\"\n\n # Check if our model has class labels. If so make the index and columns ordered\n model_class_labels = ModelCore(self.model_name).class_labels()\n if model_class_labels:\n self.log.important(\"Reordering the confusion matrix based on model class labels...\")\n conf_mtx_df.index = pd.Categorical(conf_mtx_df.index, categories=model_class_labels, ordered=True)\n conf_mtx_df.columns = pd.Categorical(conf_mtx_df.columns, categories=model_class_labels, ordered=True)\n conf_mtx_df = conf_mtx_df.sort_index().sort_index(axis=1)\n return conf_mtx_df\n
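The label handling above (use the model's class labels when available, otherwise the sorted union of observed labels) can be illustrated without pandas or scikit-learn. The sketch below is a simplified stand-in, not the method's implementation; `confusion_counts` is a hypothetical helper that returns nested lists with rows for true labels and columns for predicted labels.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, class_labels=None):
    '''Build a confusion matrix as nested lists (rows = true, columns = predicted).'''
    if class_labels is None:
        # Fall back to the sorted union of labels seen in either column
        class_labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    # Pairs whose labels are not in class_labels are simply not counted
    matrix = [[pair_counts[(t, p)] for p in class_labels] for t in class_labels]
    return class_labels, matrix


labels, matrix = confusion_counts(
    ['low', 'high', 'low', 'mid'],
    ['low', 'low', 'low', 'mid'],
    class_labels=['low', 'mid', 'high'],
)
```

Passing an explicit label order, as the model class labels do above, fixes the row and column order regardless of how the labels sort alphabetically.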
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.get_monitor","title":"get_monitor()
","text":"Get the MonitorCore class for this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def get_monitor(self):\n \"\"\"Get the MonitorCore class for this endpoint\"\"\"\n from sageworks.core.artifacts.monitor_core import MonitorCore\n\n return MonitorCore(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.hash","title":"hash()
","text":"Return the hash for the internal model used by this endpoint
Returns:
Type Description
Optional[str] Optional[str]: The hash for the internal model used by this endpoint
Source code in src/sageworks/core/artifacts/endpoint_core.py
def hash(self) -> Optional[str]:\n \"\"\"Return the hash for the internal model used by this endpoint\n\n Returns:\n Optional[str]: The hash for the internal model used by this endpoint\n \"\"\"\n from sageworks.utils.endpoint_utils import get_model_data_url # Avoid circular import\n\n model_url = get_model_data_url(self.endpoint_config_name(), self.boto3_session)\n return get_s3_etag(model_url, self.boto3_session)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.health_check","title":"health_check()
","text":"Perform a health check on this model
Returns:
Type Description
list[str] list[str]: List of health issues
Source code in src/sageworks/core/artifacts/endpoint_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n if not self.ready():\n return [\"needs_onboard\"]\n\n # Call the base class health check\n health_issues = super().health_check()\n\n # Does this endpoint have a config?\n # Note: This is not an authoritative check, so improve later\n if self.endpoint_meta.get(\"ProductionVariants\") is None:\n health_issues.append(\"no_config\")\n\n # We're going to check for 5xx errors and no activity\n endpoint_metrics = self.endpoint_metrics()\n\n # Check if we have metrics\n if endpoint_metrics is None:\n health_issues.append(\"unknown_error\")\n return health_issues\n\n # Check for 5xx errors\n num_errors = endpoint_metrics[\"Invocation5XXErrors\"].sum()\n if num_errors > 5:\n health_issues.append(\"5xx_errors\")\n elif num_errors > 0:\n health_issues.append(\"5xx_errors_min\")\n else:\n self.remove_health_tag(\"5xx_errors\")\n self.remove_health_tag(\"5xx_errors_min\")\n\n # Check for Endpoint activity\n num_invocations = endpoint_metrics[\"Invocations\"].sum()\n if num_invocations == 0:\n health_issues.append(\"no_activity\")\n else:\n self.remove_health_tag(\"no_activity\")\n return health_issues\n
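The threshold logic in the health check above can be read as a small pure function. The sketch below mirrors only the 5xx and activity rules; `classify_endpoint_health` is a hypothetical name, and the real method also consults the base class, the endpoint config, and removes stale health tags.

```python
def classify_endpoint_health(num_errors: int, num_invocations: int) -> list:
    '''Map summed CloudWatch counts to health issue tags (illustrative only).'''
    issues = []
    if num_errors > 5:
        issues.append('5xx_errors')  # More than five server errors
    elif num_errors > 0:
        issues.append('5xx_errors_min')  # One to five server errors
    if num_invocations == 0:
        issues.append('no_activity')  # Nothing has called the endpoint
    return issues
```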
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.inference","title":"inference(eval_df, capture_uuid=None, id_column=None)
","text":"Run inference and compute performance metrics with optional capture
Parameters:
Name Type Description Default
eval_df DataFrame DataFrame to run predictions on (must have superset of features) required
capture_uuid str UUID of the inference capture (default=None) None
id_column str Name of the ID column (default=None) None
Returns:
Type Description
DataFrame pd.DataFrame: DataFrame with the inference results
Note: If capture_uuid is provided, inference/performance metrics are written to the S3 Endpoint Inference Folder
Source code in src/sageworks/core/artifacts/endpoint_core.py
def inference(self, eval_df: pd.DataFrame, capture_uuid: str = None, id_column: str = None) -> pd.DataFrame:\n    \"\"\"Run inference and compute performance metrics with optional capture\n\n    Args:\n        eval_df (pd.DataFrame): DataFrame to run predictions on (must have superset of features)\n        capture_uuid (str, optional): UUID of the inference capture (default=None)\n        id_column (str, optional): Name of the ID column (default=None)\n\n    Returns:\n        pd.DataFrame: DataFrame with the inference results\n\n    Note:\n        If capture_uuid is provided, inference/performance metrics are written to the S3 Endpoint Inference Folder\n    \"\"\"\n\n    # Run predictions on the evaluation data\n    prediction_df = self._predict(eval_df)\n    if prediction_df.empty:\n        self.log.warning(\"No predictions were made. Returning empty DataFrame.\")\n        return prediction_df\n\n    # Get the target column\n    model = ModelCore(self.model_name)\n    target_column = model.target()\n\n    # Sanity Check that the target column is present\n    if target_column and (target_column not in prediction_df.columns):\n        self.log.important(f\"Target Column {target_column} not found in prediction_df!\")\n        self.log.important(\"In order to compute metrics, the target column must be present!\")\n        return prediction_df\n\n    # Compute the standard performance metrics for this model\n    model_type = model.model_type\n    if model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n        prediction_df = self.residuals(target_column, prediction_df)\n        metrics = self.regression_metrics(target_column, prediction_df)\n    elif model_type == ModelType.CLASSIFIER:\n        metrics = self.classification_metrics(target_column, prediction_df)\n    else:\n        # For other model types, we don't compute metrics\n        self.log.important(f\"Model Type: {model_type} doesn't have metrics...\")\n        metrics = pd.DataFrame()\n\n    # Print out the metrics\n    if not metrics.empty:\n        print(f\"Performance Metrics for {self.model_name} on {self.uuid}\")\n        print(metrics.head())\n\n    # Capture the inference results and metrics\n    if capture_uuid is not None:\n        description = capture_uuid.replace(\"_\", \" \").title()\n        self._capture_inference_results(\n            capture_uuid, prediction_df, target_column, model_type, metrics, description, id_column\n        )\n\n    # Return the prediction DataFrame\n    return prediction_df\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.is_serverless","title":"is_serverless()
","text":"Check if the current endpoint is serverless.
Returns:
Name Type Description
bool bool True if the endpoint is serverless, False otherwise.
Source code in src/sageworks/core/artifacts/endpoint_core.py
def is_serverless(self) -> bool:\n \"\"\"Check if the current endpoint is serverless.\n\n Returns:\n bool: True if the endpoint is serverless, False otherwise.\n \"\"\"\n return \"Serverless\" in self.endpoint_meta[\"InstanceType\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.managed_delete","title":"managed_delete(endpoint_name)
classmethod
","text":"Delete the Endpoint and associated resources if it exists
Source code in src/sageworks/core/artifacts/endpoint_core.py
@classmethod\ndef managed_delete(cls, endpoint_name: str):\n    \"\"\"Delete the Endpoint and associated resources if it exists\"\"\"\n\n    # Check if the endpoint exists\n    try:\n        endpoint_info = cls.sm_client.describe_endpoint(EndpointName=endpoint_name)\n    except ClientError as e:\n        if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n            cls.log.info(f\"Endpoint {endpoint_name} not found!\")\n            return\n        raise  # Re-raise unexpected errors\n\n    # Delete underlying models (Endpoints store/use models internally)\n    cls.delete_endpoint_models(endpoint_name)\n\n    # Get Endpoint Config Name and delete if exists\n    endpoint_config_name = endpoint_info[\"EndpointConfigName\"]\n    try:\n        cls.log.info(f\"Deleting Endpoint Config {endpoint_config_name}...\")\n        cls.sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n    except ClientError:\n        cls.log.info(f\"Endpoint Config {endpoint_config_name} not found...\")\n\n    # Delete any monitoring schedules associated with the endpoint\n    monitoring_schedules = cls.sm_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n        \"MonitoringScheduleSummaries\"\n    ]\n    for schedule in monitoring_schedules:\n        cls.log.info(f\"Deleting Monitoring Schedule {schedule['MonitoringScheduleName']}...\")\n        cls.sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule[\"MonitoringScheduleName\"])\n\n    # Delete related S3 artifacts (inference, data capture, monitoring)\n    endpoint_inference_path = cls.endpoints_s3_path + \"/inference/\" + endpoint_name\n    endpoint_data_capture_path = cls.endpoints_s3_path + \"/data_capture/\" + endpoint_name\n    endpoint_monitoring_path = cls.endpoints_s3_path + \"/monitoring/\" + endpoint_name\n    for s3_path in [endpoint_inference_path, endpoint_data_capture_path, endpoint_monitoring_path]:\n        s3_path = f\"{s3_path.rstrip('/')}/\"\n        objects = wr.s3.list_objects(s3_path, boto3_session=cls.boto3_session)\n        if objects:\n            cls.log.info(f\"Deleting S3 Objects at {s3_path}...\")\n            wr.s3.delete_objects(objects, boto3_session=cls.boto3_session)\n\n    # Delete any dataframes that were stored in the Dataframe Cache\n    cls.log.info(\"Deleting Dataframe Cache...\")\n    cls.df_cache.delete_recursive(endpoint_name)\n\n    # Delete the endpoint\n    time.sleep(2)  # Allow AWS to catch up\n    try:\n        cls.log.info(f\"Deleting Endpoint {endpoint_name}...\")\n        cls.sm_client.delete_endpoint(EndpointName=endpoint_name)\n    except ClientError as e:\n        cls.log.error(\"Error deleting endpoint.\")\n        raise e\n\n    time.sleep(5)  # Final sleep for AWS to fully register deletions\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/endpoint_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n return self.endpoint_meta[\"LastModifiedTime\"]\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard","title":"onboard(interactive=False)
","text":"This is a BLOCKING method that will onboard the Endpoint (make it ready) Args: interactive (bool, optional): If True, will prompt the user for information. (default: False) Returns: bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard(self, interactive: bool = False) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the Endpoint (make it ready)\n Args:\n interactive (bool, optional): If True, will prompt the user for information. (default: False)\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n\n # Make sure our input is defined\n if self.get_input() == \"unknown\":\n if interactive:\n input_model = input(\"Input Model?: \")\n else:\n self.log.critical(\"Input Model is not defined!\")\n return False\n else:\n input_model = self.get_input()\n\n # Now that we have the details, let's onboard the Endpoint with args\n return self.onboard_with_args(input_model)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.onboard_with_args","title":"onboard_with_args(input_model)
","text":"Onboard the Endpoint with the given arguments
Parameters:
Name Type Description Default
input_model str The input model for this endpoint required
Returns:
bool: True if the Endpoint is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/endpoint_core.py
def onboard_with_args(self, input_model: str) -> bool:\n \"\"\"Onboard the Endpoint with the given arguments\n\n Args:\n input_model (str): The input model for this endpoint\n Returns:\n bool: True if the Endpoint is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n self.upsert_sageworks_meta({\"sageworks_input\": input_model})\n self.model_name = input_model\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/endpoint_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.endpoint_meta = self.meta.endpoint(self.endpoint_name)\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.regression_metrics","title":"regression_metrics(target_column, prediction_df)
","text":"Compute the performance metrics for this Endpoint Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with the performance metrics
Source code in src/sageworks/core/artifacts/endpoint_core.py
def regression_metrics(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Compute the performance metrics for this Endpoint\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with the performance metrics\n \"\"\"\n\n # Sanity Check the prediction DataFrame\n if prediction_df.empty:\n self.log.warning(\"No predictions were made. Returning empty DataFrame.\")\n return pd.DataFrame()\n\n # Compute the metrics\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n mae = mean_absolute_error(y_true, y_pred)\n rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n r2 = r2_score(y_true, y_pred)\n # Mean Absolute Percentage Error\n mape = np.mean(np.where(y_true != 0, np.abs((y_true - y_pred) / y_true), np.abs(y_true - y_pred))) * 100\n # Median Absolute Error\n medae = median_absolute_error(y_true, y_pred)\n\n # Organize and return the metrics\n metrics = {\n \"MAE\": round(mae, 3),\n \"RMSE\": round(rmse, 3),\n \"R2\": round(r2, 3),\n \"MAPE\": round(mape, 3),\n \"MedAE\": round(medae, 3),\n \"NumRows\": len(prediction_df),\n }\n return pd.DataFrame.from_records([metrics])\n
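To make the metric definitions above concrete, here is a standard-library sketch of the same quantities. It is an approximation for illustration, not the method itself: `regression_metrics_sketch` is a hypothetical name, it takes plain lists instead of a DataFrame, and returning NaN for R2 when the targets have zero variance is an assumption of this sketch. Note the MAPE fallback: where the true value is zero, the absolute error is used in place of the relative error.

```python
import math
import statistics


def regression_metrics_sketch(y_true, y_pred):
    '''Compute MAE, RMSE, R2, MAPE, and MedAE from two equal-length lists.'''
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]

    mae = statistics.mean(abs_errors)
    rmse = math.sqrt(statistics.mean([e * e for e in errors]))

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = statistics.mean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else float('nan')  # Assumption: NaN for constant targets

    # MAPE with the zero-target fallback to absolute error
    pct = [abs(e) / abs(t) if t != 0 else abs(e) for t, e in zip(y_true, errors)]
    mape = statistics.mean(pct) * 100

    medae = statistics.median(abs_errors)
    return {
        'MAE': round(mae, 3),
        'RMSE': round(rmse, 3),
        'R2': round(r2, 3),
        'MAPE': round(mape, 3),
        'MedAE': round(medae, 3),
        'NumRows': len(y_true),
    }


metrics = regression_metrics_sketch([0.0, 2.0, 4.0], [1.0, 2.0, 3.0])
```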
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.residuals","title":"residuals(target_column, prediction_df)
","text":"Add the residuals to the prediction DataFrame Args: target_column (str): Name of the target column prediction_df (pd.DataFrame): DataFrame with the prediction results Returns: pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'
Source code in src/sageworks/core/artifacts/endpoint_core.py
def residuals(self, target_column: str, prediction_df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Add the residuals to the prediction DataFrame\n Args:\n target_column (str): Name of the target column\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n Returns:\n pd.DataFrame: DataFrame with two new columns called 'residuals' and 'residuals_abs'\n \"\"\"\n\n # Compute the residuals\n y_true = prediction_df[target_column]\n prediction_col = \"prediction\" if \"prediction\" in prediction_df.columns else \"predictions\"\n y_pred = prediction_df[prediction_col]\n\n # Check for classification scenario\n if not pd.api.types.is_numeric_dtype(y_true) or not pd.api.types.is_numeric_dtype(y_pred):\n self.log.warning(\"Target and Prediction columns are not numeric. Computing 'diffs'...\")\n prediction_df[\"residuals\"] = (y_true != y_pred).astype(int)\n prediction_df[\"residuals_abs\"] = prediction_df[\"residuals\"]\n else:\n # Compute numeric residuals for regression\n prediction_df[\"residuals\"] = y_true - y_pred\n prediction_df[\"residuals_abs\"] = np.abs(prediction_df[\"residuals\"])\n\n return prediction_df\n
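The two branches above (numeric residuals for regression, 0/1 mismatch flags otherwise) can be shown on plain lists. This is a sketch under assumptions: `residual_columns` is a hypothetical helper, and the per-value `isinstance` check is a rough substitute for the pandas dtype check used in the method.

```python
from numbers import Number


def residual_columns(y_true, y_pred):
    '''Return (residuals, residuals_abs) for two equal-length lists.'''
    values = list(y_true) + list(y_pred)
    if all(isinstance(v, Number) for v in values):
        # Regression case: signed residual and its absolute value
        residuals = [t - p for t, p in zip(y_true, y_pred)]
        return residuals, [abs(r) for r in residuals]
    # Classification case: 1 where the labels differ, 0 where they match
    diffs = [int(t != p) for t, p in zip(y_true, y_pred)]
    return diffs, list(diffs)


numeric = residual_columns([3.0, 1.0], [2.5, 2.0])
labels = residual_columns(['a', 'b'], ['a', 'c'])
```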
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Default
input str Name of input for this artifact required
force bool Force the input to be set. Defaults to False. False
Note: Endpoints do not allow manual override of the input unless force=True
Source code in src/sageworks/core/artifacts/endpoint_core.py
def set_input(self, input: str, force=False):\n    \"\"\"Override: Set the input data for this artifact\n\n    Args:\n        input (str): Name of input for this artifact\n        force (bool, optional): Force the input to be set. Defaults to False.\n    Note:\n        Endpoints do not allow manual override of the input unless force=True\n    \"\"\"\n    if not force:\n        self.log.warning(f\"Endpoint {self.uuid}: Does not allow manual override of the input!\")\n        return\n\n    # Okay we're going to allow this to be set\n    self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n    self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n    self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/endpoint_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/endpoint_core/#sageworks.core.artifacts.endpoint_core.EndpointCore.validate_proba_columns","title":"validate_proba_columns(prediction_df, class_labels, guessing=False)
staticmethod
","text":"Ensure probability columns are correctly aligned with class labels
Parameters:
Name Type Description Default
prediction_df DataFrame DataFrame with the prediction results required
class_labels list List of class labels required
guessing bool Whether we're guessing the class labels. Defaults to False. False
Source code in src/sageworks/core/artifacts/endpoint_core.py
@staticmethod\ndef validate_proba_columns(prediction_df: pd.DataFrame, class_labels: list, guessing: bool = False):\n \"\"\"Ensure probability columns are correctly aligned with class labels\n\n Args:\n prediction_df (pd.DataFrame): DataFrame with the prediction results\n class_labels (list): List of class labels\n guessing (bool, optional): Whether we're guessing the class labels. Defaults to False.\n \"\"\"\n proba_columns = [col.replace(\"_proba\", \"\") for col in prediction_df.columns if col.endswith(\"_proba\")]\n\n if sorted(class_labels) != sorted(proba_columns):\n if guessing:\n raise ValueError(f\"_proba columns {proba_columns} != GUESSED class_labels {class_labels}!\")\n else:\n raise ValueError(f\"_proba columns {proba_columns} != class_labels {class_labels}!\")\n
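The alignment check above compares class labels against the column names that end in `_proba`. A minimal sketch on a plain list of column names follows; `proba_labels` and `check_proba_columns` are hypothetical helpers. One deliberate difference: this sketch strips only the trailing suffix, whereas the method above uses `str.replace`, which would also remove `_proba` if it appeared elsewhere in a column name.

```python
def proba_labels(columns):
    '''Extract class labels from column names that end in _proba.'''
    suffix = '_proba'
    return [col[: -len(suffix)] for col in columns if col.endswith(suffix)]


def check_proba_columns(columns, class_labels):
    '''Raise ValueError when the _proba columns and class labels disagree.'''
    labels = proba_labels(columns)
    # Order does not matter, only that both sides hold the same labels
    if sorted(class_labels) != sorted(labels):
        raise ValueError(f'_proba columns {labels} != class_labels {class_labels}!')


columns = ['id', 'low_proba', 'high_proba', 'prediction']
check_proba_columns(columns, ['high', 'low'])  # Same labels, so no error
```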
"},{"location":"core_classes/artifacts/feature_set_core/","title":"FeatureSetCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the FeatureSet API Class and voil\u00e0 it works the same.
FeatureSet: SageWorks Feature Set accessible through Athena
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore","title":"FeatureSetCore
","text":" Bases: Artifact
FeatureSetCore: SageWorks FeatureSetCore Class
Common Usage
my_features = FeatureSetCore(feature_uuid)\nmy_features.summary()\nmy_features.details()\n
Source code in src/sageworks/core/artifacts/feature_set_core.py
class FeatureSetCore(Artifact):\n \"\"\"FeatureSetCore: SageWorks FeatureSetCore Class\n\n Common Usage:\n ```python\n my_features = FeatureSetCore(feature_uuid)\n my_features.summary()\n my_features.details()\n ```\n \"\"\"\n\n def __init__(self, feature_set_uuid: str, **kwargs):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.is_name_valid(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid, **kwargs)\n\n # Get our FeatureSet metadata\n self.feature_meta = self.meta.feature_set(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.warning(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.id_column = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_table = self.feature_meta[\"sageworks_meta\"][\"athena_table\"]\n self.athena_database = self.feature_meta[\"sageworks_meta\"][\"athena_database\"]\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}...\")\n\n @property\n def table(self) -> str:\n \"\"\"Get the base table name for this FeatureSet\"\"\"\n return self.data_source.table\n\n def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n\n 
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n\n def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n\n @property\n def columns(self) -> list[str]:\n \"\"\"Return the column names of the Feature Set\"\"\"\n return list(self.column_details().keys())\n\n @property\n def column_types(self) -> list[str]:\n \"\"\"Return the column types of the Feature Set\"\"\"\n return list(self.column_details().values())\n\n def column_details(self) -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't call just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in 
self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details()\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n\n def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self.data_source)\n\n def view(self, view_name: str) -> \"View\":\n \"\"\"Return a DataFrame for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n pd.DataFrame: A DataFrame for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n\n def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n\n def set_computation_columns(self, computation_columns: list[str], reset_display: bool = True):\n \"\"\"Set the computation columns for this FeatureSet\n\n Args:\n computation_columns (list[str]): The computation columns for this FeatureSet\n reset_display (bool): Also reset the display columns to match (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n 
# Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n self.recompute_stats()\n\n # Reset the display columns to match the computation columns\n if reset_display:\n self.set_display_columns(computation_columns)\n\n def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.columns)\n\n def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n\n def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n sageworks_details = self.data_source.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to this from AWS Metadata\n return self.feature_meta[\"CreationTime\"]\n\n def hash(self) -> str:\n \"\"\"Return the hash for the set of Parquet files for this artifact\"\"\"\n return self.data_source.hash()\n\n def table_hash(self) -> str:\n \"\"\"Return the hash for the Athena table\"\"\"\n return self.data_source.table_hash()\n\n def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n\n def 
get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n\n def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Make the query\n table = self.view(\"training\").table\n query = f'SELECT * FROM \"{table}\"'\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n\n def get_training_data(self) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n from sageworks.core.views.view import View\n\n return View(self, \"training\").pull_dataframe()\n\n def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The 
Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.id_column} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n\n def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n self.log.info(f\"Computing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Return the 
details data\n return details\n\n def delete(self):\n \"\"\"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n # Make sure the AthenaSource exists\n if not self.exists():\n self.log.warning(f\"Trying to delete an FeatureSet that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the FeatureSet\n FeatureSetCore.managed_delete(self.uuid)\n\n @classmethod\n def managed_delete(cls, feature_set_name: str):\n \"\"\"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\n\n Args:\n feature_set_name (str): The Name of the FeatureSet to delete\n \"\"\"\n\n # See if the FeatureSet exists\n try:\n response = cls.sm_client.describe_feature_group(FeatureGroupName=feature_set_name)\n except cls.sm_client.exceptions.ResourceNotFound:\n cls.log.info(f\"FeatureSet {feature_set_name} not found!\")\n return\n\n # Extract database and table information from the response\n offline_config = response.get(\"OfflineStoreConfig\", {})\n database = offline_config.get(\"DataCatalogConfig\", {}).get(\"Database\")\n offline_table = offline_config.get(\"DataCatalogConfig\", {}).get(\"TableName\")\n data_source_uuid = offline_table # Our offline storage IS a DataSource\n\n # Delete the Feature Group and ensure that it gets deleted\n cls.log.important(f\"Deleting FeatureSet {feature_set_name}...\")\n remove_fg = cls.aws_feature_group_delete(feature_set_name)\n cls.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n AthenaSource.managed_delete(data_source_uuid, database=database)\n\n # Delete any views associated with this FeatureSet\n cls.delete_views(offline_table, database)\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = cls.feature_sets_s3_path + f\"/{feature_set_name}/\"\n cls.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n 
wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(feature_set_name)\n\n @classmethod\n @aws_throttle\n def aws_feature_group_delete(cls, feature_set_name):\n remove_fg = FeatureGroup(name=feature_set_name, sagemaker_session=cls.sm_session)\n remove_fg.delete()\n return remove_fg\n\n @classmethod\n def ensure_feature_group_deleted(cls, feature_group):\n status = \"Deleting\"\n while status == \"Deleting\":\n cls.log.debug(\"FeatureSet being Deleted...\")\n try:\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n except botocore.exceptions.ClientError as error:\n # For ResourceNotFound/ValidationException, this is fine, otherwise raise all other exceptions\n if error.response[\"Error\"][\"Code\"] in [\"ResourceNotFound\", \"ValidationException\"]:\n break\n else:\n raise error\n time.sleep(1)\n cls.log.info(f\"FeatureSet {feature_group.name} successfully deleted\")\n\n def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for the training view for this FeatureSet\n\n Args:\n id_column (str): The name of the id column.\n holdout_ids (list[str]): The list of holdout ids.\n \"\"\"\n from sageworks.core.views import TrainingView\n\n # Create a NEW training view\n self.log.important(f\"Setting Training Holdouts: {len(holdout_ids)} ids...\")\n TrainingView.create(self, id_column=id_column, holdout_ids=holdout_ids)\n\n @classmethod\n def delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n\n def descriptive_stats(self, recompute: bool = False) -> 
dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n\n def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n\n def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n\n def smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n\n Args:\n recompute (bool): Recompute the smart sample (default=False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample(recompute=recompute)\n\n def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = 
np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n\n def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n\n def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n\n def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSets underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n\n def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? 
Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n\n def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n\n def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the FeatureSet\"\"\"\n\n # Call our underlying DataSource recompute stats method\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n self.data_source.recompute_stats()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_types","title":"column_types: list[str]
property
","text":"Return the column types of the Feature Set
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.columns","title":"columns: list[str]
property
","text":"Return the column names of the Feature Set
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.table","title":"table: str
property
","text":"Get the base table name for this FeatureSet
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.__init__","title":"__init__(feature_set_uuid, **kwargs)
","text":"FeatureSetCore Initialization
Parameters: feature_set_uuid (str): Name of Feature Set (required)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def __init__(self, feature_set_uuid: str, **kwargs):\n \"\"\"FeatureSetCore Initialization\n\n Args:\n feature_set_uuid (str): Name of Feature Set\n \"\"\"\n\n # Make sure the feature_set name is valid\n self.is_name_valid(feature_set_uuid)\n\n # Call superclass init\n super().__init__(feature_set_uuid, **kwargs)\n\n # Get our FeatureSet metadata\n self.feature_meta = self.meta.feature_set(self.uuid)\n\n # Sanity check and then set up our FeatureSet attributes\n if self.feature_meta is None:\n self.log.warning(f\"Could not find feature set {self.uuid} within current visibility scope\")\n self.data_source = None\n return\n else:\n self.id_column = self.feature_meta[\"RecordIdentifierFeatureName\"]\n self.event_time = self.feature_meta[\"EventTimeFeatureName\"]\n\n # Pull Athena and S3 Storage information from metadata\n self.athena_table = self.feature_meta[\"sageworks_meta\"][\"athena_table\"]\n self.athena_database = self.feature_meta[\"sageworks_meta\"][\"athena_database\"]\n self.s3_storage = self.feature_meta[\"sageworks_meta\"].get(\"s3_storage\")\n\n # Create our internal DataSource (hardcoded to Athena for now)\n self.data_source = AthenaSource(self.athena_table, self.athena_database)\n\n # Spin up our Feature Store\n self.feature_store = FeatureStore(self.sm_session)\n\n # Call superclass post_init\n super().__post_init__()\n\n # All done\n self.log.info(f\"FeatureSet Initialized: {self.uuid}...\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.anomalies","title":"anomalies()
","text":"Get a set of anomalous data from the underlying DataSource Returns: pd.DataFrame: A dataframe of anomalies from the underlying DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py
def anomalies(self) -> pd.DataFrame:\n \"\"\"Get a set of anomalous data from the underlying DataSource\n Returns:\n pd.DataFrame: A dataframe of anomalies from the underlying DataSource\n \"\"\"\n\n # FIXME: Mock this for now\n anom_df = self.sample().copy()\n anom_df[\"anomaly_score\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"cluster\"] = np.random.randint(0, 10, anom_df.shape[0])\n anom_df[\"x\"] = np.random.rand(anom_df.shape[0])\n anom_df[\"y\"] = np.random.rand(anom_df.shape[0])\n return anom_df\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for this artifact\"\"\"\n return self.feature_meta[\"FeatureGroupArn\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.feature_meta\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying the underlying data source
Source code in src/sageworks/core/artifacts/feature_set_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying the underlying data source\"\"\"\n sageworks_details = self.data_source.sageworks_meta().get(\"sageworks_details\", {})\n return sageworks_details.get(\"aws_url\", \"unknown\")\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_details","title":"column_details()
","text":"Return the column details of the Feature Set
Returns: dict: The column details of the Feature Set
Notes: We can't just call self.data_source.column_details() because FeatureSets have different types, so we need to overlay that type information on top of the DataSource type information
Source code in src/sageworks/core/artifacts/feature_set_core.py
def column_details(self) -> dict:\n \"\"\"Return the column details of the Feature Set\n\n Returns:\n dict: The column details of the Feature Set\n\n Notes:\n We can't just call self.data_source.column_details() because FeatureSets have different\n types, so we need to overlay that type information on top of the DataSource type information\n \"\"\"\n fs_details = {item[\"FeatureName\"]: item[\"FeatureType\"] for item in self.feature_meta[\"FeatureDefinitions\"]}\n ds_details = self.data_source.column_details()\n\n # Overlay the FeatureSet type information on top of the DataSource type information\n for col, dtype in ds_details.items():\n ds_details[col] = fs_details.get(col, dtype)\n return ds_details\n
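The type overlay described in the Notes can be tried without any AWS access. The following is a minimal sketch, not the SageWorks implementation: overlay_types is a hypothetical helper, and the column names and types are made up for illustration. Unlike the method above, it returns a new dictionary instead of updating the DataSource details in place.

```python
def overlay_types(ds_details: dict, fs_details: dict) -> dict:
    # FeatureSet types win; DataSource types are the fallback for columns
    # that have no FeatureSet definition
    return {col: fs_details.get(col, dtype) for col, dtype in ds_details.items()}


# Hypothetical column types, for illustration only
ds_details = {'id': 'bigint', 'name': 'string', 'write_time': 'timestamp'}
fs_details = {'id': 'Integral', 'name': 'String'}
merged = overlay_types(ds_details, fs_details)
print(merged)
```

Columns that exist only in the DataSource (write_time here) keep their DataSource type, which matches the fs_details.get(col, dtype) fallback in the method.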
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.column_stats","title":"column_stats(recompute=False)
","text":"Compute Column Stats for all the columns in the FeatureSet's underlying DataSource Args: recompute (bool): Recompute the column stats (default: False) Returns: dict(dict): A dictionary of stats for each column in this format NB: String columns will NOT have num_zeros and descriptive_stats {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12}, 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}}, ...}
Source code in src/sageworks/core/artifacts/feature_set_core.py
def column_stats(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Compute Column Stats for all the columns in the FeatureSet's underlying DataSource\n Args:\n recompute (bool): Recompute the column stats (default: False)\n Returns:\n dict(dict): A dictionary of stats for each column in this format\n NB: String columns will NOT have num_zeros and descriptive_stats\n {'col1': {'dtype': 'string', 'unique': 4321, 'nulls': 12},\n 'col2': {'dtype': 'int', 'unique': 4321, 'nulls': 12, 'num_zeros': 100, 'descriptive_stats': {...}},\n ...}\n \"\"\"\n\n # Grab the column stats from our DataSource\n ds_column_stats = self.data_source.column_stats(recompute)\n\n # Map the types from our DataSource to the FeatureSet types\n fs_type_mapper = self.column_details()\n for col, details in ds_column_stats.items():\n details[\"fs_dtype\"] = fs_type_mapper.get(col, \"unknown\")\n\n return ds_column_stats\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.correlations","title":"correlations(recompute=False)
","text":"Get the correlations for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the correlations (default=False) Returns: dict: A dictionary of correlations for the numeric columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def correlations(self, recompute: bool = False) -> dict:\n \"\"\"Get the correlations for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the correlations (default=False)\n Returns:\n dict: A dictionary of correlations for the numeric columns\n \"\"\"\n return self.data_source.correlations(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.create_s3_training_data","title":"create_s3_training_data()
","text":"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want additional options/features use the get_feature_store() method and see AWS docs for all the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html Returns: str: The full path/file for the CSV file created by Feature Store create_dataset()
Source code in src/sageworks/core/artifacts/feature_set_core.py
def create_s3_training_data(self) -> str:\n \"\"\"Create some Training Data (S3 CSV) from a Feature Set using standard options. If you want\n additional options/features use the get_feature_store() method and see AWS docs for all\n the details: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n Returns:\n str: The full path/file for the CSV file created by Feature Store create_dataset()\n \"\"\"\n\n # Set up the S3 Query results path\n date_time = datetime.now(timezone.utc).strftime(\"%Y-%m-%d_%H:%M:%S\")\n s3_output_path = self.feature_sets_s3_path + f\"/{self.uuid}/datasets/all_{date_time}\"\n\n # Make the query\n table = self.view(\"training\").table\n query = f'SELECT * FROM \"{table}\"'\n athena_query = FeatureGroup(name=self.uuid, sagemaker_session=self.sm_session).athena_query()\n athena_query.run(query, output_location=s3_output_path)\n athena_query.wait()\n query_execution = athena_query.get_query_execution()\n\n # Get the full path to the S3 files with the results\n full_s3_path = s3_output_path + f\"/{query_execution['QueryExecution']['QueryExecutionId']}.csv\"\n return full_s3_path\n
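The path handling in create_s3_training_data() can be followed in isolation. This is a sketch under stated assumptions: dataset_output_path and result_csv_path are hypothetical helpers that only mirror the string building above, and the bucket, FeatureSet name, and execution id are invented. No Athena query is run.

```python
from datetime import datetime, timezone


def dataset_output_path(base_path: str, feature_set_name: str, now: datetime) -> str:
    # Same layout as the method above: <base>/<name>/datasets/all_<UTC timestamp>
    date_time = now.strftime('%Y-%m-%d_%H:%M:%S')
    return f'{base_path}/{feature_set_name}/datasets/all_{date_time}'


def result_csv_path(output_path: str, query_execution_id: str) -> str:
    # The method above builds the CSV name from the query execution id
    return f'{output_path}/{query_execution_id}.csv'


# Hypothetical bucket, FeatureSet name, and execution id
fixed_now = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
out = dataset_output_path('s3://my-bucket/feature-sets', 'abalone_features', fixed_now)
csv_path = result_csv_path(out, '1234-abcd')
print(csv_path)
```

Passing the timestamp in as an argument, rather than calling datetime.now() inside the helper, is a choice made here so the result is repeatable.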
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/feature_set_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete","title":"delete()
","text":"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Source code in src/sageworks/core/artifacts/feature_set_core.py
def delete(self):\n \"\"\"Instance Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\"\"\"\n # Make sure the FeatureSet exists\n if not self.exists():\n self.log.warning(f\"Trying to delete a FeatureSet that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the FeatureSet\n FeatureSetCore.managed_delete(self.uuid)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.delete_views","title":"delete_views(table, database)
classmethod
","text":"Delete any views associated with this FeatureSet
Parameters: table (str): Name of Athena Table (required)
database (str): Athena Database Name (required)
Source code in src/sageworks/core/artifacts/feature_set_core.py
@classmethod\ndef delete_views(cls, table: str, database: str):\n \"\"\"Delete any views associated with this FeatureSet\n\n Args:\n table (str): Name of Athena Table\n database (str): Athena Database Name\n \"\"\"\n from sageworks.core.views.view_utils import delete_views_and_supplemental_data\n\n delete_views_and_supplemental_data(table, database, cls.boto3_session)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.descriptive_stats","title":"descriptive_stats(recompute=False)
","text":"Get the descriptive stats for the numeric columns of the underlying DataSource Args: recompute (bool): Recompute the descriptive stats (default=False) Returns: dict: A dictionary of descriptive stats for the numeric columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def descriptive_stats(self, recompute: bool = False) -> dict:\n \"\"\"Get the descriptive stats for the numeric columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the descriptive stats (default=False)\n Returns:\n dict: A dictionary of descriptive stats for the numeric columns\n \"\"\"\n return self.data_source.descriptive_stats(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.details","title":"details(recompute=False)
","text":"Additional Details about this FeatureSet Artifact
Parameters: recompute (bool): Recompute the details (default: False)
Returns: dict: A dictionary of details about this FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def details(self, recompute: bool = False) -> dict[dict]:\n \"\"\"Additional Details about this FeatureSet Artifact\n\n Args:\n recompute (bool): Recompute the details (default: False)\n\n Returns:\n dict(dict): A dictionary of details about this FeatureSet\n \"\"\"\n\n self.log.info(f\"Computing FeatureSet Details ({self.uuid})...\")\n details = self.summary()\n details[\"aws_url\"] = self.aws_url()\n\n # Store the AWS URL in the SageWorks Metadata\n # FIXME: We need to revisit this but doing an upsert just for aws_url is silly\n # self.upsert_sageworks_meta({\"aws_url\": details[\"aws_url\"]})\n\n # Now get a summary of the underlying DataSource\n details[\"storage_summary\"] = self.data_source.summary()\n\n # Number of Columns\n details[\"num_columns\"] = self.num_columns()\n\n # Number of Rows\n details[\"num_rows\"] = self.num_rows()\n\n # Additional Details\n details[\"sageworks_status\"] = self.get_status()\n details[\"sageworks_input\"] = self.get_input()\n details[\"sageworks_tags\"] = self.tag_delimiter.join(self.get_tags())\n\n # Underlying Storage Details\n details[\"storage_type\"] = \"athena\" # TODO: Add RDS support\n details[\"storage_uuid\"] = self.data_source.uuid\n\n # Add the column details and column stats\n details[\"column_details\"] = self.column_details()\n details[\"column_stats\"] = self.column_stats()\n\n # Return the details data\n return details\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.exists","title":"exists()
","text":"Does the feature_set_name exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/feature_set_core.py
def exists(self) -> bool:\n \"\"\"Does the feature_set_name exist in the AWS Metadata?\"\"\"\n if self.feature_meta is None:\n self.log.debug(f\"FeatureSet {self.uuid} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_data_source","title":"get_data_source()
","text":"Return the underlying DataSource object
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_data_source(self) -> DataSourceFactory:\n \"\"\"Return the underlying DataSource object\"\"\"\n return self.data_source\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_feature_store","title":"get_feature_store()
","text":"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage with create_dataset() such as Joins and time ranges and a host of other options See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_feature_store(self) -> FeatureStore:\n \"\"\"Return the underlying AWS FeatureStore object. This can be useful for more advanced usage\n with create_dataset() such as Joins and time ranges and a host of other options\n See: https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-a-dataset.html\n \"\"\"\n return self.feature_store\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.get_training_data","title":"get_training_data()
","text":"Get the training data for this FeatureSet
Returns: pd.DataFrame: The training data for this FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def get_training_data(self) -> pd.DataFrame:\n \"\"\"Get the training data for this FeatureSet\n\n Returns:\n pd.DataFrame: The training data for this FeatureSet\n \"\"\"\n from sageworks.core.views.view import View\n\n return View(self, \"training\").pull_dataframe()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.hash","title":"hash()
","text":"Return the hash for the set of Parquet files for this artifact
Source code in src/sageworks/core/artifacts/feature_set_core.py
def hash(self) -> str:\n \"\"\"Return the hash for the set of Parquet files for this artifact\"\"\"\n return self.data_source.hash()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.health_check","title":"health_check()
","text":"Perform a health check on this FeatureSet
Returns: list[str]: List of health issues
Source code in src/sageworks/core/artifacts/feature_set_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this FeatureSet\n\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # If we have a 'needs_onboard' in the health check then just return\n if \"needs_onboard\" in health_issues:\n return health_issues\n\n # Check our DataSource\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n health_issues.append(\"data_source_missing\")\n return health_issues\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.managed_delete","title":"managed_delete(feature_set_name)
classmethod
","text":"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects
Parameters:
Name Type Description Defaultfeature_set_name
str
The Name of the FeatureSet to delete
required Source code in src/sageworks/core/artifacts/feature_set_core.py
@classmethod\ndef managed_delete(cls, feature_set_name: str):\n \"\"\"Class Method: Delete the Feature Set: Feature Group, Catalog Table, and S3 Storage Objects\n\n Args:\n feature_set_name (str): The Name of the FeatureSet to delete\n \"\"\"\n\n # See if the FeatureSet exists\n try:\n response = cls.sm_client.describe_feature_group(FeatureGroupName=feature_set_name)\n except cls.sm_client.exceptions.ResourceNotFound:\n cls.log.info(f\"FeatureSet {feature_set_name} not found!\")\n return\n\n # Extract database and table information from the response\n offline_config = response.get(\"OfflineStoreConfig\", {})\n database = offline_config.get(\"DataCatalogConfig\", {}).get(\"Database\")\n offline_table = offline_config.get(\"DataCatalogConfig\", {}).get(\"TableName\")\n data_source_uuid = offline_table # Our offline storage IS a DataSource\n\n # Delete the Feature Group and ensure that it gets deleted\n cls.log.important(f\"Deleting FeatureSet {feature_set_name}...\")\n remove_fg = cls.aws_feature_group_delete(feature_set_name)\n cls.ensure_feature_group_deleted(remove_fg)\n\n # Delete our underlying DataSource (Data Catalog Table and S3 Storage Objects)\n AthenaSource.managed_delete(data_source_uuid, database=database)\n\n # Delete any views associated with this FeatureSet\n cls.delete_views(offline_table, database)\n\n # Feature Sets can often have a lot of cruft so delete the entire bucket/prefix\n s3_delete_path = cls.feature_sets_s3_path + f\"/{feature_set_name}/\"\n cls.log.info(f\"Deleting All FeatureSet S3 Storage Objects {s3_delete_path}\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(feature_set_name)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/feature_set_core.py
def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n # Note: We can't currently figure out how to get this from AWS Metadata, so we return the creation time\n return self.feature_meta[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_columns","title":"num_columns()
","text":"Return the number of columns of the Feature Set
Source code in src/sageworks/core/artifacts/feature_set_core.py
def num_columns(self) -> int:\n \"\"\"Return the number of columns of the Feature Set\"\"\"\n return len(self.columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.num_rows","title":"num_rows()
","text":"Return the number of rows of the internal DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py
def num_rows(self) -> int:\n \"\"\"Return the number of rows of the internal DataSource\"\"\"\n return self.data_source.num_rows()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.onboard","title":"onboard()
","text":"This is a BLOCKING method that will onboard the FeatureSet (make it ready)
Source code in src/sageworks/core/artifacts/feature_set_core.py
def onboard(self) -> bool:\n \"\"\"This is a BLOCKING method that will onboard the FeatureSet (make it ready)\"\"\"\n\n # Set our status to onboarding\n self.log.important(f\"Onboarding {self.uuid}...\")\n self.set_status(\"onboarding\")\n self.remove_health_tag(\"needs_onboard\")\n\n # Call our underlying DataSource onboard method\n self.data_source.refresh_meta()\n if not self.data_source.exists():\n self.log.critical(f\"Data Source check failed for {self.uuid}\")\n self.log.critical(\"Delete this Feature Set and recreate it to fix this issue\")\n return False\n if not self.data_source.ready():\n self.data_source.onboard()\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n self.set_status(\"ready\")\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.outliers","title":"outliers(scale=1.5, recompute=False)
","text":"Compute outliers for all the numeric columns in a DataSource Args: scale (float): The scale to use for the IQR (default: 1.5) recompute (bool): Recompute the outliers (default: False) Returns: pd.DataFrame: A DataFrame of outliers from this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers The scale parameter can be adjusted to change the IQR multiplier
Source code in src/sageworks/core/artifacts/feature_set_core.py
def outliers(self, scale: float = 1.5, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n scale (float): The scale to use for the IQR (default: 1.5)\n recompute (bool): Recompute the outliers (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers from this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) method to compute outliers\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n return self.data_source.outliers(scale=scale, recompute=recompute)\n
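The IQR rule named in the notes above can be sketched in plain Python. This is a minimal illustration, not the DataSource implementation: the helper name iqr_outliers and the sample data are hypothetical, and the quartile method used here (the standard library's inclusive method) is an assumption that may differ from what SageWorks computes in Athena.

```python
import statistics


def iqr_outliers(values, scale=1.5):
    # Hypothetical helper: quartiles from the standard library (inclusive method)
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    # Anything outside [q1 - scale * IQR, q3 + scale * IQR] counts as an outlier
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
default_outliers = iqr_outliers(data)           # default multiplier of 1.5
relaxed_outliers = iqr_outliers(data, scale=30)  # a much wider fence
print(default_outliers, relaxed_outliers)
```

With the default multiplier only the value 100 falls outside the fence; raising scale widens the fence until nothing is flagged.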
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.query","title":"query(query, overwrite=True)
","text":"Query the internal DataSource
Parameters:
Name Type Description Defaultquery
str
The query to run against the DataSource
requiredoverwrite
bool
Overwrite the table name in the query (default: True)
True
Returns:
Type DescriptionDataFrame
pd.DataFrame: The results of the query
Source code in src/sageworks/core/artifacts/feature_set_core.py
def query(self, query: str, overwrite: bool = True) -> pd.DataFrame:\n \"\"\"Query the internal DataSource\n\n Args:\n query (str): The query to run against the DataSource\n overwrite (bool): Overwrite the table name in the query (default: True)\n\n Returns:\n pd.DataFrame: The results of the query\n \"\"\"\n if overwrite:\n query = query.replace(\" \" + self.uuid + \" \", \" \" + self.athena_table + \" \")\n return self.data_source.query(query)\n
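The overwrite step above is a plain string replacement, so it only matches the FeatureSet name when it has a space on both sides. The sketch below reproduces that one line in isolation; the function name rewrite_table_name and the table names are hypothetical and used only for illustration.

```python
def rewrite_table_name(query, uuid, athena_table):
    # Same substitution as the overwrite branch: only a space-delimited name matches
    return query.replace(' ' + uuid + ' ', ' ' + athena_table + ' ')


# Name followed by a space: rewritten
q1 = rewrite_table_name(
    'SELECT * FROM my_features WHERE id > 10', 'my_features', 'my_features_1699999999'
)

# Name at the very end of the query (no trailing space): left unchanged
q2 = rewrite_table_name('SELECT * FROM my_features', 'my_features', 'my_features_1699999999')

print(q1)
print(q2)
```

Under this reading, a query that ends with the FeatureSet name, or that quotes it, would be passed through without the table name being swapped, so adding a trailing clause or space may be needed.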
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.ready","title":"ready()
","text":"Is the FeatureSet ready? Is initial setup complete and expected metadata populated? Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to check both to see if the FeatureSet is ready.
Source code in src/sageworks/core/artifacts/feature_set_core.py
def ready(self) -> bool:\n \"\"\"Is the FeatureSet ready? Is initial setup complete and expected metadata populated?\n Note: Since FeatureSet is a composite of DataSource and FeatureGroup, we need to\n check both to see if the FeatureSet is ready.\"\"\"\n\n # Check the expected metadata for the FeatureSet\n expected_meta = self.expected_meta()\n existing_meta = self.sageworks_meta()\n feature_set_ready = set(existing_meta.keys()).issuperset(expected_meta)\n if not feature_set_ready:\n self.log.info(f\"FeatureSet {self.uuid} is not ready!\")\n return False\n\n # Okay now call/return the DataSource ready() method\n return self.data_source.ready()\n
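The first half of the readiness check is a set comparison: every expected metadata key must already be present. A minimal sketch of that comparison follows; the function name meta_is_ready and the metadata keys are hypothetical placeholders, not the actual SageWorks key names.

```python
def meta_is_ready(existing_meta, expected_meta):
    # Ready only when every expected key is present; extra keys are fine
    return set(existing_meta.keys()).issuperset(expected_meta)


expected = ['example_status', 'example_input']  # hypothetical key names
complete = {'example_status': 'ready', 'example_input': 'abc', 'extra': 1}
partial = {'example_status': 'onboarding'}

print(meta_is_ready(complete, expected))
print(meta_is_ready(partial, expected))
```

In the method above, a passing metadata check is followed by the DataSource's own ready() call, so both must succeed.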
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.recompute_stats","title":"recompute_stats()
","text":"This is a BLOCKING method that will recompute the stats for the FeatureSet
Source code in src/sageworks/core/artifacts/feature_set_core.py
def recompute_stats(self) -> bool:\n \"\"\"This is a BLOCKING method that will recompute the stats for the FeatureSet\"\"\"\n\n # Call our underlying DataSource recompute stats method\n self.log.important(f\"Recomputing Stats {self.uuid}...\")\n self.data_source.recompute_stats()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.refresh_meta","title":"refresh_meta()
","text":"Internal: Refresh our internal AWS Feature Store metadata
Source code in src/sageworks/core/artifacts/feature_set_core.py
def refresh_meta(self):\n \"\"\"Internal: Refresh our internal AWS Feature Store metadata\"\"\"\n self.log.info(\"Calling refresh_meta() on the underlying DataSource\")\n self.data_source.refresh_meta()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.sample","title":"sample(recompute=False)
","text":"Get a sample of the data from the underlying DataSource Args: recompute (bool): Recompute the sample (default=False) Returns: pd.DataFrame: A sample of the data from the underlying DataSource
Source code in src/sageworks/core/artifacts/feature_set_core.py
def sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a sample of the data from the underlying DataSource\n Args:\n recompute (bool): Recompute the sample (default=False)\n Returns:\n pd.DataFrame: A sample of the data from the underlying DataSource\n \"\"\"\n return self.data_source.sample(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_computation_columns","title":"set_computation_columns(computation_columns, reset_display=True)
","text":"Set the computation columns for this FeatureSet
Parameters:
Name Type Description Defaultcomputation_columns
list[str]
The computation columns for this FeatureSet
requiredreset_display
bool
Also reset the display columns to match (default: True)
True
Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_computation_columns(self, computation_columns: list[str], reset_display: bool = True):\n \"\"\"Set the computation columns for this FeatureSet\n\n Args:\n computation_columns (list[str]): The computation columns for this FeatureSet\n reset_display (bool): Also reset the display columns to match (default: True)\n \"\"\"\n self.log.important(f\"Setting Computation Columns...{computation_columns}\")\n from sageworks.core.views import ComputationView\n\n # Create a NEW computation view\n ComputationView.create(self, column_list=computation_columns)\n self.recompute_stats()\n\n # Reset the display columns to match the computation columns\n if reset_display:\n self.set_display_columns(computation_columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_display_columns","title":"set_display_columns(diplay_columns)
","text":"Set the display columns for this Data Source
Parameters:
Name Type Description Defaultdiplay_columns
list[str]
The display columns for this Data Source
required Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_display_columns(self, diplay_columns: list[str]):\n \"\"\"Set the display columns for this Data Source\n\n Args:\n diplay_columns (list[str]): The display columns for this Data Source\n \"\"\"\n # Check mismatch of display columns to computation columns\n c_view = self.view(\"computation\")\n computation_columns = c_view.columns\n mismatch_columns = [col for col in diplay_columns if col not in computation_columns]\n if mismatch_columns:\n self.log.monitor(f\"Display View/Computation mismatch: {mismatch_columns}\")\n\n self.log.important(f\"Setting Display Columns...{diplay_columns}\")\n from sageworks.core.views import DisplayView\n\n # Create a NEW display view\n DisplayView.create(self, source_table=c_view.table, column_list=diplay_columns)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.set_training_holdouts","title":"set_training_holdouts(id_column, holdout_ids)
","text":"Set the hold out ids for the training view for this FeatureSet
Parameters:
Name Type Description Defaultid_column
str
The name of the id column.
requiredholdout_ids
list[str]
The list of holdout ids.
required Source code in src/sageworks/core/artifacts/feature_set_core.py
def set_training_holdouts(self, id_column: str, holdout_ids: list[str]):\n \"\"\"Set the hold out ids for the training view for this FeatureSet\n\n Args:\n id_column (str): The name of the id column.\n holdout_ids (list[str]): The list of holdout ids.\n \"\"\"\n from sageworks.core.views import TrainingView\n\n # Create a NEW training view\n self.log.important(f\"Setting Training Holdouts: {len(holdout_ids)} ids...\")\n TrainingView.create(self, id_column=id_column, holdout_ids=holdout_ids)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.size","title":"size()
","text":"Return the size of the internal DataSource in MegaBytes
Source code in src/sageworks/core/artifacts/feature_set_core.py
def size(self) -> float:\n \"\"\"Return the size of the internal DataSource in MegaBytes\"\"\"\n return self.data_source.size()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.smart_sample","title":"smart_sample(recompute=False)
","text":"Get a SMART sample dataframe from this FeatureSet
Parameters:
Name Type Description Defaultrecompute
bool
Recompute the smart sample (default=False)
False
Returns:
Type DescriptionDataFrame
pd.DataFrame: A combined DataFrame of sample data + outliers
Source code in src/sageworks/core/artifacts/feature_set_core.py
def smart_sample(self, recompute: bool = False) -> pd.DataFrame:\n \"\"\"Get a SMART sample dataframe from this FeatureSet\n\n Args:\n recompute (bool): Recompute the smart sample (default=False)\n\n Returns:\n pd.DataFrame: A combined DataFrame of sample data + outliers\n \"\"\"\n return self.data_source.smart_sample(recompute=recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.snapshot_query","title":"snapshot_query(table_name=None)
","text":"An Athena query to get the latest snapshot of features
Parameters:
Name Type Description Defaulttable_name
str
The name of the table to query (default: None)
None
Returns:
Name Type Descriptionstr
str
The Athena query to get the latest snapshot of features
Source code in src/sageworks/core/artifacts/feature_set_core.py
def snapshot_query(self, table_name: str = None) -> str:\n \"\"\"An Athena query to get the latest snapshot of features\n\n Args:\n table_name (str): The name of the table to query (default: None)\n\n Returns:\n str: The Athena query to get the latest snapshot of features\n \"\"\"\n # Remove FeatureGroup metadata columns that might have gotten added\n columns = self.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join(['\"' + x + '\"' for x in columns if x not in filter_columns])\n\n query = (\n f\"SELECT {columns} \"\n f\" FROM (SELECT *, row_number() OVER (PARTITION BY {self.id_column} \"\n f\" ORDER BY {self.event_time} desc, api_invocation_time DESC, write_time DESC) AS row_num \"\n f' FROM \"{table_name}\") '\n \" WHERE row_num = 1 and NOT is_deleted;\"\n )\n return query\n
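To make the intent of the generated SQL concrete, the sketch below performs the same selection in plain Python: rank the rows for each id by event time, then API invocation time, then write time, keep the newest one, and drop it if it is marked deleted. The function name latest_snapshot and the sample rows are hypothetical. Note also that table_name is interpolated directly into the query, so with the default of None the query would reference a table literally named None; passing an explicit table name appears to be expected.

```python
def latest_snapshot(rows, id_column, event_time):
    # Keep the highest-ranked row per id (mirrors row_number() ... WHERE row_num = 1)
    latest = {}
    for row in rows:
        key = row[id_column]
        rank = (row[event_time], row['api_invocation_time'], row['write_time'])
        if key not in latest or rank > latest[key][0]:
            latest[key] = (rank, row)

    # Drop FeatureGroup bookkeeping columns and ids whose newest row is deleted
    meta_columns = {'write_time', 'api_invocation_time', 'is_deleted'}
    return [
        {k: v for k, v in row.items() if k not in meta_columns}
        for _, row in latest.values()
        if not row['is_deleted']
    ]


rows = [
    {'id': 1, 'event_time': 10, 'api_invocation_time': 1, 'write_time': 1, 'is_deleted': False, 'value': 'old'},
    {'id': 1, 'event_time': 20, 'api_invocation_time': 2, 'write_time': 2, 'is_deleted': False, 'value': 'new'},
    {'id': 2, 'event_time': 10, 'api_invocation_time': 1, 'write_time': 1, 'is_deleted': False, 'value': 'live'},
    {'id': 2, 'event_time': 30, 'api_invocation_time': 3, 'write_time': 3, 'is_deleted': True, 'value': 'tombstone'},
]
snapshot = latest_snapshot(rows, 'id', 'event_time')
print(snapshot)
```

Because the deleted flag is checked after the newest row is chosen, an id whose most recent record is a delete disappears from the snapshot entirely rather than falling back to an older record.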
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.table_hash","title":"table_hash()
","text":"Return the hash for the Athena table
Source code in src/sageworks/core/artifacts/feature_set_core.py
def table_hash(self) -> str:\n \"\"\"Return the hash for the Athena table\"\"\"\n return self.data_source.table_hash()\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.value_counts","title":"value_counts(recompute=False)
","text":"Get the value counts for the string columns of the underlying DataSource Args: recompute (bool): Recompute the value counts (default=False) Returns: dict: A dictionary of value counts for the string columns
Source code in src/sageworks/core/artifacts/feature_set_core.py
def value_counts(self, recompute: bool = False) -> dict:\n \"\"\"Get the value counts for the string columns of the underlying DataSource\n Args:\n recompute (bool): Recompute the value counts (default=False)\n Returns:\n dict: A dictionary of value counts for the string columns\n \"\"\"\n return self.data_source.value_counts(recompute)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.view","title":"view(view_name)
","text":"Return a View object for a specific view Args: view_name (str): The name of the view to return Returns: View: The View object for the specified view
Source code in src/sageworks/core/artifacts/feature_set_core.py
def view(self, view_name: str) -> \"View\":\n \"\"\"Return a View object for a specific view\n Args:\n view_name (str): The name of the view to return\n Returns:\n View: The View object for the specified view\n \"\"\"\n from sageworks.core.views import View\n\n return View(self, view_name)\n
"},{"location":"core_classes/artifacts/feature_set_core/#sageworks.core.artifacts.feature_set_core.FeatureSetCore.views","title":"views()
","text":"Return the views for this Data Source
Source code in src/sageworks/core/artifacts/feature_set_core.py
def views(self) -> list[str]:\n \"\"\"Return the views for this Data Source\"\"\"\n from sageworks.core.views.view_utils import list_views\n\n return list_views(self.data_source)\n
"},{"location":"core_classes/artifacts/model_core/","title":"ModelCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Model API Class and voil\u00e0 it works the same.
ModelCore: SageWorks ModelCore Class
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.InferenceImage","title":"InferenceImage
","text":"Class for retrieving locked Scikit-Learn inference images
Source code in src/sageworks/core/artifacts/model_core.py
class InferenceImage:\n \"\"\"Class for retrieving locked Scikit-Learn inference images\"\"\"\n\n image_uris = {\n (\"us-east-1\", \"sklearn\", \"1.2.1\"): (\n \"683313688378.dkr.ecr.us-east-1.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-east-2\", \"sklearn\", \"1.2.1\"): (\n \"257758044811.dkr.ecr.us-east-2.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-west-1\", \"sklearn\", \"1.2.1\"): (\n \"746614075791.dkr.ecr.us-west-1.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n (\"us-west-2\", \"sklearn\", \"1.2.1\"): (\n \"246618743249.dkr.ecr.us-west-2.amazonaws.com/\"\n \"sagemaker-scikit-learn@sha256:ed242e33af079f334972acd2a7ddf74d13310d3c9a0ef3a0e9b0429ccc104dcd\"\n ),\n }\n\n @classmethod\n def get_image_uri(cls, region, framework, version):\n key = (region, framework, version)\n if key in cls.image_uris:\n return cls.image_uris[key]\n else:\n raise ValueError(\n f\"No matching image found for region: {region}, framework: {framework}, version: {version}\"\n )\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore","title":"ModelCore
","text":" Bases: Artifact
ModelCore: SageWorks ModelCore Class
Common Usage: my_model = ModelCore(model_uuid)\nmy_model.summary()\nmy_model.details()\n
Source code in src/sageworks/core/artifacts/model_core.py
class ModelCore(Artifact):\n \"\"\"ModelCore: SageWorks ModelCore Class\n\n Common Usage:\n ```python\n my_model = ModelCore(model_uuid)\n my_model.summary()\n my_model.details()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, model_type: ModelType = None, **kwargs):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n **kwargs: Additional keyword arguments\n \"\"\"\n\n # Make sure the model name is valid\n self.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(model_uuid, **kwargs)\n\n # Initialize our class attributes\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n self.model_training_path = None\n self.endpoint_inference_path = None\n\n # Grab an Cloud Platform Meta object and pull information for this Model\n self.model_name = model_uuid\n self.model_meta = self.meta.model(self.model_name)\n if self.model_meta is None:\n self.log.warning(f\"Could not find model {self.model_name} within current visibility scope\")\n return\n else:\n # Is this a model package group without any models?\n if len(self.model_meta[\"ModelPackageList\"]) == 0:\n self.log.warning(f\"Model Group {self.model_name} has no Model Packages!\")\n self.latest_model = None\n self.add_health_tag(\"model_not_found\")\n return\n try:\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. 
Delete and recreate it!\")\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n\n def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.meta.model(self.model_name)\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n\n def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n\n def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Check if the model exists\n if self.latest_model is None:\n health_issues.append(\"model_not_found\")\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n needs_metrics = self.model_type in {ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR, ModelType.CLASSIFIER}\n if needs_metrics and self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n\n # Endpoint\n if not self.endpoints():\n health_issues.append(\"no_endpoint\")\n else:\n self.remove_health_tag(\"no_endpoint\")\n return health_issues\n\n def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return 
the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.container_image()\n )\n\n def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference runs\n \"\"\"\n\n # Check if we have a model (if not return empty list)\n if self.latest_model is None:\n return []\n\n # Check if we have model training metrics in our metadata\n have_model_training = True if self.sageworks_meta().get(\"sageworks_training_metrics\") else False\n\n # Now grab the list of directories from our inference path\n inference_runs = []\n if self.endpoint_inference_path:\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the model training to the end of the list\n if have_model_training:\n inference_runs.append(\"model_training\")\n return inference_runs\n\n def delete_inference_run(self, inference_run_uuid: str):\n \"\"\"Delete the inference run for this model\n\n Args:\n inference_run_uuid (str): UUID of the inference run\n \"\"\"\n if inference_run_uuid == \"model_training\":\n self.log.warning(\"Cannot delete model training data!\")\n return\n\n if self.endpoint_inference_path:\n full_path = f\"{self.endpoint_inference_path}/{inference_run_uuid}\"\n # Check if there are any objects at the path\n if wr.s3.list_objects(full_path):\n wr.s3.delete_objects(path=full_path)\n self.log.important(f\"Deleted inference run {inference_run_uuid} for {self.model_name}\")\n else:\n self.log.warning(f\"Inference run {inference_run_uuid} not found for {self.model_name}!\")\n else:\n self.log.warning(f\"No inference data found for {self.model_name}!\")\n\n def get_inference_metrics(self, capture_uuid: str 
= \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try to get the auto_capture 'training_holdout' or the training\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"auto_inference\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). 
Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.confusion_matrix(\"auto_inference\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n\n def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n We're going to not allow this to be used for Models\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! 
It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n\n def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n\n def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.model_meta\n\n def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n\n def group_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.model_meta[\"ModelPackageGroupArn\"] if self.model_meta else None\n\n def model_package_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)\"\"\"\n if self.latest_model is None:\n return None\n return self.latest_model[\"ModelPackageArn\"]\n\n def container_info(self) -> Union[dict, None]:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"InferenceSpecification\"][\"Containers\"][0] if self.latest_model else None\n\n def container_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.container_info()[\"Image\"]\n\n def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this model\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n\n def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n\n def modified(self) -> datetime:\n \"\"\"Return the datetime when this artifact was last modified\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n\n def hash(self) -> Optional[str]:\n \"\"\"Return the hash for this artifact\n\n Returns:\n Optional[str]: The hash for this artifact\n \"\"\"\n model_url = self.get_model_data_url()\n return 
get_s3_etag(model_url, self.boto3_session)\n\n def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # Remove any health tags\n self.remove_health_tag(\"no_endpoint\")\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n def remove_endpoint(self, endpoint_name: str):\n \"\"\"Remove this endpoint from the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Removing Endpoint {endpoint_name} from Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.discard(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # If we have NO endpoints, then set a health tag\n if not registered_endpoints:\n self.add_health_tag(\"no_endpoint\")\n self.details(recompute=True)\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2)\n\n def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n\n def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Get the S3 Path for the Inference Data\n\n Returns:\n str: S3 Path for the Inference Data (or None if not found)\n 
\"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n inference_path = newest_path(endpoint_inference_paths, self.sm_session)\n if inference_path is None:\n self.log.important(f\"No inference data found for {self.model_name}!\")\n self.log.important(f\"Returning default inference path for {registered_endpoints[0]}...\")\n self.log.important(f\"{endpoint_inference_paths[0]}\")\n return endpoint_inference_paths[0]\n else:\n return inference_path\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n\n def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n\n def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n\n def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n\n def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n\n def class_labels(self) -> Union[list[str], None]:\n 
\"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n\n def set_class_labels(self, labels: list[str]):\n \"\"\"Set the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n\n def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n self.log.info(\"Computing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n\n # Sanity check if we have models in the group\n if self.latest_model is None:\n self.log.warning(f\"Model Package Group {self.model_name} has no models!\")\n return details\n\n # Grab the Model Details\n details[\"description\"] = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = self.latest_model[\"ModelPackageVersion\"]\n details[\"status\"] = self.latest_model[\"ModelPackageStatus\"]\n details[\"approval_status\"] = self.latest_model.get(\"ModelApprovalStatus\", \"unknown\")\n details[\"image\"] = self.container_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n inference_spec = self.latest_model[\"InferenceSpecification\"]\n container_info = self.container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n 
details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n elif self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = None\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Return the details\n return details\n\n # Pipeline for this model\n def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n\n def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n\n def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n\n def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n\n def _determine_model_type(self):\n \"\"\"Internal: Determine the Model Type\"\"\"\n model_type = input(\"Model Type? 
(classifier, regressor, quantile_regressor, unsupervised, transformer): \")\n if model_type == \"classifier\":\n self._set_model_type(ModelType.CLASSIFIER)\n elif model_type == \"regressor\":\n self._set_model_type(ModelType.REGRESSOR)\n elif model_type == \"quantile_regressor\":\n self._set_model_type(ModelType.QUANTILE_REGRESSOR)\n elif model_type == \"unsupervised\":\n self._set_model_type(ModelType.UNSUPERVISED)\n elif model_type == \"transformer\":\n self._set_model_type(ModelType.TRANSFORMER)\n else:\n self.log.warning(f\"Unknown Model Type {model_type}!\")\n self._set_model_type(ModelType.UNKNOWN)\n\n def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? 
(use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? (use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Model Class Labels (if it's a classifier)\n if self.model_type == ModelType.CLASSIFIER:\n class_labels = self.class_labels()\n if class_labels is None or ask_everything:\n class_labels = input(\"Class Labels? (use commas): \")\n class_labels = [e.strip() for e in class_labels.split(\",\")]\n if class_labels in [[\"None\"], [\"none\"], [\"\"]]:\n class_labels = None\n self.set_class_labels(class_labels)\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n\n def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n ) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. 
Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n\n def get_model_data_url(self) -> Optional[str]:\n \"\"\"Retrieve the ModelDataUrl from the model's AWS metadata.\n\n Returns:\n Optional[str]: The ModelDataUrl if available, otherwise None.\n \"\"\"\n meta = self.aws_meta()\n try:\n return meta[\"ModelPackageList\"][0][\"InferenceSpecification\"][\"Containers\"][0][\"ModelDataUrl\"]\n except (KeyError, IndexError, TypeError):\n return None\n\n def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete a Model that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Model Group\n ModelCore.managed_delete(model_group_name=self.uuid)\n\n @classmethod\n def managed_delete(cls, model_group_name: str):\n \"\"\"Delete the Model Packages, Model Group, and S3 Storage Objects\n\n Args:\n model_group_name (str): The name of the Model Group to delete\n \"\"\"\n # Check if the model group exists in SageMaker\n try:\n cls.sm_client.describe_model_package_group(ModelPackageGroupName=model_group_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] 
in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Model Group {model_group_name} not found!\")\n return\n else:\n raise # Re-raise unexpected errors\n\n # Delete Model Packages within the Model Group\n try:\n paginator = cls.sm_client.get_paginator(\"list_model_packages\")\n for page in paginator.paginate(ModelPackageGroupName=model_group_name):\n for model_package in page[\"ModelPackageSummaryList\"]:\n package_arn = model_package[\"ModelPackageArn\"]\n cls.log.info(f\"Deleting Model Package {package_arn}...\")\n cls.sm_client.delete_model_package(ModelPackageName=package_arn)\n except ClientError as e:\n cls.log.error(f\"Error while deleting model packages: {e}\")\n raise\n\n # Delete the Model Package Group\n cls.log.info(f\"Deleting Model Group {model_group_name}...\")\n cls.sm_client.delete_model_package_group(ModelPackageGroupName=model_group_name)\n\n # Delete S3 training artifacts\n s3_delete_path = f\"{cls.models_s3_path}/training/{model_group_name}/\"\n cls.log.info(f\"Deleting S3 Objects at {s3_delete_path}...\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(model_group_name)\n\n def _set_model_type(self, model_type: ModelType):\n \"\"\"Internal: Set the Model Type for this Model\"\"\"\n self.model_type = model_type\n self.upsert_sageworks_meta({\"sageworks_model_type\": self.model_type.value})\n self.remove_health_tag(\"model_type_unknown\")\n\n def _get_model_type(self) -> ModelType:\n \"\"\"Internal: Query the SageWorks Metadata to get the model type\n Returns:\n ModelType: The ModelType of this Model\n Notes:\n This is an internal method that should not be called directly\n Use the model_type attribute instead\n \"\"\"\n model_type = self.sageworks_meta().get(\"sageworks_model_type\")\n try:\n return ModelType(model_type)\n except ValueError:\n 
self.log.warning(f\"Could not determine model type for {self.model_name}!\")\n return ModelType.UNKNOWN\n\n def _load_training_metrics(self):\n \"\"\"Internal: Retrieve the training metrics and Confusion Matrix for this model\n and load the data into the SageWorks Metadata\n\n Notes:\n This may or may not exist based on whether we have access to TrainingJobAnalytics\n \"\"\"\n try:\n df = TrainingJobAnalytics(training_job_name=self.training_job_name).dataframe()\n if df.empty:\n self.log.important(f\"No training job metrics found for {self.training_job_name}\")\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n if \"timestamp\" in df.columns:\n df = df.drop(columns=[\"timestamp\"])\n\n # We're going to pivot the DataFrame to get the desired structure\n reg_metrics_df = df.set_index(\"metric_name\").T\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": reg_metrics_df.to_dict(), \"sageworks_training_cm\": None}\n )\n return\n\n except (KeyError, botocore.exceptions.ClientError):\n self.log.important(f\"No training job metrics found for {self.training_job_name}\")\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta({\"sageworks_training_metrics\": None, \"sageworks_training_cm\": None})\n return\n\n # We need additional processing for classification metrics\n if self.model_type == ModelType.CLASSIFIER:\n metrics_df, cm_df = self._process_classification_metrics(df)\n\n # Store and return the metrics in the SageWorks Metadata\n self.upsert_sageworks_meta(\n {\"sageworks_training_metrics\": metrics_df.to_dict(), \"sageworks_training_cm\": cm_df.to_dict()}\n )\n\n def _load_inference_metrics(self, capture_uuid: str = \"auto_inference\"):\n \"\"\"Internal: Retrieve the inference model metrics for this model\n and load the data 
into the SageWorks Metadata\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n Notes:\n This may or may not exist based on whether an Endpoint ran Inference\n \"\"\"\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n inference_metrics = pull_s3_data(s3_path)\n\n # Store data into the SageWorks Metadata\n metrics_storage = None if inference_metrics is None else inference_metrics.to_dict(\"records\")\n self.upsert_sageworks_meta({\"sageworks_inference_metrics\": metrics_storage})\n\n def get_inference_metadata(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n pd.DataFrame: DataFrame of the inference metadata (might be None)\n Notes:\n The metadata records when Endpoint inference was run, the name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n\n def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n pd.DataFrame: 
DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Sanity check that the model should have predictions\n has_predictions = self.model_type in [ModelType.CLASSIFIER, ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]\n if not has_predictions:\n self.log.warning(f\"No Predictions for {self.model_name}...\")\n return None\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n\n def _get_validation_predictions(self) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Retrieve the captured prediction results for this model\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Validation Predictions (might be None)\n \"\"\"\n # Sanity check the training path (which may or may not exist)\n if self.model_training_path is None:\n self.log.warning(f\"No Validation Predictions for {self.model_name}...\")\n return None\n self.log.important(f\"Grabbing Validation Predictions for {self.model_name}...\")\n s3_path = f\"{self.model_training_path}/validation_predictions.csv\"\n df = pull_s3_data(s3_path)\n return df\n\n def _extract_training_job_name(self) -> Union[str, None]:\n \"\"\"Internal: Extract the training job name from the ModelDataUrl\"\"\"\n try:\n model_data_url = self.container_info()[\"ModelDataUrl\"]\n parsed_url = urllib.parse.urlparse(model_data_url)\n training_job_name = parsed_url.path.lstrip(\"/\").split(\"/\")[0]\n return training_job_name\n except KeyError:\n self.log.warning(f\"Could not extract training job name from {model_data_url}\")\n return None\n\n @staticmethod\n def _process_classification_metrics(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"Internal: Process classification metrics into a 
more reasonable format\n Args:\n df (pd.DataFrame): DataFrame of training metrics\n Returns:\n (pd.DataFrame, pd.DataFrame): Tuple of DataFrames. Metrics and confusion matrix\n \"\"\"\n # Split into two DataFrames based on 'metric_name'\n metrics_df = df[df[\"metric_name\"].str.startswith(\"Metrics:\")].copy()\n cm_df = df[df[\"metric_name\"].str.startswith(\"ConfusionMatrix:\")].copy()\n\n # Split the 'metric_name' into different parts\n metrics_df[\"class\"] = metrics_df[\"metric_name\"].str.split(\":\").str[1]\n metrics_df[\"metric_type\"] = metrics_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to get the desired structure\n metrics_df = metrics_df.pivot(index=\"class\", columns=\"metric_type\", values=\"value\").reset_index()\n metrics_df = metrics_df.rename_axis(None, axis=1)\n\n # Now process the confusion matrix\n cm_df[\"row_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[1]\n cm_df[\"col_class\"] = cm_df[\"metric_name\"].str.split(\":\").str[2]\n\n # Pivot the DataFrame to create a form suitable for the heatmap\n cm_df = cm_df.pivot(index=\"row_class\", columns=\"col_class\", values=\"value\")\n\n # Convert the values in cm_df to integers\n cm_df = cm_df.astype(int)\n\n return metrics_df, cm_df\n\n def shapley_values(self, capture_uuid: str = \"auto_inference\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapley values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n pd.DataFrame: DataFrame(s) of the Shapley values or None if not found\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == 
ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.__init__","title":"__init__(model_uuid, model_type=None, **kwargs)
","text":"ModelCore Initialization Args: model_uuid (str): Name of Model in SageWorks. model_type (ModelType, optional): Set this for newly created Models. Defaults to None. **kwargs: Additional keyword arguments
Source code in src/sageworks/core/artifacts/model_core.py
def __init__(self, model_uuid: str, model_type: ModelType = None, **kwargs):\n \"\"\"ModelCore Initialization\n Args:\n model_uuid (str): Name of Model in SageWorks.\n model_type (ModelType, optional): Set this for newly created Models. Defaults to None.\n **kwargs: Additional keyword arguments\n \"\"\"\n\n # Make sure the model name is valid\n self.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call SuperClass Initialization\n super().__init__(model_uuid, **kwargs)\n\n # Initialize our class attributes\n self.latest_model = None\n self.model_type = ModelType.UNKNOWN\n self.model_training_path = None\n self.endpoint_inference_path = None\n\n # Grab a Cloud Platform Meta object and pull information for this Model\n self.model_name = model_uuid\n self.model_meta = self.meta.model(self.model_name)\n if self.model_meta is None:\n self.log.warning(f\"Could not find model {self.model_name} within current visibility scope\")\n return\n else:\n # Is this a model package group without any models?\n if len(self.model_meta[\"ModelPackageList\"]) == 0:\n self.log.warning(f\"Model Group {self.model_name} has no Model Packages!\")\n self.latest_model = None\n self.add_health_tag(\"model_not_found\")\n return\n try:\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n if model_type:\n self._set_model_type(model_type)\n else:\n self.model_type = self._get_model_type()\n except (IndexError, KeyError):\n self.log.critical(f\"Model {self.model_name} appears to be malformed. 
Delete and recreate it!\")\n return\n\n # Set the Model Training S3 Path\n self.model_training_path = self.models_s3_path + \"/training/\" + self.model_name\n\n # Get our Endpoint Inference Path (might be None)\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n\n # Call SuperClass Post Initialization\n super().__post_init__()\n\n # All done\n self.log.info(f\"Model Initialized: {self.model_name}\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.arn","title":"arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code in src/sageworks/core/artifacts/model_core.py
def arn(self) -> str:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.group_arn()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_meta","title":"aws_meta()
","text":"Get ALL the AWS metadata for this artifact
Source code in src/sageworks/core/artifacts/model_core.py
def aws_meta(self) -> dict:\n \"\"\"Get ALL the AWS metadata for this artifact\"\"\"\n return self.model_meta\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.aws_url","title":"aws_url()
","text":"The AWS URL for looking at/querying this model
Source code in src/sageworks/core/artifacts/model_core.py
def aws_url(self):\n \"\"\"The AWS URL for looking at/querying this model\"\"\"\n return f\"https://{self.aws_region}.console.aws.amazon.com/athena/home\"\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.class_labels","title":"class_labels()
","text":"Return the class labels for this Model (if it's a classifier)
Returns:
Type Description Union[list[str], None]
list[str]: List of class labels
Source code in src/sageworks/core/artifacts/model_core.py
def class_labels(self) -> Union[list[str], None]:\n \"\"\"Return the class labels for this Model (if it's a classifier)\n\n Returns:\n list[str]: List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n return self.sageworks_meta().get(\"class_labels\") # Returns None if not found\n else:\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.confusion_matrix","title":"confusion_matrix(capture_uuid='latest')
","text":"Retrieve the confusion_matrix for this model
Parameters:
Name Type Description Default capture_uuid
str
Specific capture_uuid or \"model_training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Confusion Matrix (might be None)
Source code in src/sageworks/core/artifacts/model_core.py
def confusion_matrix(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the confusion_matrix for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"model_training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Confusion Matrix (might be None)\n \"\"\"\n\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n # Grab the metrics from the SageWorks Metadata (try inference first, then training)\n if capture_uuid == \"latest\":\n cm = self.confusion_matrix(\"auto_inference\")\n return cm if cm is not None else self.confusion_matrix(\"model_training\")\n\n # Grab the confusion matrix captured during model training (could return None)\n if capture_uuid == \"model_training\":\n cm = self.sageworks_meta().get(\"sageworks_training_cm\")\n return pd.DataFrame.from_dict(cm) if cm else None\n\n else: # Specific capture_uuid\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_cm.csv\"\n cm = pull_s3_data(s3_path, embedded_index=True)\n if cm is not None:\n return cm\n else:\n self.log.warning(f\"Confusion Matrix {capture_uuid} not found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.container_image","title":"container_image()
","text":"Container Image for the Latest Model Package
Source code in src/sageworks/core/artifacts/model_core.py
def container_image(self) -> str:\n \"\"\"Container Image for the Latest Model Package\"\"\"\n return self.container_info()[\"Image\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.container_info","title":"container_info()
","text":"Container Info for the Latest Model Package
Source code in src/sageworks/core/artifacts/model_core.py
def container_info(self) -> Union[dict, None]:\n \"\"\"Container Info for the Latest Model Package\"\"\"\n return self.latest_model[\"InferenceSpecification\"][\"Containers\"][0] if self.latest_model else None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.created","title":"created()
","text":"Return the datetime when this artifact was created
Source code in src/sageworks/core/artifacts/model_core.py
def created(self) -> datetime:\n \"\"\"Return the datetime when this artifact was created\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete","title":"delete()
","text":"Delete the Model Packages and the Model Group
Source code in src/sageworks/core/artifacts/model_core.py
def delete(self):\n \"\"\"Delete the Model Packages and the Model Group\"\"\"\n if not self.exists():\n self.log.warning(f\"Trying to delete a Model that doesn't exist: {self.uuid}\")\n\n # Call the Class Method to delete the Model Group\n ModelCore.managed_delete(model_group_name=self.uuid)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.delete_inference_run","title":"delete_inference_run(inference_run_uuid)
","text":"Delete the inference run for this model
Parameters:
Name Type Description Default inference_run_uuid
str
UUID of the inference run
required Source code in src/sageworks/core/artifacts/model_core.py
def delete_inference_run(self, inference_run_uuid: str):\n \"\"\"Delete the inference run for this model\n\n Args:\n inference_run_uuid (str): UUID of the inference run\n \"\"\"\n if inference_run_uuid == \"model_training\":\n self.log.warning(\"Cannot delete model training data!\")\n return\n\n if self.endpoint_inference_path:\n full_path = f\"{self.endpoint_inference_path}/{inference_run_uuid}\"\n # Check if there are any objects at the path\n if wr.s3.list_objects(full_path):\n wr.s3.delete_objects(path=full_path)\n self.log.important(f\"Deleted inference run {inference_run_uuid} for {self.model_name}\")\n else:\n self.log.warning(f\"Inference run {inference_run_uuid} not found for {self.model_name}!\")\n else:\n self.log.warning(f\"No inference data found for {self.model_name}!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.details","title":"details(recompute=False)
","text":"Additional Details about this Model Args: recompute (bool, optional): Recompute the details (default: False) Returns: dict: Dictionary of details about this Model
Source code in src/sageworks/core/artifacts/model_core.py
def details(self, recompute=False) -> dict:\n \"\"\"Additional Details about this Model\n Args:\n recompute (bool, optional): Recompute the details (default: False)\n Returns:\n dict: Dictionary of details about this Model\n \"\"\"\n self.log.info(\"Computing Model Details...\")\n details = self.summary()\n details[\"pipeline\"] = self.get_pipeline()\n details[\"model_type\"] = self.model_type.value\n details[\"model_package_group_arn\"] = self.group_arn()\n details[\"model_package_arn\"] = self.model_package_arn()\n\n # Sanity check if we have models in the group\n if self.latest_model is None:\n self.log.warning(f\"Model Package Group {self.model_name} has no models!\")\n return details\n\n # Grab the Model Details\n details[\"description\"] = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n details[\"version\"] = self.latest_model[\"ModelPackageVersion\"]\n details[\"status\"] = self.latest_model[\"ModelPackageStatus\"]\n details[\"approval_status\"] = self.latest_model.get(\"ModelApprovalStatus\", \"unknown\")\n details[\"image\"] = self.container_image().split(\"/\")[-1] # Shorten the image uri\n\n # Grab the inference and container info\n inference_spec = self.latest_model[\"InferenceSpecification\"]\n container_info = self.container_info()\n details[\"framework\"] = container_info.get(\"Framework\", \"unknown\")\n details[\"framework_version\"] = container_info.get(\"FrameworkVersion\", \"unknown\")\n details[\"inference_types\"] = inference_spec[\"SupportedRealtimeInferenceInstanceTypes\"]\n details[\"transform_types\"] = inference_spec[\"SupportedTransformInstanceTypes\"]\n details[\"content_types\"] = inference_spec[\"SupportedContentTypes\"]\n details[\"response_types\"] = inference_spec[\"SupportedResponseMIMETypes\"]\n details[\"model_metrics\"] = self.get_inference_metrics()\n if self.model_type == ModelType.CLASSIFIER:\n details[\"confusion_matrix\"] = self.confusion_matrix()\n details[\"predictions\"] = None\n elif self.model_type in 
[ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = self.get_inference_predictions()\n else:\n details[\"confusion_matrix\"] = None\n details[\"predictions\"] = None\n\n # Grab the inference metadata\n details[\"inference_meta\"] = self.get_inference_metadata()\n\n # Return the details\n return details\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.endpoints","title":"endpoints()
","text":"Get the list of registered endpoints for this Model
Returns:
Type Descriptionlist[str]
list[str]: List of registered endpoints
Source code in src/sageworks/core/artifacts/model_core.py
def endpoints(self) -> list[str]:\n \"\"\"Get the list of registered endpoints for this Model\n\n Returns:\n list[str]: List of registered endpoints\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_registered_endpoints\", [])\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.exists","title":"exists()
","text":"Does the model metadata exist in the AWS Metadata?
Source code in src/sageworks/core/artifacts/model_core.py
def exists(self) -> bool:\n \"\"\"Does the model metadata exist in the AWS Metadata?\"\"\"\n if self.model_meta is None:\n self.log.info(f\"Model {self.model_name} not found in AWS Metadata!\")\n return False\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.expected_meta","title":"expected_meta()
","text":"Metadata we expect to see for this Model when it's ready Returns: list[str]: List of expected metadata keys
Source code in src/sageworks/core/artifacts/model_core.py
def expected_meta(self) -> list[str]:\n \"\"\"Metadata we expect to see for this Model when it's ready\n Returns:\n list[str]: List of expected metadata keys\n \"\"\"\n # Our current list of expected metadata, we can add to this as needed\n return [\"sageworks_status\", \"sageworks_training_metrics\", \"sageworks_training_cm\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.features","title":"features()
","text":"Return a list of features used for this Model
Returns:
Type DescriptionUnion[list[str], None]
list[str]: List of features used for this Model
Source code in src/sageworks/core/artifacts/model_core.py
def features(self) -> Union[list[str], None]:\n \"\"\"Return a list of features used for this Model\n\n Returns:\n list[str]: List of features used for this Model\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_features\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_endpoint_inference_path","title":"get_endpoint_inference_path()
","text":"Get the S3 Path for the Inference Data
Returns:
Name Type Descriptionstr
Union[str, None]
S3 Path for the Inference Data (or None if not found)
Source code in src/sageworks/core/artifacts/model_core.py
def get_endpoint_inference_path(self) -> Union[str, None]:\n \"\"\"Get the S3 Path for the Inference Data\n\n Returns:\n str: S3 Path for the Inference Data (or None if not found)\n \"\"\"\n\n # Look for any Registered Endpoints\n registered_endpoints = self.sageworks_meta().get(\"sageworks_registered_endpoints\")\n\n # Note: We may have 0 to N endpoints, so we find the one with the most recent artifacts\n if registered_endpoints:\n endpoint_inference_base = self.endpoints_s3_path + \"/inference/\"\n endpoint_inference_paths = [endpoint_inference_base + e for e in registered_endpoints]\n inference_path = newest_path(endpoint_inference_paths, self.sm_session)\n if inference_path is None:\n self.log.important(f\"No inference data found for {self.model_name}!\")\n self.log.important(f\"Returning default inference path for {registered_endpoints[0]}...\")\n self.log.important(f\"{endpoint_inference_paths[0]}\")\n return endpoint_inference_paths[0]\n else:\n return inference_path\n else:\n self.log.warning(f\"No registered endpoints found for {self.model_name}!\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metadata","title":"get_inference_metadata(capture_uuid='auto_inference')
","text":"Retrieve the inference metadata for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
A specific capture_uuid (default: \"auto_inference\")
'auto_inference'
Returns:
Name Type Description
DataFrame
Union[DataFrame, None]
DataFrame of the inference metadata (None if there is no inference path or no metadata file)
Notes: The metadata records when Endpoint inference was run, the name of the dataset, the MD5, etc.
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_metadata(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference metadata for this model\n\n Args:\n capture_uuid (str, optional): A specific capture_uuid (default: \"auto_inference\")\n\n Returns:\n pd.DataFrame: DataFrame of the inference metadata (might be None)\n Notes:\n The metadata records when Endpoint inference was run, the name of the dataset, the MD5, etc\n \"\"\"\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Check for model_training capture_uuid\n if capture_uuid == \"model_training\":\n # Create a DataFrame with the training metadata\n meta_df = pd.DataFrame(\n [\n {\n \"name\": \"AWS Training Capture\",\n \"data_hash\": \"N/A\",\n \"num_rows\": \"-\",\n \"description\": \"-\",\n }\n ]\n )\n return meta_df\n\n # Pull the inference metadata\n try:\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_meta.json\"\n return wr.s3.read_json(s3_path)\n except NoFilesFound:\n self.log.info(f\"Could not find model inference meta at {s3_path}...\")\n return None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_metrics","title":"get_inference_metrics(capture_uuid='latest')
","text":"Retrieve the inference performance metrics for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid or \"model_training\" (default: \"latest\")
'latest'
Returns: pd.DataFrame: DataFrame of the Model Metrics (might be None)
Note: If a capture_uuid isn't specified, this tries the auto_inference capture first and then falls back to the model training metrics
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_metrics(self, capture_uuid: str = \"latest\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the inference performance metrics for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid or \"model_training\" (default: \"latest\")\n Returns:\n pd.DataFrame: DataFrame of the Model Metrics (might be None)\n\n Note:\n If a capture_uuid isn't specified this will try to return something reasonable\n \"\"\"\n # Try the 'auto_inference' capture first, then fall back to the model training metrics\n if capture_uuid == \"latest\":\n metrics_df = self.get_inference_metrics(\"auto_inference\")\n return metrics_df if metrics_df is not None else self.get_inference_metrics(\"model_training\")\n\n # Grab the metrics captured during model training (could return None)\n if capture_uuid == \"model_training\":\n # Sanity check the sageworks metadata\n if self.sageworks_meta() is None:\n error_msg = f\"Model {self.model_name} has no sageworks_meta(). Either onboard() or delete this model!\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n metrics = self.sageworks_meta().get(\"sageworks_training_metrics\")\n return pd.DataFrame.from_dict(metrics) if metrics else None\n\n else: # Specific capture_uuid (could return None)\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_metrics.csv\"\n metrics = pull_s3_data(s3_path, embedded_index=True)\n if metrics is not None:\n return metrics\n else:\n self.log.warning(f\"Performance metrics {capture_uuid} not found for {self.model_name}!\")\n return None\n
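The fallback used for the 'latest' capture can be sketched in isolation. This is a minimal illustration, not the SageWorks implementation: `resolve_metrics` and the `CAPTURED_METRICS` dictionary are hypothetical stand-ins for the S3 and metadata lookups shown above.

```python
from typing import Optional

# Hypothetical in-memory stand-in for metrics stored in S3 / model metadata
CAPTURED_METRICS = {
    'model_training': {'rmse': 0.52},
}


def resolve_metrics(capture_uuid: str = 'latest', store: Optional[dict] = None) -> Optional[dict]:
    # Use the hypothetical default store unless the caller passes one in
    store = CAPTURED_METRICS if store is None else store
    if capture_uuid == 'latest':
        # Prefer the auto_inference capture, then fall back to model_training
        metrics = resolve_metrics('auto_inference', store)
        return metrics if metrics is not None else resolve_metrics('model_training', store)
    # A specific capture_uuid either exists in the store or yields None
    return store.get(capture_uuid)
```

Callers of the real method should likewise be prepared for a None result, since both the inference capture and the training metrics may be missing.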
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_inference_predictions","title":"get_inference_predictions(capture_uuid='auto_inference')
","text":"Retrieve the captured prediction results for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: auto_inference)
'auto_inference'
Returns:
Type DescriptionUnion[DataFrame, None]
pd.DataFrame: DataFrame of the Captured Predictions (might be None)
Source code in src/sageworks/core/artifacts/model_core.py
def get_inference_predictions(self, capture_uuid: str = \"auto_inference\") -> Union[pd.DataFrame, None]:\n \"\"\"Retrieve the captured prediction results for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: DataFrame of the Captured Predictions (might be None)\n \"\"\"\n self.log.important(f\"Grabbing {capture_uuid} predictions for {self.model_name}...\")\n\n # Sanity check that the model should have predictions\n has_predictions = self.model_type in [ModelType.CLASSIFIER, ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]\n if not has_predictions:\n self.log.warning(f\"No Predictions for {self.model_name}...\")\n return None\n\n # Special case for model_training\n if capture_uuid == \"model_training\":\n return self._get_validation_predictions()\n\n # Construct the S3 path for the Inference Predictions\n s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}/inference_predictions.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_model_data_url","title":"get_model_data_url()
","text":"Retrieve the ModelDataUrl from the model's AWS metadata.
Returns:
Type DescriptionOptional[str]
Optional[str]: The ModelDataUrl if available, otherwise None.
Source code in src/sageworks/core/artifacts/model_core.py
def get_model_data_url(self) -> Optional[str]:\n \"\"\"Retrieve the ModelDataUrl from the model's AWS metadata.\n\n Returns:\n Optional[str]: The ModelDataUrl if available, otherwise None.\n \"\"\"\n meta = self.aws_meta()\n try:\n return meta[\"ModelPackageList\"][0][\"InferenceSpecification\"][\"Containers\"][0][\"ModelDataUrl\"]\n except (KeyError, IndexError, TypeError):\n return None\n
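The nested lookup above can be tried against a plain dictionary. This is a sketch under assumptions: `extract_model_data_url` is a hypothetical free function mirroring the method body, and the sample metadata and S3 URL are invented for illustration.

```python
from typing import Optional


def extract_model_data_url(meta: Optional[dict]) -> Optional[str]:
    # Walk the nested structure; any missing key, empty list, or None input yields None
    try:
        return meta['ModelPackageList'][0]['InferenceSpecification']['Containers'][0]['ModelDataUrl']
    except (KeyError, IndexError, TypeError):
        return None


# Invented sample metadata, shaped like the keys used in the lookup
sample_meta = {
    'ModelPackageList': [
        {'InferenceSpecification': {'Containers': [{'ModelDataUrl': 's3://example-bucket/model.tar.gz'}]}}
    ]
}
```

Catching TypeError alongside KeyError and IndexError is what lets a None metadata object pass through without raising.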
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.get_pipeline","title":"get_pipeline()
","text":"Get the pipeline for this model
Source code in src/sageworks/core/artifacts/model_core.py
def get_pipeline(self) -> str:\n \"\"\"Get the pipeline for this model\"\"\"\n return self.sageworks_meta().get(\"sageworks_pipeline\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.group_arn","title":"group_arn()
","text":"AWS ARN (Amazon Resource Name) for the Model Package Group
Source code in src/sageworks/core/artifacts/model_core.py
def group_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Model Package Group\"\"\"\n return self.model_meta[\"ModelPackageGroupArn\"] if self.model_meta else None\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.hash","title":"hash()
","text":"Return the hash for this artifact
Returns:
Type DescriptionOptional[str]
Optional[str]: The hash for this artifact
Source code in src/sageworks/core/artifacts/model_core.py
def hash(self) -> Optional[str]:\n \"\"\"Return the hash for this artifact\n\n Returns:\n Optional[str]: The hash for this artifact\n \"\"\"\n model_url = self.get_model_data_url()\n return get_s3_etag(model_url, self.boto3_session)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.health_check","title":"health_check()
","text":"Perform a health check on this model Returns: list[str]: List of health issues
Source code in src/sageworks/core/artifacts/model_core.py
def health_check(self) -> list[str]:\n \"\"\"Perform a health check on this model\n Returns:\n list[str]: List of health issues\n \"\"\"\n # Call the base class health check\n health_issues = super().health_check()\n\n # Check if the model exists\n if self.latest_model is None:\n health_issues.append(\"model_not_found\")\n\n # Model Type\n if self._get_model_type() == ModelType.UNKNOWN:\n health_issues.append(\"model_type_unknown\")\n else:\n self.remove_health_tag(\"model_type_unknown\")\n\n # Model Performance Metrics\n needs_metrics = self.model_type in {ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR, ModelType.CLASSIFIER}\n if needs_metrics and self.get_inference_metrics() is None:\n health_issues.append(\"metrics_needed\")\n else:\n self.remove_health_tag(\"metrics_needed\")\n\n # Endpoint\n if not self.endpoints():\n health_issues.append(\"no_endpoint\")\n else:\n self.remove_health_tag(\"no_endpoint\")\n return health_issues\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.is_model_unknown","title":"is_model_unknown()
","text":"Is the Model Type unknown?
Source code in src/sageworks/core/artifacts/model_core.py
def is_model_unknown(self) -> bool:\n \"\"\"Is the Model Type unknown?\"\"\"\n return self.model_type == ModelType.UNKNOWN\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.latest_model_object","title":"latest_model_object()
","text":"Return the latest AWS Sagemaker Model object for this SageWorks Model
Returns:
Type DescriptionModel
sagemaker.model.Model: AWS Sagemaker Model object
Source code in src/sageworks/core/artifacts/model_core.py
def latest_model_object(self) -> SagemakerModel:\n \"\"\"Return the latest AWS Sagemaker Model object for this SageWorks Model\n\n Returns:\n sagemaker.model.Model: AWS Sagemaker Model object\n \"\"\"\n return SagemakerModel(\n model_data=self.model_package_arn(), sagemaker_session=self.sm_session, image_uri=self.container_image()\n )\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.list_inference_runs","title":"list_inference_runs()
","text":"List the inference runs for this model
Returns:
Type Descriptionlist[str]
list[str]: List of inference runs
Source code in src/sageworks/core/artifacts/model_core.py
def list_inference_runs(self) -> list[str]:\n \"\"\"List the inference runs for this model\n\n Returns:\n list[str]: List of inference runs\n \"\"\"\n\n # Check if we have a model (if not return empty list)\n if self.latest_model is None:\n return []\n\n # Check if we have model training metrics in our metadata\n have_model_training = True if self.sageworks_meta().get(\"sageworks_training_metrics\") else False\n\n # Now grab the list of directories from our inference path\n inference_runs = []\n if self.endpoint_inference_path:\n directories = wr.s3.list_directories(path=self.endpoint_inference_path + \"/\")\n inference_runs = [urlparse(directory).path.split(\"/\")[-2] for directory in directories]\n\n # We're going to add the model training to the end of the list\n if have_model_training:\n inference_runs.append(\"model_training\")\n return inference_runs\n
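The way run names are pulled out of the S3 directory listing can be sketched without AWS access. This assumes each directory URI ends with a trailing slash; `inference_run_names` and the example bucket path are hypothetical.

```python
from urllib.parse import urlparse


def inference_run_names(directories: list[str], have_model_training: bool = False) -> list[str]:
    # With a trailing slash the last path segment is empty, so the run name sits at index -2
    runs = [urlparse(directory).path.split('/')[-2] for directory in directories]
    # Model training is appended last, as in the method above
    if have_model_training:
        runs.append('model_training')
    return runs


# Invented example listing
example_dirs = ['s3://example-bucket/endpoints/inference/my-endpoint/auto_inference/']
```

If a listing returned URIs without the trailing slash, index -2 would pick the parent folder instead, so that assumption matters.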
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.managed_delete","title":"managed_delete(model_group_name)
classmethod
","text":"Delete the Model Packages, Model Group, and S3 Storage Objects
Parameters:
Name Type Description Defaultmodel_group_name
str
The name of the Model Group to delete
required Source code in src/sageworks/core/artifacts/model_core.py
@classmethod\ndef managed_delete(cls, model_group_name: str):\n \"\"\"Delete the Model Packages, Model Group, and S3 Storage Objects\n\n Args:\n model_group_name (str): The name of the Model Group to delete\n \"\"\"\n # Check if the model group exists in SageMaker\n try:\n cls.sm_client.describe_model_package_group(ModelPackageGroupName=model_group_name)\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] in [\"ValidationException\", \"ResourceNotFound\"]:\n cls.log.info(f\"Model Group {model_group_name} not found!\")\n return\n else:\n raise # Re-raise unexpected errors\n\n # Delete Model Packages within the Model Group\n try:\n paginator = cls.sm_client.get_paginator(\"list_model_packages\")\n for page in paginator.paginate(ModelPackageGroupName=model_group_name):\n for model_package in page[\"ModelPackageSummaryList\"]:\n package_arn = model_package[\"ModelPackageArn\"]\n cls.log.info(f\"Deleting Model Package {package_arn}...\")\n cls.sm_client.delete_model_package(ModelPackageName=package_arn)\n except ClientError as e:\n cls.log.error(f\"Error while deleting model packages: {e}\")\n raise\n\n # Delete the Model Package Group\n cls.log.info(f\"Deleting Model Group {model_group_name}...\")\n cls.sm_client.delete_model_package_group(ModelPackageGroupName=model_group_name)\n\n # Delete S3 training artifacts\n s3_delete_path = f\"{cls.models_s3_path}/training/{model_group_name}/\"\n cls.log.info(f\"Deleting S3 Objects at {s3_delete_path}...\")\n wr.s3.delete_objects(s3_delete_path, boto3_session=cls.boto3_session)\n\n # Delete any dataframes that were stored in the Dataframe Cache\n cls.log.info(\"Deleting Dataframe Cache...\")\n cls.df_cache.delete_recursive(model_group_name)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.model_package_arn","title":"model_package_arn()
","text":"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)
Source code in src/sageworks/core/artifacts/model_core.py
def model_package_arn(self) -> Union[str, None]:\n \"\"\"AWS ARN (Amazon Resource Name) for the Latest Model Package (within the Group)\"\"\"\n if self.latest_model is None:\n return None\n return self.latest_model[\"ModelPackageArn\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.modified","title":"modified()
","text":"Return the datetime when this artifact was last modified
Source code in src/sageworks/core/artifacts/model_core.py
def modified(self) -> Union[datetime, str]:\n \"\"\"Return the datetime when this artifact was last modified (or \"-\" if there is no model)\"\"\"\n if self.latest_model is None:\n return \"-\"\n return self.latest_model[\"CreationTime\"]\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard","title":"onboard(ask_everything=False)
","text":"This is an interactive method that will onboard the Model (make it ready)
Parameters:
Name Type Description Defaultask_everything
bool
Ask for all the details. Defaults to False.
False
Returns:
Name Type Descriptionbool
bool
True if the Model is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/model_core.py
def onboard(self, ask_everything=False) -> bool:\n \"\"\"This is an interactive method that will onboard the Model (make it ready)\n\n Args:\n ask_everything (bool, optional): Ask for all the details. Defaults to False.\n\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Determine the Model Type\n while self.is_model_unknown():\n self._determine_model_type()\n\n # Is our input data set?\n if self.get_input() in [\"\", \"unknown\"] or ask_everything:\n input_data = input(\"Input Data?: \")\n if input_data not in [\"None\", \"none\", \"\", \"unknown\"]:\n self.set_input(input_data)\n\n # Determine the Target Column (can be None)\n target_column = self.target()\n if target_column is None or ask_everything:\n target_column = input(\"Target Column? (for unsupervised/transformer just type None): \")\n if target_column in [\"None\", \"none\", \"\"]:\n target_column = None\n\n # Determine the Feature Columns\n feature_columns = self.features()\n if feature_columns is None or ask_everything:\n feature_columns = input(\"Feature Columns? (use commas): \")\n feature_columns = [e.strip() for e in feature_columns.split(\",\")]\n if feature_columns in [[\"None\"], [\"none\"], [\"\"]]:\n feature_columns = None\n\n # Registered Endpoints?\n endpoints = self.endpoints()\n if not endpoints or ask_everything:\n endpoints = input(\"Register Endpoints? 
(use commas for multiple): \")\n endpoints = [e.strip() for e in endpoints.split(\",\")]\n if endpoints in [[\"None\"], [\"none\"], [\"\"]]:\n endpoints = None\n\n # Model Owner?\n owner = self.get_owner()\n if owner in [None, \"unknown\"] or ask_everything:\n owner = input(\"Model Owner: \")\n if owner in [\"None\", \"none\", \"\"]:\n owner = \"unknown\"\n\n # Model Class Labels (if it's a classifier)\n if self.model_type == ModelType.CLASSIFIER:\n class_labels = self.class_labels()\n if class_labels is None or ask_everything:\n class_labels = input(\"Class Labels? (use commas): \")\n class_labels = [e.strip() for e in class_labels.split(\",\")]\n if class_labels in [[\"None\"], [\"none\"], [\"\"]]:\n class_labels = None\n self.set_class_labels(class_labels)\n\n # Now that we have all the details, let's onboard the Model with all the args\n return self.onboard_with_args(self.model_type, target_column, feature_columns, endpoints, owner)\n
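The interactive prompts above all parse comma-separated answers the same way. A minimal sketch of that parsing step follows; `parse_list_answer` is a hypothetical helper, not part of SageWorks.

```python
from typing import Optional


def parse_list_answer(answer: str) -> Optional[list[str]]:
    # Split on commas and trim whitespace around each entry
    items = [e.strip() for e in answer.split(',')]
    # A lone 'None', 'none', or empty answer means no value was given
    if items in [['None'], ['none'], ['']]:
        return None
    return items
```

For instance, the answer 'a, b ,c' gives three trimmed entries, while an empty answer or 'none' gives None.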
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.onboard_with_args","title":"onboard_with_args(model_type, target_column=None, feature_list=None, endpoints=None, owner=None)
","text":"Onboard the Model with the given arguments
Parameters:
Name Type Description Defaultmodel_type
ModelType
Model Type
requiredtarget_column
str
Target Column
None
feature_list
list
List of Feature Columns
None
endpoints
list
List of Endpoints. Defaults to None.
None
owner
str
Model Owner. Defaults to None.
None
Returns: bool: True if the Model is successfully onboarded, False otherwise
Source code in src/sageworks/core/artifacts/model_core.py
def onboard_with_args(\n self,\n model_type: ModelType,\n target_column: str = None,\n feature_list: list = None,\n endpoints: list = None,\n owner: str = None,\n) -> bool:\n \"\"\"Onboard the Model with the given arguments\n\n Args:\n model_type (ModelType): Model Type\n target_column (str): Target Column\n feature_list (list): List of Feature Columns\n endpoints (list, optional): List of Endpoints. Defaults to None.\n owner (str, optional): Model Owner. Defaults to None.\n Returns:\n bool: True if the Model is successfully onboarded, False otherwise\n \"\"\"\n # Set the status to onboarding\n self.set_status(\"onboarding\")\n\n # Set All the Details\n self._set_model_type(model_type)\n if target_column:\n self.set_target(target_column)\n if feature_list:\n self.set_features(feature_list)\n if endpoints:\n for endpoint in endpoints:\n self.register_endpoint(endpoint)\n if owner:\n self.set_owner(owner)\n\n # Load the training metrics and inference metrics\n self._load_training_metrics()\n self._load_inference_metrics()\n\n # Remove the needs_onboard tag\n self.remove_health_tag(\"needs_onboard\")\n self.set_status(\"ready\")\n\n # Run a health check and refresh the meta\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.health_check()\n self.refresh_meta()\n self.details(recompute=True)\n return True\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.refresh_meta","title":"refresh_meta()
","text":"Refresh the Artifact's metadata
Source code in src/sageworks/core/artifacts/model_core.py
def refresh_meta(self):\n \"\"\"Refresh the Artifact's metadata\"\"\"\n self.model_meta = self.meta.model(self.model_name)\n self.latest_model = self.model_meta[\"ModelPackageList\"][0]\n self.description = self.latest_model.get(\"ModelPackageDescription\", \"-\")\n self.training_job_name = self._extract_training_job_name()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.register_endpoint","title":"register_endpoint(endpoint_name)
","text":"Add this endpoint to the set of registered endpoints for the model
Parameters:
Name Type Description Defaultendpoint_name
str
Name of the endpoint
required Source code in src/sageworks/core/artifacts/model_core.py
def register_endpoint(self, endpoint_name: str):\n \"\"\"Add this endpoint to the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Registering Endpoint {endpoint_name} with Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.add(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # Remove any health tags\n self.remove_health_tag(\"no_endpoint\")\n\n # A new endpoint means we need to refresh our inference path\n time.sleep(2) # Give the AWS Metadata a chance to update\n self.endpoint_inference_path = self.get_endpoint_inference_path()\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.remove_endpoint","title":"remove_endpoint(endpoint_name)
","text":"Remove this endpoint from the set of registered endpoints for the model
Parameters:
Name Type Description Defaultendpoint_name
str
Name of the endpoint
required Source code in src/sageworks/core/artifacts/model_core.py
def remove_endpoint(self, endpoint_name: str):\n \"\"\"Remove this endpoint from the set of registered endpoints for the model\n\n Args:\n endpoint_name (str): Name of the endpoint\n \"\"\"\n self.log.important(f\"Removing Endpoint {endpoint_name} from Model {self.uuid}...\")\n registered_endpoints = set(self.sageworks_meta().get(\"sageworks_registered_endpoints\", []))\n registered_endpoints.discard(endpoint_name)\n self.upsert_sageworks_meta({\"sageworks_registered_endpoints\": list(registered_endpoints)})\n\n # If we have NO endpoints, then set a health tag\n if not registered_endpoints:\n self.add_health_tag(\"no_endpoint\")\n self.details(recompute=True)\n\n # Give the AWS Metadata a chance to update\n time.sleep(2)\n
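Both register_endpoint and remove_endpoint round-trip the stored list through a set, so repeated calls are harmless. The sketch below shows that idea on a plain dictionary; the helper names are hypothetical, and sorting the result is an assumption made here for deterministic output (the methods above store list(registered_endpoints)).

```python
def register(meta: dict, endpoint_name: str) -> dict:
    # Adding to a set means registering the same endpoint twice has no effect
    registered = set(meta.get('sageworks_registered_endpoints', []))
    registered.add(endpoint_name)
    return {**meta, 'sageworks_registered_endpoints': sorted(registered)}


def remove(meta: dict, endpoint_name: str) -> dict:
    # discard() does not raise if the endpoint was never registered
    registered = set(meta.get('sageworks_registered_endpoints', []))
    registered.discard(endpoint_name)
    return {**meta, 'sageworks_registered_endpoints': sorted(registered)}
```

An empty list after removal corresponds to the case where the method above adds the no_endpoint health tag.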
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_class_labels","title":"set_class_labels(labels)
","text":"Set the class labels for this Model (if it's a classifier)
Parameters:
Name Type Description Defaultlabels
list[str]
List of class labels
required Source code in src/sageworks/core/artifacts/model_core.py
def set_class_labels(self, labels: list[str]):\n \"\"\"Set the class labels for this Model (if it's a classifier)\n\n Args:\n labels (list[str]): List of class labels\n \"\"\"\n if self.model_type == ModelType.CLASSIFIER:\n self.upsert_sageworks_meta({\"class_labels\": labels})\n else:\n self.log.error(f\"Model {self.model_name} is not a classifier!\")\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_features","title":"set_features(feature_columns)
","text":"Set the features for this Model
Parameters:
Name Type Description Defaultfeature_columns
list[str]
List of feature columns
required Source code in src/sageworks/core/artifacts/model_core.py
def set_features(self, feature_columns: list[str]):\n \"\"\"Set the features for this Model\n\n Args:\n feature_columns (list[str]): List of feature columns\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_features\": feature_columns})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_input","title":"set_input(input, force=False)
","text":"Override: Set the input data for this artifact
Parameters:
Name Type Description Defaultinput
str
Name of input for this artifact
requiredforce
bool
Force the input to be set (default: False)
False
Note: Models do not allow a manual override of the input unless force is True
Source code in src/sageworks/core/artifacts/model_core.py
def set_input(self, input: str, force: bool = False):\n \"\"\"Override: Set the input data for this artifact\n\n Args:\n input (str): Name of input for this artifact\n force (bool, optional): Force the input to be set (default: False)\n Note:\n Models do not allow a manual override of the input unless force is True\n \"\"\"\n if not force:\n self.log.warning(f\"Model {self.uuid}: Does not allow manual override of the input!\")\n return\n\n # Okay we're going to allow this to be set\n self.log.important(f\"{self.uuid}: Setting input to {input}...\")\n self.log.important(\"Be careful with this! It breaks automatic provenance of the artifact!\")\n self.upsert_sageworks_meta({\"sageworks_input\": input})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_pipeline","title":"set_pipeline(pipeline)
","text":"Set the pipeline for this model
Parameters:
Name Type Description Defaultpipeline
str
Pipeline that was used to create this model
required Source code in src/sageworks/core/artifacts/model_core.py
def set_pipeline(self, pipeline: str):\n \"\"\"Set the pipeline for this model\n\n Args:\n pipeline (str): Pipeline that was used to create this model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_pipeline\": pipeline})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.set_target","title":"set_target(target_column)
","text":"Set the target for this Model
Parameters:
Name Type Description Defaulttarget_column
str
Target column for this Model
required Source code in src/sageworks/core/artifacts/model_core.py
def set_target(self, target_column: str):\n \"\"\"Set the target for this Model\n\n Args:\n target_column (str): Target column for this Model\n \"\"\"\n self.upsert_sageworks_meta({\"sageworks_model_target\": target_column})\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.shapley_values","title":"shapley_values(capture_uuid='auto_inference')
","text":"Retrieve the Shapley values for this model
Parameters:
Name Type Description Defaultcapture_uuid
str
Specific capture_uuid (default: auto_inference)
'auto_inference'
Returns:
Type DescriptionUnion[list[DataFrame], DataFrame, None]
pd.DataFrame: DataFrame(s) of the Shapley values or None if not found
Notes: This may or may not exist based on whether an Endpoint ran Shapley
Source code in src/sageworks/core/artifacts/model_core.py
def shapley_values(self, capture_uuid: str = \"auto_inference\") -> Union[list[pd.DataFrame], pd.DataFrame, None]:\n \"\"\"Retrieve the Shapley values for this model\n\n Args:\n capture_uuid (str, optional): Specific capture_uuid (default: auto_inference)\n\n Returns:\n pd.DataFrame: DataFrame(s) of the Shapley values or None if not found\n\n Notes:\n This may or may not exist based on whether an Endpoint ran Shapley\n \"\"\"\n\n # Sanity check the inference path (which may or may not exist)\n if self.endpoint_inference_path is None:\n return None\n\n # Construct the S3 path for the Shapley values\n shapley_s3_path = f\"{self.endpoint_inference_path}/{capture_uuid}\"\n\n # Multiple CSV if classifier\n if self.model_type == ModelType.CLASSIFIER:\n # CSVs for shap values are indexed by prediction class\n # Because we don't know how many classes there are, we need to search through\n # a list of S3 objects in the parent folder\n s3_paths = wr.s3.list_objects(shapley_s3_path)\n return [pull_s3_data(f) for f in s3_paths if \"inference_shap_values\" in f]\n\n # One CSV if regressor\n if self.model_type in [ModelType.REGRESSOR, ModelType.QUANTILE_REGRESSOR]:\n s3_path = f\"{shapley_s3_path}/inference_shap_values.csv\"\n return pull_s3_data(s3_path)\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.size","title":"size()
","text":"Return the size of this data in MegaBytes
Source code in src/sageworks/core/artifacts/model_core.py
def size(self) -> float:\n \"\"\"Return the size of this data in MegaBytes\"\"\"\n return 0.0\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelCore.target","title":"target()
","text":"Return the target for this Model (if supervised, else None)
Returns:
Type Description
Union[str, None]
Target column for this Model (if supervised, else None)
Source code in src/sageworks/core/artifacts/model_core.py
def target(self) -> Union[str, None]:\n \"\"\"Return the target for this Model (if supervised, else None)\n\n Returns:\n str: Target column for this Model (if supervised, else None)\n \"\"\"\n return self.sageworks_meta().get(\"sageworks_model_target\") # Returns None if not found\n
"},{"location":"core_classes/artifacts/model_core/#sageworks.core.artifacts.model_core.ModelType","title":"ModelType
","text":" Bases: Enum
Enumerated Types for SageWorks Model Types
Source code in src/sageworks/core/artifacts/model_core.py
class ModelType(Enum):\n \"\"\"Enumerated Types for SageWorks Model Types\"\"\"\n\n CLASSIFIER = \"classifier\"\n REGRESSOR = \"regressor\"\n CLUSTERER = \"clusterer\"\n TRANSFORMER = \"transformer\"\n PROJECTION = \"projection\"\n UNSUPERVISED = \"unsupervised\"\n QUANTILE_REGRESSOR = \"quantile_regressor\"\n DETECTOR = \"detector\"\n UNKNOWN = \"unknown\"\n
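Because each member carries its string value, a stored model-type string can be mapped back to the enum by value. The sketch below uses a local three-member copy of the enum and a hypothetical helper `parse_model_type` (an assumption for illustration, not a SageWorks function) that falls back to UNKNOWN for unrecognized strings:

```python
from enum import Enum


class ModelType(Enum):
    # Local copy of three members, for illustration only
    CLASSIFIER = 'classifier'
    REGRESSOR = 'regressor'
    UNKNOWN = 'unknown'


def parse_model_type(value):
    # Hypothetical helper: look up by value, fall back to UNKNOWN
    try:
        return ModelType(value)
    except ValueError:
        return ModelType.UNKNOWN


regressor = parse_model_type('regressor')
fallback = parse_model_type('not-a-model-type')
```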
"},{"location":"core_classes/artifacts/monitor_core/","title":"MonitorCore","text":"API Classes
Found a method here you want to use? The API Classes have method pass-through so just call the method on the Monitor API Class and voil\u00e0 it works the same.
MonitorCore class for monitoring SageMaker endpoints
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore","title":"MonitorCore
","text":"Source code in src/sageworks/core/artifacts/monitor_core.py
class MonitorCore:\n def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"MonitorCore Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role_arn = AWSAccountClamp().aws_session.get_sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role_arn, instance_type=self.instance_type)\n\n def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": 
self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n\n def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n\n def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n\n def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a 
baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n\n def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. 
Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n\n def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n 
data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n\n def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n\n @staticmethod\n def process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and 
actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n\n def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n\n def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n 
self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n\n def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n\n def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n\n def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n\n def _get_monitor_json_data(self, s3_path: str) -> Union[pd.DataFrame, None]:\n \"\"\"Internal: Convert the JSON monitoring data into a DataFrame\n Args:\n s3_path(str): The S3 path to the monitoring data\n Returns:\n pd.DataFrame: Monitoring data in DataFrame form (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=s3_path):\n self.log.warning(\"Monitoring data does not exist in S3.\")\n return None\n else:\n raw_json = read_s3_file(s3_path=s3_path)\n monitoring_data = json.loads(raw_json)\n monitoring_df = 
pd.json_normalize(monitoring_data[\"features\"])\n return monitoring_df\n\n def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n\n def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring 
results\"\"\"\n pass\n\n def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__init__","title":"__init__(endpoint_name, instance_type='ml.t3.large')
","text":"MonitorCore Class
Args:
endpoint_name (str): Name of the endpoint to set up monitoring for
instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\". Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...
Source code in src/sageworks/core/artifacts/monitor_core.py
def __init__(self, endpoint_name, instance_type=\"ml.t3.large\"):\n \"\"\"MonitorCore Class\n Args:\n endpoint_name (str): Name of the endpoint to set up monitoring for\n instance_type (str): Instance type to use for monitoring. Defaults to \"ml.t3.large\".\n Other options: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ...\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.endpoint_name = endpoint_name\n self.endpoint = EndpointCore(self.endpoint_name)\n\n # Initialize Class Attributes\n self.sagemaker_session = self.endpoint.sm_session\n self.sagemaker_client = self.endpoint.sm_client\n self.data_capture_path = self.endpoint.endpoint_data_capture_path\n self.monitoring_path = self.endpoint.endpoint_monitoring_path\n self.instance_type = instance_type\n self.monitoring_schedule_name = f\"{self.endpoint_name}-monitoring-schedule\"\n self.monitoring_output_path = f\"{self.monitoring_path}/monitoring_reports\"\n self.baseline_dir = f\"{self.monitoring_path}/baseline\"\n self.baseline_csv_file = f\"{self.baseline_dir}/baseline.csv\"\n self.constraints_json_file = f\"{self.baseline_dir}/constraints.json\"\n self.statistics_json_file = f\"{self.baseline_dir}/statistics.json\"\n\n # Initialize the DefaultModelMonitor\n self.sageworks_role_arn = AWSAccountClamp().aws_session.get_sageworks_execution_role_arn()\n self.model_monitor = DefaultModelMonitor(role=self.sageworks_role_arn, instance_type=self.instance_type)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.__repr__","title":"__repr__()
","text":"String representation of this MonitorCore object
Returns:
Type Description
str
String representation of this MonitorCore object
Source code in src/sageworks/core/artifacts/monitor_core.py
def __repr__(self) -> str:\n \"\"\"String representation of this MonitorCore object\n\n Returns:\n str: String representation of this MonitorCore object\n \"\"\"\n summary_dict = self.summary()\n summary_items = [f\" {repr(key)}: {repr(value)}\" for key, value in summary_dict.items()]\n summary_str = f\"{self.__class__.__name__}: {self.endpoint_name}\\n\" + \",\\n\".join(summary_items)\n return summary_str\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.add_data_capture","title":"add_data_capture(capture_percentage=100)
","text":"Add data capture configuration for the SageMaker endpoint.
Parameters:
Name Type Description Default
capture_percentage
int
Percentage of data to capture. Defaults to 100.
100
Source code in src/sageworks/core/artifacts/monitor_core.py
def add_data_capture(self, capture_percentage=100):\n \"\"\"\n Add data capture configuration for the SageMaker endpoint.\n\n Args:\n capture_percentage (int): Percentage of data to capture. Defaults to 100.\n \"\"\"\n\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Data capture is not currently supported for serverless endpoints.\")\n return\n\n # Check if the endpoint already has data capture configured\n if self.is_data_capture_configured(capture_percentage):\n self.log.important(f\"Data capture {capture_percentage} already configured for {self.endpoint_name}.\")\n return\n\n # Get the current endpoint configuration name\n current_endpoint_config_name = self.endpoint.endpoint_config_name()\n\n # Log the data capture path\n self.log.important(f\"Adding Data Capture to {self.endpoint_name} --> {self.data_capture_path}\")\n self.log.important(\"This normally redeploys the endpoint...\")\n\n # Setup data capture config\n data_capture_config = DataCaptureConfig(\n enable_capture=True,\n sampling_percentage=capture_percentage,\n destination_s3_uri=self.data_capture_path,\n capture_options=[\"Input\", \"Output\"],\n csv_content_types=[\"text/csv\"],\n )\n\n # Create a Predictor instance and update data capture configuration\n predictor = Predictor(self.endpoint_name, sagemaker_session=self.sagemaker_session)\n predictor.update_data_capture_config(data_capture_config=data_capture_config)\n\n # Delete the old endpoint configuration\n self.log.important(f\"Deleting old endpoint configuration: {current_endpoint_config_name}\")\n self.sagemaker_client.delete_endpoint_config(EndpointConfigName=current_endpoint_config_name)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.baseline_exists","title":"baseline_exists()
","text":"Check if baseline files exist in S3.
Returns:
Type Description
bool
True if all files exist, False otherwise.
Source code in src/sageworks/core/artifacts/monitor_core.py
def baseline_exists(self) -> bool:\n \"\"\"\n Check if baseline files exist in S3.\n\n Returns:\n bool: True if all files exist, False otherwise.\n \"\"\"\n\n files = [self.baseline_csv_file, self.constraints_json_file, self.statistics_json_file]\n return all(wr.s3.does_object_exist(file) for file in files)\n
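The check above requires all three baseline files, so a partially written baseline (for example, baseline.csv present but no constraints.json yet) counts as missing. A minimal sketch of that rule, assuming a made-up bucket prefix and using a plain set in place of the per-file S3 existence checks; `baseline_paths` and `baseline_complete` are hypothetical helper names:

```python
def baseline_paths(monitoring_path):
    # Same layout as the attributes set in MonitorCore.__init__
    baseline_dir = f'{monitoring_path}/baseline'
    return [
        f'{baseline_dir}/baseline.csv',
        f'{baseline_dir}/constraints.json',
        f'{baseline_dir}/statistics.json',
    ]


def baseline_complete(monitoring_path, existing_objects):
    # existing_objects stands in for the per-file S3 existence checks
    return all(path in existing_objects for path in baseline_paths(monitoring_path))


# Made-up bucket and prefix, for illustration only
prefix = 's3://example-bucket/endpoints/monitoring/my-endpoint'
partial = {f'{prefix}/baseline/baseline.csv'}
complete = set(baseline_paths(prefix))
```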
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_baseline","title":"create_baseline(recreate=False)
","text":"Create a baseline for monitoring. Serverless endpoints are skipped.
Args:
recreate (bool): If True, recreate the baseline even if it already exists
Notes:
This will create/write three files to the baseline_dir: baseline.csv, constraints.json, statistics.json
Source code in src/sageworks/core/artifacts/monitor_core.py
def create_baseline(self, recreate: bool = False):\n \"\"\"Code to create a baseline for monitoring\n Args:\n recreate (bool): If True, recreate the baseline even if it already exists\n Notes:\n This will create/write three files to the baseline_dir:\n - baseline.csv\n - constraints.json\n - statistics.json\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\n \"You can create a baseline but it can't be used/monitored for serverless endpoints, skipping...\"\n )\n return\n\n if not self.baseline_exists() or recreate:\n # Create a baseline for monitoring (training data from the FeatureSet)\n baseline_df = endpoint_utils.fs_training_data(self.endpoint)\n wr.s3.to_csv(baseline_df, self.baseline_csv_file, index=False)\n\n self.log.important(f\"Creating baseline files for {self.endpoint_name} --> {self.baseline_dir}\")\n self.model_monitor.suggest_baseline(\n baseline_dataset=self.baseline_csv_file,\n dataset_format=DatasetFormat.csv(header=True),\n output_s3_uri=self.baseline_dir,\n )\n else:\n self.log.important(f\"Baseline already exists for {self.endpoint_name}\")\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.create_monitoring_schedule","title":"create_monitoring_schedule(schedule='hourly', recreate=False)
","text":"Set up the monitoring schedule for the model endpoint. Requires an existing baseline; serverless endpoints are skipped.
Args:
schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly). Any value other than daily is treated as hourly.
recreate (bool): If True, recreate the monitoring schedule even if it already exists.
Source code in src/sageworks/core/artifacts/monitor_core.py
def create_monitoring_schedule(self, schedule: str = \"hourly\", recreate: bool = False):\n \"\"\"\n Sets up the monitoring schedule for the model endpoint.\n Args:\n schedule (str): The schedule for the monitoring job (hourly or daily, defaults to hourly).\n recreate (bool): If True, recreate the monitoring schedule even if it already exists.\n \"\"\"\n # Check if this endpoint is a serverless endpoint\n if self.endpoint.is_serverless():\n self.log.warning(\"Monitoring Schedule is not currently supported for serverless endpoints.\")\n return\n\n # Set up the monitoring schedule, name, and output path\n if schedule == \"daily\":\n schedule = CronExpressionGenerator.daily()\n else:\n schedule = CronExpressionGenerator.hourly()\n\n # Check if the baseline exists\n if not self.baseline_exists():\n self.log.warning(f\"Baseline does not exist for {self.endpoint_name}. Call create_baseline() first...\")\n return\n\n # Check if monitoring schedule already exists\n schedule_exists = self.monitoring_schedule_exists()\n\n # If the schedule exists, and we don't want to recreate it, return\n if schedule_exists and not recreate:\n return\n\n # If the schedule exists, delete it\n if schedule_exists:\n self.log.important(f\"Deleting existing monitoring schedule for {self.endpoint_name}...\")\n self.sagemaker_client.delete_monitoring_schedule(MonitoringScheduleName=self.monitoring_schedule_name)\n\n # Set up a NEW monitoring schedule\n self.model_monitor.create_monitoring_schedule(\n monitor_schedule_name=self.monitoring_schedule_name,\n endpoint_input=self.endpoint_name,\n output_s3_uri=self.monitoring_output_path,\n statistics=self.statistics_json_file,\n constraints=self.constraints_json_file,\n schedule_cron_expression=schedule,\n )\n self.log.important(f\"New Monitoring schedule created for {self.endpoint_name}.\")\n
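The method above is a sequence of early returns, and their order determines the outcome. The sketch below restates that order as a pure function so each case can be checked without AWS; `plan_schedule_action` and its return strings are hypothetical names chosen for illustration:

```python
def plan_schedule_action(is_serverless, baseline_exists, schedule_exists, recreate=False):
    # Mirrors the order of the early returns in create_monitoring_schedule()
    if is_serverless:
        return 'skip: serverless endpoints are not supported'
    if not baseline_exists:
        return 'skip: call create_baseline() first'
    if schedule_exists and not recreate:
        return 'keep existing schedule'
    if schedule_exists:
        return 'delete, then create'
    return 'create'


first_time = plan_schedule_action(False, True, False)
rerun = plan_schedule_action(False, True, True)
forced = plan_schedule_action(False, True, True, recreate=True)
```

One consequence of this order: passing recreate=True has no effect when the baseline is missing, because the baseline check comes first.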
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.details","title":"details()
","text":"Return the details of the monitoring for the endpoint
Returns:
Type Description
dict
The details of the monitoring for the endpoint
Source code in src/sageworks/core/artifacts/monitor_core.py
def details(self) -> dict:\n \"\"\"Return the details of the monitoring for the endpoint\n\n Returns:\n dict: The details of the monitoring for the endpoint\n \"\"\"\n # Check if we have data capture\n if self.is_data_capture_configured(capture_percentage=100):\n data_capture_path = self.data_capture_path\n else:\n data_capture_path = None\n\n # Check if we have a baseline\n if self.baseline_exists():\n baseline_csv_file = self.baseline_csv_file\n constraints_json_file = self.constraints_json_file\n statistics_json_file = self.statistics_json_file\n else:\n baseline_csv_file = None\n constraints_json_file = None\n statistics_json_file = None\n\n # Check if we have a monitoring schedule\n if self.monitoring_schedule_exists():\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n\n # General monitoring details\n schedule_name = schedule_details.get(\"MonitoringScheduleName\")\n schedule_status = schedule_details.get(\"MonitoringScheduleStatus\")\n output_path = self.monitoring_output_path\n last_run_details = self.last_run_details()\n else:\n schedule_name = None\n schedule_status = \"Not Scheduled\"\n schedule_details = None\n output_path = None\n last_run_details = None\n\n # General monitoring details\n general = {\n \"data_capture_path\": data_capture_path,\n \"baseline_csv_file\": baseline_csv_file,\n \"baseline_constraints_json_file\": constraints_json_file,\n \"baseline_statistics_json_file\": statistics_json_file,\n \"monitoring_schedule_name\": schedule_name,\n \"monitoring_output_path\": output_path,\n \"monitoring_schedule_status\": schedule_status,\n \"monitoring_schedule_details\": schedule_details,\n }\n if last_run_details:\n general.update(last_run_details)\n return general\n
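When a schedule exists, details() merges in the last-run fields that last_run_details() reads from the describe_monitoring_schedule response. A minimal sketch of that extraction on a plain dict, assuming a made-up response; `extract_last_run` is a hypothetical helper name, and the real response carries a datetime for ScheduledTime rather than the string used here:

```python
def extract_last_run(schedule_details):
    # A schedule that has never run has no LastMonitoringExecutionSummary key
    last_run = schedule_details.get('LastMonitoringExecutionSummary', {})
    return {
        'last_run_status': last_run.get('MonitoringExecutionStatus'),
        'last_run_time': str(last_run.get('ScheduledTime')),
        'failure_reason': last_run.get('FailureReason'),
    }


# Made-up response fragment, for illustration only
completed = extract_last_run({
    'MonitoringScheduleName': 'my-endpoint-monitoring-schedule',
    'LastMonitoringExecutionSummary': {
        'MonitoringExecutionStatus': 'Completed',
        'ScheduledTime': '2024-03-10 01:00:00',
    },
})
never_ran = extract_last_run({'MonitoringScheduleName': 'my-endpoint-monitoring-schedule'})
```

Because the time is passed through str(), a schedule that has never run reports last_run_time as the string 'None' rather than the value None.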
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_baseline","title":"get_baseline()
","text":"Code to get the baseline CSV from the S3 baseline directory
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_baseline(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the baseline CSV from the S3 baseline directory\n\n Returns:\n pd.DataFrame: The baseline CSV as a DataFrame (None if it doesn't exist)\n \"\"\"\n # Read the monitoring data from S3\n if not wr.s3.does_object_exist(path=self.baseline_csv_file):\n self.log.warning(\"baseline.csv data does not exist in S3.\")\n return None\n else:\n return wr.s3.read_csv(self.baseline_csv_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_constraints","title":"get_constraints()
","text":"Code to get the constraints from the baseline
Returns:
Type Description
Union[DataFrame, None]
pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_constraints(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the constraints from the baseline\n\n Returns:\n pd.DataFrame: The constraints from the baseline (constraints.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.constraints_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_latest_data_capture","title":"get_latest_data_capture()
","text":"Get the latest data capture from S3.
Returns:
Type Description
(DataFrame, DataFrame)
Flattened and processed DataFrames for the input and output data, in that order. Returns (None, None) if no capture files are found.
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_latest_data_capture(self) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Get the latest data capture from S3.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n # List files in the specified S3 path\n files = wr.s3.list_objects(self.data_capture_path)\n\n if files:\n print(f\"Found {len(files)} files in {self.data_capture_path}. Reading the most recent file.\")\n\n # Read the most recent file into a DataFrame\n df = wr.s3.read_json(path=files[-1], lines=True) # Reads the last file assuming it's the most recent one\n\n # Process the captured data and return the input and output DataFrames\n return self.process_captured_data(df)\n else:\n print(f\"No data capture files found in {self.data_capture_path}.\")\n return None, None\n
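The method above reads the last entry of the listing and assumes it is the most recent file. A small sketch that makes the ordering explicit instead of relying on listing order; it assumes the capture keys embed a zero-padded year/month/day/hour path so that lexicographic order matches chronological order, and both the paths and the helper name `latest_capture_file` are made up for illustration:

```python
def latest_capture_file(paths):
    # Hypothetical helper: pick the lexicographically largest key,
    # which is the newest one under the zero-padded date-path assumption
    return max(paths) if paths else None


# Made-up capture keys, deliberately out of order
captures = [
    's3://example-bucket/data_capture/my-endpoint/AllTraffic/2024/03/09/23/10-00-000-aaaa.jsonl',
    's3://example-bucket/data_capture/my-endpoint/AllTraffic/2024/03/10/01/05-30-123-bbbb.jsonl',
    's3://example-bucket/data_capture/my-endpoint/AllTraffic/2024/03/10/00/59-59-999-cccc.jsonl',
]
newest = latest_capture_file(captures)
```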
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.get_statistics","title":"get_statistics()
","text":"Code to get the statistics from the baseline
Returns: Union[DataFrame, None]: The statistics from the baseline (statistics.json) (None if it doesn't exist)
Source code in src/sageworks/core/artifacts/monitor_core.py
def get_statistics(self) -> Union[pd.DataFrame, None]:\n \"\"\"Code to get the statistics from the baseline\n\n Returns:\n pd.DataFrame: The statistics from the baseline (statistics.json) (None if it doesn't exist)\n \"\"\"\n return self._get_monitor_json_data(self.statistics_json_file)\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.is_data_capture_configured","title":"is_data_capture_configured(capture_percentage)
","text":"Check if data capture is already configured on the endpoint. Args: capture_percentage (int): Expected data capture percentage. Returns: bool: True if data capture is already configured, False otherwise.
Source code in src/sageworks/core/artifacts/monitor_core.py
def is_data_capture_configured(self, capture_percentage):\n \"\"\"\n Check if data capture is already configured on the endpoint.\n Args:\n capture_percentage (int): Expected data capture percentage.\n Returns:\n bool: True if data capture is already configured, False otherwise.\n \"\"\"\n try:\n endpoint_config_name = self.endpoint.endpoint_config_name()\n endpoint_config = self.sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n data_capture_config = endpoint_config.get(\"DataCaptureConfig\", {})\n\n # Check if data capture is enabled and the percentage matches\n is_enabled = data_capture_config.get(\"EnableCapture\", False)\n current_percentage = data_capture_config.get(\"InitialSamplingPercentage\", 0)\n return is_enabled and current_percentage == capture_percentage\n except Exception as e:\n self.log.error(f\"Error checking data capture configuration: {e}\")\n return False\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.last_run_details","title":"last_run_details()
","text":"Return the details of the last monitoring run for the endpoint
Returns: Union[dict, None]: The details of the last monitoring run for the endpoint (None if no monitoring schedule)
Source code in src/sageworks/core/artifacts/monitor_core.py
def last_run_details(self) -> Union[dict, None]:\n \"\"\"Return the details of the last monitoring run for the endpoint\n\n Returns:\n dict: The details of the last monitoring run for the endpoint (None if no monitoring schedule)\n \"\"\"\n # Check if we have a monitoring schedule\n if not self.monitoring_schedule_exists():\n return None\n\n # Get the details of the last monitoring run\n schedule_details = self.sagemaker_client.describe_monitoring_schedule(\n MonitoringScheduleName=self.monitoring_schedule_name\n )\n last_run_status = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"MonitoringExecutionStatus\")\n last_run_time = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"ScheduledTime\")\n failure_reason = schedule_details.get(\"LastMonitoringExecutionSummary\", {}).get(\"FailureReason\")\n return {\n \"last_run_status\": last_run_status,\n \"last_run_time\": str(last_run_time),\n \"failure_reason\": failure_reason,\n }\n
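The field extraction above can be sketched as a pure function over the response dictionary, which makes it easy to exercise without AWS access. This is a minimal sketch, not the SageWorks implementation: the helper name `summarize_last_run` is hypothetical, and unlike the method above it returns None when the schedule has no execution summary yet, instead of a dictionary of empty values.

```python
from typing import Optional


def summarize_last_run(schedule_details: dict) -> Optional[dict]:
    """Pull the last-run fields out of a describe_monitoring_schedule style response.

    Hypothetical helper for illustration; the key names are the ones used in the source above.
    """
    last_run = schedule_details.get("LastMonitoringExecutionSummary")
    if not last_run:
        # The schedule exists but has not executed yet
        return None
    return {
        "last_run_status": last_run.get("MonitoringExecutionStatus"),
        "last_run_time": str(last_run.get("ScheduledTime")),
        "failure_reason": last_run.get("FailureReason"),
    }


# Example with a hand-built response (made-up values, not real AWS output)
example = {
    "LastMonitoringExecutionSummary": {
        "MonitoringExecutionStatus": "Completed",
        "ScheduledTime": "2024-01-01T00:00:00",
    }
}
details = summarize_last_run(example)
```

Returning None for a schedule that has never run lets a caller distinguish "no runs yet" from "a run with missing fields"; whether that distinction matters depends on how the summary is displayed.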
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.monitoring_schedule_exists","title":"monitoring_schedule_exists()
","text":"Code to figure out if a monitoring schedule already exists for this endpoint
Source code in src/sageworks/core/artifacts/monitor_core.py
def monitoring_schedule_exists(self):\n \"\"\"Code to figure out if a monitoring schedule already exists for this endpoint\"\"\"\n existing_schedules = self.sagemaker_client.list_monitoring_schedules(MaxResults=100).get(\n \"MonitoringScheduleSummaries\", []\n )\n if any(schedule[\"MonitoringScheduleName\"] == self.monitoring_schedule_name for schedule in existing_schedules):\n self.log.info(f\"Monitoring schedule already exists for {self.endpoint_name}.\")\n return True\n else:\n self.log.info(f\"Could not find a Monitoring schedule for {self.endpoint_name}.\")\n return False\n
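The lookup above reads a single page of up to 100 schedules. For accounts that may hold more than that, the same check can follow the `NextToken` field that the SageMaker `ListMonitoringSchedules` API returns. The following is a minimal sketch under that assumption; `schedule_exists` and `FakeClient` are hypothetical names, and the fake client exists only so the loop can be run without AWS credentials.

```python
def schedule_exists(sagemaker_client, schedule_name: str, page_size: int = 100) -> bool:
    """Return True if a monitoring schedule with this exact name exists, following NextToken pages."""
    kwargs = {"MaxResults": page_size}
    while True:
        response = sagemaker_client.list_monitoring_schedules(**kwargs)
        for schedule in response.get("MonitoringScheduleSummaries", []):
            if schedule["MonitoringScheduleName"] == schedule_name:
                return True
        next_token = response.get("NextToken")
        if not next_token:
            # No more pages and no match
            return False
        kwargs["NextToken"] = next_token


class FakeClient:
    """Hypothetical stand-in for a boto3 SageMaker client, used only to exercise the loop."""

    def __init__(self, pages):
        self.pages = pages  # list of pages, each a list of schedule names

    def list_monitoring_schedules(self, **kwargs):
        index = int(kwargs.get("NextToken", 0))
        names = self.pages[index]
        page = {"MonitoringScheduleSummaries": [{"MonitoringScheduleName": n} for n in names]}
        if index + 1 < len(self.pages):
            page["NextToken"] = str(index + 1)
        return page


client = FakeClient([["monitor-a", "monitor-b"], ["monitor-c"]])
found = schedule_exists(client, "monitor-c")
```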
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.process_captured_data","title":"process_captured_data(df)
staticmethod
","text":"Process the captured data DataFrame to extract and flatten the nested data.
Parameters: df (DataFrame): DataFrame with captured data (required).
Returns: DataFrame (input), DataFrame (output): Flattened and processed DataFrames for input and output data.
Source code in src/sageworks/core/artifacts/monitor_core.py
@staticmethod\ndef process_captured_data(df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):\n \"\"\"\n Process the captured data DataFrame to extract and flatten the nested data.\n\n Args:\n df (DataFrame): DataFrame with captured data.\n\n Returns:\n DataFrame (input), DataFrame(output): Flattened and processed DataFrames for input and output data.\n \"\"\"\n processed_records = []\n\n # Phase1: Process the AWS Data Capture format into a flatter DataFrame\n for _, row in df.iterrows():\n # Extract data from captureData dictionary\n capture_data = row[\"captureData\"]\n input_data = capture_data[\"endpointInput\"]\n output_data = capture_data[\"endpointOutput\"]\n\n # Process input and output, both meta and actual data\n record = {\n \"input_content_type\": input_data.get(\"observedContentType\"),\n \"input_encoding\": input_data.get(\"encoding\"),\n \"input\": input_data.get(\"data\"),\n \"output_content_type\": output_data.get(\"observedContentType\"),\n \"output_encoding\": output_data.get(\"encoding\"),\n \"output\": output_data.get(\"data\"),\n }\n processed_records.append(record)\n processed_df = pd.DataFrame(processed_records)\n\n # Phase2: Process the input and output 'data' columns into separate DataFrames\n input_df_list = []\n output_df_list = []\n for _, row in processed_df.iterrows():\n input_df = pd.read_csv(StringIO(row[\"input\"]))\n input_df_list.append(input_df)\n output_df = pd.read_csv(StringIO(row[\"output\"]))\n output_df_list.append(output_df)\n\n # Return the input and output DataFrames\n return pd.concat(input_df_list), pd.concat(output_df_list)\n
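To make the two phases above concrete, here is a small standard-library sketch of the same flattening idea. It is an illustration, not the SageWorks code: it uses `csv.DictReader` in place of pandas, the function name `flatten_capture_records` is hypothetical, and the sample record is made up. It assumes, as the pandas version does by default, that each `data` field holds CSV text whose first line is a header.

```python
import csv
from io import StringIO


def flatten_capture_records(records: list) -> tuple:
    """Split data-capture records into input rows and output rows (lists of dicts)."""
    input_rows, output_rows = [], []
    for record in records:
        capture_data = record["captureData"]
        # Each 'data' field is assumed to be CSV text with a header line
        input_rows.extend(csv.DictReader(StringIO(capture_data["endpointInput"]["data"])))
        output_rows.extend(csv.DictReader(StringIO(capture_data["endpointOutput"]["data"])))
    return input_rows, output_rows


# One made-up capture record in the nested layout the method above expects
sample = [
    {
        "captureData": {
            "endpointInput": {
                "observedContentType": "text/csv",
                "encoding": "CSV",
                "data": "length,diameter\n0.45,0.36\n",
            },
            "endpointOutput": {
                "observedContentType": "text/csv",
                "encoding": "CSV",
                "data": "prediction\n9.1\n",
            },
        }
    }
]
inputs, outputs = flatten_capture_records(sample)
```

Values stay as strings in this sketch; the pandas version infers numeric types when it parses each CSV payload.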
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.setup_alerts","title":"setup_alerts()
","text":"Code to set up alerts based on monitoring results
Source code in src/sageworks/core/artifacts/monitor_core.py
def setup_alerts(self):\n \"\"\"Code to set up alerts based on monitoring results\"\"\"\n pass\n
"},{"location":"core_classes/artifacts/monitor_core/#sageworks.core.artifacts.monitor_core.MonitorCore.summary","title":"summary()
","text":"Return the summary of information about the endpoint monitor
Returns: dict: Summary of information about the endpoint monitor
Source code in src/sageworks/core/artifacts/monitor_core.py
def summary(self) -> dict:\n \"\"\"Return the summary of information about the endpoint monitor\n\n Returns:\n dict: Summary of information about the endpoint monitor\n \"\"\"\n if self.endpoint.is_serverless():\n return {\n \"endpoint_type\": \"serverless\",\n \"data_capture\": \"not supported\",\n \"baseline\": \"not supported\",\n \"monitoring_schedule\": \"not supported\",\n }\n else:\n summary = {\n \"endpoint_type\": \"realtime\",\n \"data_capture\": self.is_data_capture_configured(capture_percentage=100),\n \"baseline\": self.baseline_exists(),\n \"monitoring_schedule\": self.monitoring_schedule_exists(),\n }\n summary.update(self.last_run_details() or {})\n return summary\n
"},{"location":"core_classes/artifacts/overview/","title":"SageWorks Artifacts","text":"API Classes
For most users, the API Classes will provide all the general functionality to create a full AWS ML Pipeline.
"},{"location":"core_classes/artifacts/overview/#welcome-to-the-sageworks-core-artifact-classes","title":"Welcome to the SageWorks Core Artifact Classes","text":"These classes provide low-level APIs for the SageWorks package. They interact more directly with AWS Services and are therefore more complex, with a fairly large number of methods.
These DataLoader Classes are intended to load larger datasets into AWS. For large data we need to use AWS Glue Jobs/Batch Jobs, and in general the process is a bit more complicated and has fewer features.
If you have smaller data, please see DataLoaders Light.
Welcome to the SageWorks DataLoaders Heavy Classes
These classes provide low-level APIs for loading larger data into AWS services
S3HeavyToDataSource
","text":"Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
class S3HeavyToDataSource:\n def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n\n @staticmethod\n def resolve_choice_fields(dyf):\n # Get schema fields\n schema_fields = dyf.schema().fields\n\n # Collect choice fields\n choice_fields = [(field.name, \"cast:long\") for field in schema_fields if field.dataType.typeName() == \"choice\"]\n print(f\"Choice Fields: {choice_fields}\")\n\n # If there are choice fields, resolve them\n if choice_fields:\n dyf = dyf.resolveChoice(specs=choice_fields)\n\n return dyf\n\n def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types (transform() may pass None)\n spark_df = dyf.toDF()\n for column in time_columns or []:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n\n @staticmethod\n def remove_periods_from_columns(dyf: 
DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n\n def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n ):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of 
lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_columns(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n 
glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
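The Data Catalog portion of `transform()` comes down to two small lookups: a Spark-to-Athena type translation and a choice of Hive storage classes by output format. A minimal standalone sketch of those lookups follows. The function `storage_formats` is a hypothetical name introduced here, and raising on an unknown format is a design choice of this sketch (the method above treats anything other than 'parquet' as ORC).

```python
def to_athena_type(spark_type: str) -> str:
    """Translate a Spark type name to its Athena equivalent ('long' is the only remapped type)."""
    athena_type_map = {"long": "bigint"}
    return athena_type_map.get(spark_type, spark_type)


def storage_formats(output_format: str) -> dict:
    """Return the Hive input format, output format, and SerDe class names for 'parquet' or 'orc'."""
    if output_format == "parquet":
        return {
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        }
    if output_format == "orc":
        return {
            "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
        }
    # Fail loudly instead of silently registering the table with the wrong classes
    raise ValueError(f"Unsupported output_format: {output_format!r} (expected 'parquet' or 'orc')")


# Column list in the shape the Glue Data Catalog expects (made-up schema for illustration)
spark_schema = [("id", "long"), ("name", "string")]
columns = [{"Name": name, "Type": to_athena_type(spark_type)} for name, spark_type in spark_schema]
orc_formats = storage_formats("orc")
```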
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.__init__","title":"__init__(glue_context, input_uuid, output_uuid)
","text":"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource
Parameters: glue_context (GlueContext): AWS Glue-specific wrapper around SparkContext (required). input_uuid (str): The S3 Path to the files to be loaded (required). output_uuid (str): The UUID of the SageWorks DataSource to be created (required).
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def __init__(self, glue_context: GlueContext, input_uuid: str, output_uuid: str):\n \"\"\"S3HeavyToDataSource: Class to move HEAVY S3 Files into a SageWorks DataSource\n\n Args:\n glue_context: GlueContext, AWS Glue Specific wrapper around SparkContext\n input_uuid (str): The S3 Path to the files to be loaded\n output_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n self.log = glue_context.get_logger()\n\n # FIXME: Pull these from Parameter Store or Config\n self.input_uuid = input_uuid\n self.output_uuid = output_uuid\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n sageworks_bucket = \"s3://sandbox-sageworks-artifacts\"\n self.data_sources_s3_path = sageworks_bucket + \"/data-sources\"\n\n # Our Spark Context\n self.glue_context = glue_context\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.remove_periods_from_columns","title":"remove_periods_from_columns(dyf)
staticmethod
","text":"Remove periods from column names in the DynamicFrame Args: dyf (DynamicFrame): The DynamicFrame to convert Returns: DynamicFrame: The converted DynamicFrame
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
@staticmethod\ndef remove_periods_from_columns(dyf: DynamicFrame) -> DynamicFrame:\n \"\"\"Remove periods from column names in the DynamicFrame\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n # Extract the column names from the schema\n old_column_names = [field.name for field in dyf.schema().fields]\n\n # Create a new list of renamed column names\n new_column_names = [name.replace(\".\", \"_\") for name in old_column_names]\n print(old_column_names)\n print(new_column_names)\n\n # Create a new DynamicFrame with renamed columns\n for c_old, c_new in zip(old_column_names, new_column_names):\n dyf = dyf.rename_field(f\"`{c_old}`\", c_new)\n return dyf\n
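The renaming rule above replaces every period with an underscore, which means two distinct columns such as `a.b` and `a_b` would end up with the same name. A minimal sketch of the same rule with a collision check is shown below; `rename_periods` is a hypothetical helper operating on plain strings, not on a Glue DynamicFrame.

```python
def rename_periods(column_names: list) -> dict:
    """Map each column name to a period-free name, raising if two names would collide."""
    mapping = {}
    seen = {}  # new name -> original name that produced it
    for old_name in column_names:
        new_name = old_name.replace(".", "_")
        if new_name in seen:
            raise ValueError(f"'{old_name}' and '{seen[new_name]}' both map to '{new_name}'")
        seen[new_name] = old_name
        mapping[old_name] = new_name
    return mapping


# Column names of the kind Relationalize produces for nested JSON (made-up example)
mapping = rename_periods(["id", "user.name", "user.address.city"])
```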
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.timestamp_conversions","title":"timestamp_conversions(dyf, time_columns=[])
","text":"Convert columns in the DynamicFrame to the correct data types Args: dyf (DynamicFrame): The DynamicFrame to convert time_columns (list): A list of column names to convert to timestamp Returns: DynamicFrame: The converted DynamicFrame
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def timestamp_conversions(self, dyf: DynamicFrame, time_columns: list = []) -> DynamicFrame:\n \"\"\"Convert columns in the DynamicFrame to the correct data types\n Args:\n dyf (DynamicFrame): The DynamicFrame to convert\n time_columns (list): A list of column names to convert to timestamp\n Returns:\n DynamicFrame: The converted DynamicFrame\n \"\"\"\n\n # Convert the timestamp columns to timestamp types (transform() may pass None)\n spark_df = dyf.toDF()\n for column in time_columns or []:\n spark_df = spark_df.withColumn(column, to_timestamp(col(column)))\n\n # Convert the Spark DataFrame back to a Glue DynamicFrame and return\n return DynamicFrame.fromDF(spark_df, self.glue_context, \"output_dyf\")\n
"},{"location":"core_classes/transforms/data_loaders_heavy/#sageworks.core.transforms.data_loaders.heavy.S3HeavyToDataSource.transform","title":"transform(input_type='json', timestamp_columns=None, output_format='parquet')
","text":"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database Args: input_type (str): The type of input files, either 'csv' or 'json' timestamp_columns (list): A list of column names to convert to timestamp output_format (str): The format of the output files, either 'parquet' or 'orc'
Source code in src/sageworks/core/transforms/data_loaders/heavy/s3_heavy_to_data_source.py
def transform(\n self,\n input_type: str = \"json\",\n timestamp_columns: list = None,\n output_format: str = \"parquet\",\n):\n \"\"\"Convert the CSV or JSON data into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n Args:\n input_type (str): The type of input files, either 'csv' or 'json'\n timestamp_columns (list): A list of column names to convert to timestamp\n output_format (str): The format of the output files, either 'parquet' or 'orc'\n \"\"\"\n\n # Add some tags here\n tags = [\"heavy\"]\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Read JSONL files from S3 and infer schema dynamically\n self.log.info(f\"Reading JSONL files from {self.input_uuid}...\")\n input_dyf = self.glue_context.create_dynamic_frame.from_options(\n connection_type=\"s3\",\n connection_options={\n \"paths\": [self.input_uuid],\n \"recurse\": True,\n \"gzip\": True,\n },\n format=input_type,\n # format_options={'jsonPath': 'auto'}, Look into this later\n )\n self.log.info(\"Incoming DataFrame...\")\n input_dyf.show(5)\n input_dyf.printSchema()\n\n # Resolve Choice fields\n resolved_dyf = self.resolve_choice_fields(input_dyf)\n\n # The next couple of lines of code is for un-nesting any nested JSON\n # Create a Dynamic Frame Collection (dfc)\n dfc = Relationalize.apply(resolved_dyf, name=\"root\")\n\n # Aggregate the collection into a single dynamic frame\n output_dyf = dfc.select(\"root\")\n\n print(\"Before TimeStamp Conversions\")\n output_dyf.printSchema()\n\n # Convert any timestamp columns\n output_dyf = self.timestamp_conversions(output_dyf, timestamp_columns)\n\n # Relationalize will put periods in the column names. 
This will cause\n # problems later when we try to create a FeatureSet from this DataSource\n output_dyf = self.remove_periods_from_columns(output_dyf)\n\n print(\"After TimeStamp Conversions and Removing Periods from column names\")\n output_dyf.printSchema()\n\n # Write Parquet files to S3\n self.log.info(f\"Writing Parquet files to {s3_storage_path}...\")\n self.glue_context.purge_s3_path(s3_storage_path, {\"retentionPeriod\": 0})\n self.glue_context.write_dynamic_frame.from_options(\n frame=output_dyf,\n connection_type=\"s3\",\n connection_options={\n \"path\": s3_storage_path\n # \"partitionKeys\": [\"year\", \"month\", \"day\"],\n },\n format=output_format,\n )\n\n # Set up our SageWorks metadata (description, tags, etc)\n description = f\"SageWorks data source: {self.output_uuid}\"\n sageworks_meta = {\"sageworks_tags\": self.tag_delimiter.join(tags)}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n\n # Create a new table in the AWS Data Catalog\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n\n # Converting the Spark Types to Athena Types\n def to_athena_type(col):\n athena_type_map = {\"long\": \"bigint\"}\n spark_type = col.dataType.typeName()\n return athena_type_map.get(spark_type, spark_type)\n\n column_name_types = [{\"Name\": col.name, \"Type\": to_athena_type(col)} for col in output_dyf.schema().fields]\n\n # Our parameters for the Glue Data Catalog are different for Parquet and ORC\n if output_format == \"parquet\":\n glue_input_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\"\n serialization_library = \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n else:\n glue_input_format = \"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\"\n glue_output_format = \"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\"\n serialization_library = 
\"org.apache.hadoop.hive.ql.io.orc.OrcSerde\"\n\n table_input = {\n \"Name\": self.output_uuid,\n \"Description\": description,\n \"Parameters\": sageworks_meta,\n \"TableType\": \"EXTERNAL_TABLE\",\n \"StorageDescriptor\": {\n \"Columns\": column_name_types,\n \"Location\": s3_storage_path,\n \"InputFormat\": glue_input_format,\n \"OutputFormat\": glue_output_format,\n \"Compressed\": True,\n \"SerdeInfo\": {\n \"SerializationLibrary\": serialization_library,\n },\n },\n }\n\n # Delete the Data Catalog Table if it already exists\n glue_client = boto3.client(\"glue\")\n try:\n glue_client.delete_table(DatabaseName=\"sageworks\", Name=self.output_uuid)\n self.log.info(f\"Deleting Data Catalog Table: {self.output_uuid}...\")\n except ClientError as e:\n if e.response[\"Error\"][\"Code\"] != \"EntityNotFoundException\":\n raise e\n\n self.log.info(f\"Creating Data Catalog Table: {self.output_uuid}...\")\n glue_client.create_table(DatabaseName=\"sageworks\", TableInput=table_input)\n\n # All done!\n self.log.info(f\"{self.input_uuid} --> {self.output_uuid} complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/","title":"DataLoaders Light","text":"API Classes
For most users, the API Classes will provide all the general functionality to create a full AWS ML Pipeline.
These DataLoader Classes are intended to load smaller datasets into AWS. If you have large data, please see DataLoaders Heavy.
Welcome to the SageWorks DataLoaders Light Classes
These classes provide low-level APIs for loading smaller data into AWS services
CSVToDataSource
","text":" Bases: Transform
CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Common Usage:
csv_to_data = CSVToDataSource(csv_file_path, data_uuid)\ncsv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\ncsv_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
class CSVToDataSource(Transform):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n csv_to_data = CSVToDataSource(csv_file_path, data_uuid)\n csv_to_data.set_output_tags([\"abalone\", \"csv\", \"whatever\"])\n csv_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.__init__","title":"__init__(csv_file_path, data_uuid)
","text":"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
Parameters: csv_file_path (str): The path to the CSV file to be transformed (required). data_uuid (str): The UUID of the SageWorks DataSource to be created (required).
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def __init__(self, csv_file_path: str, data_uuid: str):\n \"\"\"CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource\n\n Args:\n csv_file_path (str): The path to the CSV file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(csv_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.CSVToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/csv_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local CSV file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n csv_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {csv_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local CSV as a Pandas DataFrame\n df = pd.read_csv(self.input_uuid, low_memory=False)\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{csv_file} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource","title":"JSONToDataSource
","text":" Bases: Transform
JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Common Usage:
json_to_data = JSONToDataSource(json_file_path, data_uuid)\njson_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\njson_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
class JSONToDataSource(Transform):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n json_to_data = JSONToDataSource(json_file_path, data_uuid)\n json_to_data.set_output_tags([\"abalone\", \"json\", \"whatever\"])\n json_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.__init__","title":"__init__(json_file_path, data_uuid)
","text":"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource
Parameters:
json_file_path (str, required): The path to the JSON file to be transformed
data_uuid (str, required): The UUID of the SageWorks DataSource to be created
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def __init__(self, json_file_path: str, data_uuid: str):\n \"\"\"JSONToDataSource: Class to move local JSON Files into a SageWorks DataSource\n\n Args:\n json_file_path (str): The path to the JSON file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(json_file_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.LOCAL_FILE\n self.output_type = TransformOutput.DATA_SOURCE\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.JSONToDataSource.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/json_to_data_source.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the local JSON file into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Report the transformation initiation\n json_file = os.path.basename(self.input_uuid)\n self.log.info(f\"Starting {json_file} --> DataSource: {self.output_uuid}...\")\n\n # Read in the Local JSON as a Pandas DataFrame\n df = pd.read_json(self.input_uuid, lines=True)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{json_file} --> DataSource: {self.output_uuid} Complete!\")\n
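The lines=True argument passed to pd.read_json in the listing above means the input file is expected to be in JSON Lines form: one JSON object per line rather than a single JSON array. The following is a minimal standard-library sketch of that layout. The helper name read_json_lines and the sample rows are hypothetical and only illustrate the file format; they are not part of SageWorks.

```python
import io
import json


def read_json_lines(handle):
    # Parse one JSON object per non-empty line
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical sample rows, written out one JSON object per line
rows = [{'id': 1, 'rings': 9}, {'id': 2, 'rings': 7}]
buffer = io.StringIO()
for row in rows:
    print(json.dumps(row), file=buffer)
buffer.seek(0)

records = read_json_lines(buffer)
print(records)
```

A file holding one top-level JSON array would not fit this layout and would likely need converting to one object per line before loading.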
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight","title":"S3ToDataSourceLight
","text":" Bases: Transform
S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource
Common Usage:\ns3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\ns3_to_data.set_output_tags([\"abalone\", \"whatever\"])\ns3_to_data.transform()\n
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
class S3ToDataSourceLight(Transform):\n \"\"\"S3ToDataSourceLight: Class to move LIGHT S3 Files into a SageWorks DataSource\n\n Common Usage:\n ```python\n s3_to_data = S3ToDataSourceLight(s3_path, data_uuid, datatype=\"csv/json\")\n s3_to_data.set_output_tags([\"abalone\", \"whatever\"])\n s3_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n\n def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto3_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n\n def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto3_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto3_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 
40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.__init__","title":"__init__(s3_path, data_uuid, datatype='csv')
","text":"S3ToDataSourceLight Initialization
Parameters:
s3_path (str, required): The S3 Path to the file to be transformed
data_uuid (str, required): The UUID of the SageWorks DataSource to be created
datatype (str, default 'csv'): The datatype of the file to be transformed
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def __init__(self, s3_path: str, data_uuid: str, datatype: str = \"csv\"):\n \"\"\"S3ToDataSourceLight Initialization\n\n Args:\n s3_path (str): The S3 Path to the file to be transformed\n data_uuid (str): The UUID of the SageWorks DataSource to be created\n datatype (str): The datatype of the file to be transformed (defaults to \"csv\")\n \"\"\"\n\n # Call superclass init\n super().__init__(s3_path, data_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.S3_OBJECT\n self.output_type = TransformOutput.DATA_SOURCE\n self.datatype = datatype\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.input_size_mb","title":"input_size_mb()
","text":"Get the size of the input S3 object in MBytes
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def input_size_mb(self) -> int:\n \"\"\"Get the size of the input S3 object in MBytes\"\"\"\n size_in_bytes = wr.s3.size_objects(self.input_uuid, boto3_session=self.boto3_session)[self.input_uuid]\n size_in_mb = round(size_in_bytes / 1_000_000)\n return size_in_mb\n
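Because input_size_mb divides by 1,000,000 and rounds to the nearest integer, the size check in transform_impl (object_megabytes > 100) is applied to the rounded value, not the exact byte count. The sketch below reproduces only that arithmetic. The names size_mb, is_light and LIGHT_LIMIT_MB are hypothetical and chosen for illustration.

```python
LIGHT_LIMIT_MB = 100  # threshold used by transform_impl in the class listing


def size_mb(size_in_bytes):
    # Same arithmetic as input_size_mb: decimal megabytes, rounded to an integer
    return round(size_in_bytes / 1_000_000)


def is_light(size_in_bytes):
    # Hypothetical helper: True when the object would pass the size check
    return size_mb(size_in_bytes) <= LIGHT_LIMIT_MB


print(size_mb(1_499_999))     # 1
print(is_light(100_400_000))  # True, rounds to 100
print(is_light(100_600_000))  # False, rounds to 101
```

In other words, an object slightly over 100 MB can still pass the check, since the comparison happens after rounding.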
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform\"\"\"\n self.log.info(\"Post-Transform: S3 to DataSource...\")\n
"},{"location":"core_classes/transforms/data_loaders_light/#sageworks.core.transforms.data_loaders.light.S3ToDataSourceLight.transform_impl","title":"transform_impl(overwrite=True)
","text":"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Source code in src/sageworks/core/transforms/data_loaders/light/s3_to_data_source_light.py
def transform_impl(self, overwrite: bool = True):\n \"\"\"Convert the S3 CSV data into Parquet Format in the SageWorks Data Sources Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n \"\"\"\n\n # Sanity Check for S3 Object size\n object_megabytes = self.input_size_mb()\n if object_megabytes > 100:\n self.log.error(f\"S3 Object too big ({object_megabytes} MBytes): Use the S3ToDataSourceHeavy class!\")\n return\n\n # Read in the S3 CSV as a Pandas DataFrame\n if self.datatype == \"csv\":\n df = wr.s3.read_csv(self.input_uuid, low_memory=False, boto3_session=self.boto3_session)\n else:\n df = wr.s3.read_json(self.input_uuid, lines=True, boto3_session=self.boto3_session)\n\n # Temporary hack to limit the number of columns in the dataframe\n if len(df.columns) > 40:\n self.log.warning(f\"{self.input_uuid} Too Many Columns! Talk to SageWorks Support...\")\n\n # Convert object columns before sending to SageWorks Data Source\n df = convert_object_columns(df)\n\n # Use the SageWorks Pandas to Data Source class\n pandas_to_data = PandasToData(self.output_uuid)\n pandas_to_data.set_input(df)\n pandas_to_data.set_output_tags(self.output_tags)\n pandas_to_data.add_output_meta(self.output_meta)\n pandas_to_data.transform()\n\n # Report the transformation results\n self.log.info(f\"{self.input_uuid} --> DataSource: {self.output_uuid} Complete!\")\n
"},{"location":"core_classes/transforms/data_to_features/","title":"Data To Features","text":"API Classes
For most users, the API Classes provide all the general functionality needed to create a full AWS ML Pipeline.
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
MolecularDescriptors: Compute a Feature Set based on RDKit Descriptors
An alternative to using this class is to use the compute_molecular_descriptors function directly:
df_features = compute_molecular_descriptors(df)\nto_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"id\")\nto_features.set_output_tags([\"blah\", \"whatever\"])\nto_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight","title":"DataToFeaturesLight
","text":" Bases: Transform
DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
Common Usage:\nto_features = DataToFeaturesLight(data_uuid, feature_uuid)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_features.transform(id_column=\"id\"/None, event_time_column=\"date\"/None, query=str/None)\n
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
class DataToFeaturesLight(Transform):\n \"\"\"DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas\n\n Common Usage:\n ```python\n to_features = DataToFeaturesLight(data_uuid, feature_uuid)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_features.transform(id_column=\"id\"/None, event_time_column=\"date\"/None, query=str/None)\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n\n def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n # Check if there are any columns that are greater than 64 characters\n for col in self.input_df.columns:\n if len(col) > 64:\n raise ValueError(f\"Column name '{col}' > 64 characters. 
AWS FeatureGroup limits to 64 characters.\")\n\n def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n\n def post_transform(self, id_column, event_time_column=None, one_hot_columns=None, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n\n Args:\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid)\n output_features.set_input(\n self.output_df, id_column=id_column, event_time_column=event_time_column, one_hot_columns=one_hot_columns\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"DataToFeaturesLight Initialization
Parameters:
data_uuid (str, required): The UUID of the SageWorks DataSource to be transformed
feature_uuid (str, required): The UUID of the SageWorks FeatureSet to be created
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"DataToFeaturesLight Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.FEATURE_SET\n self.input_df = None\n self.output_df = None\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.post_transform","title":"post_transform(id_column, event_time_column=None, one_hot_columns=None, **kwargs)
","text":"At this point the output DataFrame should be populated, so publish it as a Feature Set
Parameters:
id_column (str, required): The ID column (must be specified, use \"auto\" for auto-generated IDs).
event_time_column (str, default None): The name of the event time column.
one_hot_columns (list, default None): The list of columns to one-hot encode.
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def post_transform(self, id_column, event_time_column=None, one_hot_columns=None, **kwargs):\n \"\"\"At this point the output DataFrame should be populated, so publish it as a Feature Set\n\n Args:\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n # Now publish to the output location\n output_features = PandasToFeatures(self.output_uuid)\n output_features.set_input(\n self.output_df, id_column=id_column, event_time_column=event_time_column, one_hot_columns=one_hot_columns\n )\n output_features.set_output_tags(self.output_tags)\n output_features.add_output_meta(self.output_meta)\n output_features.transform()\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.pre_transform","title":"pre_transform(query=None, **kwargs)
","text":"Pull the input DataSource into our Input Pandas DataFrame
Parameters:
query (str, default None): Optional query to filter the input DataFrame
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def pre_transform(self, query: str = None, **kwargs):\n \"\"\"Pull the input DataSource into our Input Pandas DataFrame\n Args:\n query(str): Optional query to filter the input DataFrame\n \"\"\"\n\n # Grab the Input (Data Source)\n data_to_pandas = DataToPandas(self.input_uuid)\n data_to_pandas.transform(query=query)\n self.input_df = data_to_pandas.get_output()\n\n # Check if there are any columns that are greater than 64 characters\n for col in self.input_df.columns:\n if len(col) > 64:\n raise ValueError(f\"Column name '{col}' > 64 characters. AWS FeatureGroup limits to 64 characters.\")\n
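pre_transform raises a ValueError on the first column name longer than 64 characters. When preparing a DataFrame ahead of time, it can be handy to collect every offending name in one pass. A minimal sketch of such a pre-flight check follows. The helper too_long_columns and the sample column names are hypothetical and are not part of SageWorks.

```python
MAX_COLUMN_NAME_LENGTH = 64  # limit quoted in the ValueError message above


def too_long_columns(columns, limit=MAX_COLUMN_NAME_LENGTH):
    # Return every column name longer than the limit
    return [col for col in columns if len(col) > limit]


# Hypothetical column names: the last one is 65 characters long
columns = ['id', 'length', 'x' * 65]
print(too_long_columns(columns))
```

With a Pandas DataFrame, the same helper could be called as too_long_columns(df.columns), and the returned names renamed before running the transform.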
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.data_to_features_light.DataToFeaturesLight.transform_impl","title":"transform_impl(**kwargs)
","text":"Transform the input DataFrame into a Feature Set
Source code in src/sageworks/core/transforms/data_to_features/light/data_to_features_light.py
def transform_impl(self, **kwargs):\n \"\"\"Transform the input DataFrame into a Feature Set\"\"\"\n\n # This is a reference implementation that should be overridden by the subclass\n self.output_df = self.input_df\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors","title":"MolecularDescriptors
","text":" Bases: DataToFeaturesLight
MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource
Common Usage:\nto_features = MolecularDescriptors(data_uuid, feature_uuid)\nto_features.set_output_tags([\"aqsol\", \"whatever\"])\nto_features.transform()\n
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
class MolecularDescriptors(DataToFeaturesLight):\n \"\"\"MolecularDescriptors: Create a FeatureSet (RDKit Descriptors) from a DataSource\n\n Common Usage:\n ```python\n to_features = MolecularDescriptors(data_uuid, feature_uuid)\n to_features.set_output_tags([\"aqsol\", \"whatever\"])\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n\n def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Compute/add all the Molecular Descriptors\n self.output_df = compute_molecular_descriptors(self.input_df)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.__init__","title":"__init__(data_uuid, feature_uuid)
","text":"MolecularDescriptors Initialization
Parameters:
data_uuid (str, required): The UUID of the SageWorks DataSource to be transformed
feature_uuid (str, required): The UUID of the SageWorks FeatureSet to be created
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def __init__(self, data_uuid: str, feature_uuid: str):\n \"\"\"MolecularDescriptors Initialization\n\n Args:\n data_uuid (str): The UUID of the SageWorks DataSource to be transformed\n feature_uuid (str): The UUID of the SageWorks FeatureSet to be created\n \"\"\"\n\n # Call superclass init\n super().__init__(data_uuid, feature_uuid)\n
"},{"location":"core_classes/transforms/data_to_features/#sageworks.core.transforms.data_to_features.light.molecular_descriptors.MolecularDescriptors.transform_impl","title":"transform_impl(**kwargs)
","text":"Compute a Feature Set based on RDKit Descriptors
Source code in src/sageworks/core/transforms/data_to_features/light/molecular_descriptors.py
def transform_impl(self, **kwargs):\n \"\"\"Compute a Feature Set based on RDKit Descriptors\"\"\"\n\n # Compute/add all the Molecular Descriptors\n self.output_df = compute_molecular_descriptors(self.input_df)\n
"},{"location":"core_classes/transforms/features_to_model/","title":"Features To Model","text":"API Classes
For most users, the API Classes provide all the general functionality needed to create a full AWS ML Pipeline.
FeaturesToModel: Train/Create a Model from a Feature Set
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel","title":"FeaturesToModel
","text":" Bases: Transform
FeaturesToModel: Train/Create a Model from a FeatureSet
Common Usage:\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\nto_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\nto_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\nto_model.transform(target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"])\n
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
class FeaturesToModel(Transform):\n \"\"\"FeaturesToModel: Train/Create a Model from a FeatureSet\n\n Common Usage:\n ```python\n from sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n to_model = FeaturesToModel(feature_uuid, model_uuid, model_type=ModelType)\n to_model.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n to_model.transform(target_column=\"class_number_of_rings\",\n feature_list=[\"my\", \"best\", \"features\"])\n ```\n \"\"\"\n\n def __init__(\n self,\n feature_uuid: str,\n model_uuid: str,\n model_type: ModelType,\n model_class=None,\n model_import_str=None,\n custom_script=None,\n ):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str, optional): The class of the model (default None)\n model_import_str (str, optional): The import string for the model (default None)\n custom_script (str, optional): Custom script to use for the model (default None)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.model_import_str = model_import_str\n self.custom_script = custom_script\n self.estimator = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n\n def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n ):\n \"\"\"Generic Features to Model: Note you should 
create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n ModelCore.managed_delete(self.output_uuid)\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. 
write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.columns\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n feature_list = [c for c in feature_list if c not in remove_columns]\n\n # Set the final feature list\n self.model_feature_list = feature_list\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Custom Script\n if self.custom_script:\n script_path = self.custom_script\n self.log.info(f\"Custom script path: {script_path}\")\n # Fixme: We'll need to circle back to this later\n copy_imports_to_script_dir(script_path, [\"sageworks.utils.chem_utils\"])\n\n # We're using one of the built-in model script templates\n else:\n # Set up our parameters for the model script\n template_params = {\n \"model_imports\": self.model_import_str,\n \"model_type\": self.model_type,\n \"model_class\": self.model_class,\n \"target_column\": self.target_column,\n
\"feature_list\": self.model_feature_list,\n \"model_metrics_s3_path\": f\"{self.model_training_root}/{self.output_uuid}\",\n \"train_all_data\": train_all_data,\n }\n # Generate our model script\n script_path = generate_model_script(template_params)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.table\n self.class_labels = feature_set.query(f'select DISTINCT {self.target_column} FROM \"{table}\"')[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.important(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Take the full script path and extract 
the entry point and source directory\n entry_point = str(Path(script_path).name)\n source_dir = str(Path(script_path).parent)\n\n # Create a Sagemaker Model with our script\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.estimator = SKLearn(\n entry_point=entry_point,\n source_dir=source_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n image_uri=image,\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.now(timezone.utc).strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto3_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, 
self.model_feature_list)\n\n def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.log.important(f\"Registering model {self.output_uuid} with image {image}...\")\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n image_uri=image,\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
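When no feature_list is passed, transform_impl in the listing above guesses one: it drops the id, index, AWS-generated and target columns, then keeps only columns whose Feature Store type is Integral or Fractional. The sketch below mirrors that filtering on a plain dictionary. It assumes column_details maps each column name to its type string; the function name guess_feature_list and the sample columns are hypothetical.

```python
# Columns that the listing above filters out before guessing features
NON_FEATURE_COLUMNS = [
    'id', '__index_level_0__', 'write_time', 'api_invocation_time',
    'is_deleted', 'event_time', 'training',
]
NUMERIC_TYPES = ['Integral', 'Fractional']


def guess_feature_list(column_details, target_column):
    # column_details: assumed mapping of column name -> Feature Store type
    excluded = NON_FEATURE_COLUMNS + [target_column]
    candidates = [c for c in column_details if c not in excluded]
    # Only Integral and Fractional columns are usable as model features
    return [c for c in candidates if column_details[c] in NUMERIC_TYPES]


# Hypothetical abalone-style columns, for illustration only
details = {
    'id': 'Integral',
    'sex': 'String',
    'length': 'Fractional',
    'diameter': 'Fractional',
    'class_number_of_rings': 'Integral',
    'event_time': 'Fractional',
}
print(guess_feature_list(details, 'class_number_of_rings'))  # ['length', 'diameter']
```

As the warning in the source says, passing an explicit feature_list is the safer choice; a guess like this silently drops String columns such as sex above.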
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.__init__","title":"__init__(feature_uuid, model_uuid, model_type, model_class=None, model_import_str=None, custom_script=None)
","text":"FeaturesToModel Initialization Args: feature_uuid (str): UUID of the FeatureSet to use as input model_uuid (str): UUID of the Model to create as output model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc. model_class (str, optional): The class of the model (default None) model_import_str (str, optional): The import string for the model (default None) custom_script (str, optional): Custom script to use for the model (default None)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def __init__(\n self,\n feature_uuid: str,\n model_uuid: str,\n model_type: ModelType,\n model_class=None,\n model_import_str=None,\n custom_script=None,\n):\n \"\"\"FeaturesToModel Initialization\n Args:\n feature_uuid (str): UUID of the FeatureSet to use as input\n model_uuid (str): UUID of the Model to create as output\n model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.\n model_class (str, optional): The class of the model (default None)\n model_import_str (str, optional): The import string for the model (default None)\n custom_script (str, optional): Custom script to use for the model (default None)\n \"\"\"\n\n # Make sure the model_uuid is a valid name\n Artifact.is_name_valid(model_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(feature_uuid, model_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.MODEL\n self.model_type = model_type\n self.model_class = model_class\n self.model_import_str = model_import_str\n self.custom_script = custom_script\n self.estimator = None\n self.model_description = None\n self.model_training_root = self.models_s3_path + \"/training\"\n self.model_feature_list = None\n self.target_column = None\n self.class_labels = None\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.create_and_register_model","title":"create_and_register_model()
","text":"Create and Register the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def create_and_register_model(self):\n \"\"\"Create and Register the Model\"\"\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create model group (if it doesn't already exist)\n self.sm_client.create_model_package_group(\n ModelPackageGroupName=self.output_uuid,\n ModelPackageGroupDescription=self.model_description,\n Tags=aws_tags,\n )\n\n # Register our model\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.log.important(f\"Registering model {self.output_uuid} with image {image}...\")\n model = self.estimator.create_model(role=self.sageworks_role_arn)\n model.register(\n model_package_group_name=self.output_uuid,\n framework_version=\"1.2.1\",\n image_uri=image,\n content_types=[\"text/csv\"],\n response_types=[\"text/csv\"],\n inference_instances=[\"ml.t2.medium\"],\n transform_instances=[\"ml.m5.large\"],\n approval_status=\"Approved\",\n description=self.model_description,\n )\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() on the Model
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() on the Model\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() on the Model...\")\n\n # Store the model feature_list and target_column in the sageworks_meta\n output_model = ModelCore(self.output_uuid, model_type=self.model_type)\n output_model.upsert_sageworks_meta({\"sageworks_model_features\": self.model_feature_list})\n output_model.upsert_sageworks_meta({\"sageworks_model_target\": self.target_column})\n\n # Store the class labels (if they exist)\n if self.class_labels:\n output_model.set_class_labels(self.class_labels)\n\n # Call the Model onboard method\n output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)\n
"},{"location":"core_classes/transforms/features_to_model/#sageworks.core.transforms.features_to_model.features_to_model.FeaturesToModel.transform_impl","title":"transform_impl(target_column, description=None, feature_list=None, train_all_data=False)
","text":"Generic Features to Model: Note you should create a new class and inherit from this one to include specific logic for your Feature Set/Model Args: target_column (str): Column name of the target variable description (str): Description of the model (optional) feature_list (list[str]): A list of columns for the features (default None, will try to guess) train_all_data (bool): Train on ALL (100%) of the data (default False)
Source code in src/sageworks/core/transforms/features_to_model/features_to_model.py
def transform_impl(\n self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False\n):\n \"\"\"Generic Features to Model: Note you should create a new class and inherit from\n this one to include specific logic for your Feature Set/Model\n Args:\n target_column (str): Column name of the target variable\n description (str): Description of the model (optional)\n feature_list (list[str]): A list of columns for the features (default None, will try to guess)\n train_all_data (bool): Train on ALL (100%) of the data (default False)\n \"\"\"\n # Delete the existing model (if it exists)\n self.log.important(\"Trying to delete existing model...\")\n ModelCore.managed_delete(self.output_uuid)\n\n # Set our model description\n self.model_description = description if description is not None else f\"Model created from {self.input_uuid}\"\n\n # Get our Feature Set and create an S3 CSV Training dataset\n feature_set = FeatureSetCore(self.input_uuid)\n s3_training_path = feature_set.create_s3_training_data()\n self.log.info(f\"Created new training data {s3_training_path}...\")\n\n # Report the target column\n self.target_column = target_column\n self.log.info(f\"Target column: {self.target_column}\")\n\n # Did they specify a feature list?\n if feature_list:\n # AWS Feature Groups will also add these implicit columns, so remove them\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\", \"training\"]\n feature_list = [c for c in feature_list if c not in aws_cols]\n\n # If they didn't specify a feature list, try to guess it\n else:\n # Try to figure out features with this logic\n # - Don't include id, event_time, __index_level_0__, or training columns\n # - Don't include AWS generated columns (e.g. 
write_time, api_invocation_time, is_deleted)\n # - Don't include the target columns\n # - Don't include any columns that are of type string or timestamp\n # - The rest of the columns are assumed to be features\n self.log.warning(\"Guessing at the feature list, HIGHLY SUGGESTED to specify an explicit feature list!\")\n all_columns = feature_set.columns\n filter_list = [\n \"id\",\n \"__index_level_0__\",\n \"write_time\",\n \"api_invocation_time\",\n \"is_deleted\",\n \"event_time\",\n \"training\",\n ] + [self.target_column]\n feature_list = [c for c in all_columns if c not in filter_list]\n\n # AWS Feature Store has 3 user column types (String, Integral, Fractional)\n # and two internal types (Timestamp and Boolean). A Feature List for\n # modeling can only contain Integral and Fractional types.\n remove_columns = []\n column_details = feature_set.column_details()\n for column_name in feature_list:\n if column_details[column_name] not in [\"Integral\", \"Fractional\"]:\n self.log.warning(\n f\"Removing {column_name} from feature list, improper type {column_details[column_name]}\"\n )\n remove_columns.append(column_name)\n\n # Remove the columns that are not Integral or Fractional\n feature_list = [c for c in feature_list if c not in remove_columns]\n\n # Set the final feature list\n self.model_feature_list = feature_list\n self.log.important(f\"Feature List for Modeling: {self.model_feature_list}\")\n\n # Custom Script\n if self.custom_script:\n script_path = self.custom_script\n self.log.info(f\"Custom script path: {script_path}\")\n # Fixme: We'll need to circle back to this later\n copy_imports_to_script_dir(script_path, [\"sageworks.utils.chem_utils\"])\n\n # We're using one of the built-in model script templates\n else:\n # Set up our parameters for the model script\n template_params = {\n \"model_imports\": self.model_import_str,\n \"model_type\": self.model_type,\n \"model_class\": self.model_class,\n \"target_column\": self.target_column,\n 
\"feature_list\": self.model_feature_list,\n \"model_metrics_s3_path\": f\"{self.model_training_root}/{self.output_uuid}\",\n \"train_all_data\": train_all_data,\n }\n # Generate our model script\n script_path = generate_model_script(template_params)\n\n # Metric Definitions for Regression\n if self.model_type == ModelType.REGRESSOR or self.model_type == ModelType.QUANTILE_REGRESSOR:\n metric_definitions = [\n {\"Name\": \"RMSE\", \"Regex\": \"RMSE: ([0-9.]+)\"},\n {\"Name\": \"MAE\", \"Regex\": \"MAE: ([0-9.]+)\"},\n {\"Name\": \"R2\", \"Regex\": \"R2: ([0-9.]+)\"},\n {\"Name\": \"NumRows\", \"Regex\": \"NumRows: ([0-9]+)\"},\n ]\n\n # Metric Definitions for Classification\n elif self.model_type == ModelType.CLASSIFIER:\n # We need to get creative with the Classification Metrics\n\n # Grab all the target column class values (class labels)\n table = feature_set.data_source.table\n self.class_labels = feature_set.query(f'select DISTINCT {self.target_column} FROM \"{table}\"')[\n self.target_column\n ].to_list()\n\n # Sanity check on the targets\n if len(self.class_labels) > 10:\n msg = f\"Too many target classes ({len(self.class_labels)}) for classification, aborting!\"\n self.log.critical(msg)\n raise ValueError(msg)\n\n # Dynamically create the metric definitions\n metrics = [\"precision\", \"recall\", \"fscore\"]\n metric_definitions = []\n for t in self.class_labels:\n for m in metrics:\n metric_definitions.append({\"Name\": f\"Metrics:{t}:{m}\", \"Regex\": f\"Metrics:{t}:{m} ([0-9.]+)\"})\n\n # Add the confusion matrix metrics\n for row in self.class_labels:\n for col in self.class_labels:\n metric_definitions.append(\n {\"Name\": f\"ConfusionMatrix:{row}:{col}\", \"Regex\": f\"ConfusionMatrix:{row}:{col} ([0-9.]+)\"}\n )\n\n # If the model type is UNKNOWN, our metric_definitions will be empty\n else:\n self.log.important(f\"ModelType is {self.model_type}, skipping metric_definitions...\")\n metric_definitions = []\n\n # Take the full script path and extract 
the entry point and source directory\n entry_point = str(Path(script_path).name)\n source_dir = str(Path(script_path).parent)\n\n # Create a Sagemaker Model with our script\n image = InferenceImage.get_image_uri(self.sm_session.boto_region_name, \"sklearn\", \"1.2.1\")\n self.estimator = SKLearn(\n entry_point=entry_point,\n source_dir=source_dir,\n role=self.sageworks_role_arn,\n instance_type=\"ml.m5.large\",\n sagemaker_session=self.sm_session,\n framework_version=\"1.2-1\",\n image_uri=image,\n metric_definitions=metric_definitions,\n )\n\n # Training Job Name based on the Model UUID and today's date\n training_date_time_utc = datetime.now(timezone.utc).strftime(\"%Y-%m-%d-%H-%M\")\n training_job_name = f\"{self.output_uuid}-{training_date_time_utc}\"\n\n # Train the estimator\n self.estimator.fit({\"train\": s3_training_path}, job_name=training_job_name)\n\n # Now delete the training data\n self.log.info(f\"Deleting training data {s3_training_path}...\")\n wr.s3.delete_objects(\n [s3_training_path, s3_training_path.replace(\".csv\", \".csv.metadata\")],\n boto3_session=self.boto3_session,\n )\n\n # Create Model and officially Register\n self.log.important(f\"Creating new model {self.output_uuid}...\")\n self.create_and_register_model()\n
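Each entry in metric_definitions pairs a metric name with a regular expression that is applied to the training log output. A minimal sketch of how the regression patterns behave, using Python's re module; extract_metrics and the sample log text are hypothetical and only illustrate the matching, not how SageMaker runs it internally.

```python
import re

# Same name/regex pairs as the regression branch in transform_impl
metric_definitions = [
    {'Name': 'RMSE', 'Regex': 'RMSE: ([0-9.]+)'},
    {'Name': 'MAE', 'Regex': 'MAE: ([0-9.]+)'},
    {'Name': 'R2', 'Regex': 'R2: ([0-9.]+)'},
    {'Name': 'NumRows', 'Regex': 'NumRows: ([0-9]+)'},
]


def extract_metrics(log_text, definitions):
    # Return {metric name: captured string} for every definition that matches
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_text)
        if match:
            found[definition['Name']] = match.group(1)
    return found


# Hypothetical training output, one metric per line
sample_log = '''RMSE: 2.21
MAE: 1.58
R2: 0.54
NumRows: 835'''

print(extract_metrics(sample_log, metric_definitions))
```

One thing this makes visible: the character class [0-9.] has no minus sign, so with Python's re a line such as R2: -0.5 does not match the R2 pattern at all.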
"},{"location":"core_classes/transforms/features_to_model/#supported-models","title":"Supported Models","text":"Currently SageWorks supports XGBoost (classifier/regressor), and Scikit Learn models. Those models can be created by just specifying different parameters to the FeaturesToModel
class. The main limitation of the supported models is that they are vanilla versions with default parameters; any customization should be done with Custom Models.
from sageworks.api import ModelType\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# XGBoost Regression Model\ninput_uuid = \"abalone_features\"\noutput_uuid = \"abalone-regression\"\nto_model = FeaturesToModel(input_uuid, output_uuid, model_type=ModelType.REGRESSOR)\nto_model.set_output_tags([\"abalone\", \"public\"])\nto_model.transform(target_column=\"class_number_of_rings\", description=\"Abalone Regression\")\n\n# XGBoost Classification Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-classification\"\nto_model = FeaturesToModel(input_uuid, output_uuid, ModelType.CLASSIFIER)\nto_model.set_output_tags([\"wine\", \"public\"])\nto_model.transform(target_column=\"wine_class\", description=\"Wine Classification\")\n\n# Quantile Regression Model (Abalone)\ninput_uuid = \"abalone_features\"\noutput_uuid = \"abalone-quantile-reg\"\nto_model = FeaturesToModel(input_uuid, output_uuid, ModelType.QUANTILE_REGRESSOR)\nto_model.set_output_tags([\"abalone\", \"quantiles\"])\nto_model.transform(target_column=\"class_number_of_rings\", description=\"Abalone Quantile Regression\")\n
"},{"location":"core_classes/transforms/features_to_model/#scikit-learn","title":"Scikit-Learn","text":"from sageworks.api import ModelType\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# Scikit-Learn KMeans Clustering Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-clusters\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"KMeans\", # Clustering algorithm\n model_import_str=\"from sklearn.cluster import KMeans\", # Import statement for KMeans\n model_type=ModelType.CLUSTERER,\n)\nto_model.set_output_tags([\"wine\", \"clustering\"])\nto_model.transform(target_column=None, description=\"Wine Clustering\", train_all_data=True)\n\n# Scikit-Learn HDBSCAN Clustering Model\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-clusters-hdbscan\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"HDBSCAN\", # Density-based clustering algorithm\n model_import_str=\"from sklearn.cluster import HDBSCAN\",\n model_type=ModelType.CLUSTERER,\n)\nto_model.set_output_tags([\"wine\", \"density-based clustering\"])\nto_model.transform(target_column=None, description=\"Wine Clustering with HDBSCAN\", train_all_data=True)\n\n# Scikit-Learn 2D Projection Model using UMAP\ninput_uuid = \"wine_features\"\noutput_uuid = \"wine-2d-projection\"\nto_model = FeaturesToModel(\n input_uuid,\n output_uuid,\n model_class=\"UMAP\",\n model_import_str=\"from umap import UMAP\",\n model_type=ModelType.PROJECTION,\n)\nto_model.set_output_tags([\"wine\", \"2d-projection\"])\nto_model.transform(target_column=None, description=\"Wine 2D Projection\", train_all_data=True)\n
"},{"location":"core_classes/transforms/features_to_model/#custom-models","title":"Custom Models","text":"For custom models we recommend the following steps:
Experimental
SageWorks Custom Models are currently experimental, so have fun but expect issues. Requires sageworks >= 0.8.60. Feel free to submit issues to the SageWorks GitHub.
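A custom script is expected to live in a directory that also holds a requirements.txt (see the note in the example that follows). A small local check before starting a training job might look like the sketch below; check_custom_script is a hypothetical helper, not part of the SageWorks API, and it only verifies that the two files exist.

```python
from pathlib import Path


def check_custom_script(script_path):
    # Hypothetical pre-flight check, not part of the SageWorks API
    script = Path(script_path)
    problems = []
    if not script.is_file():
        problems.append('script file not found: ' + str(script))
    if not (script.parent / 'requirements.txt').is_file():
        problems.append('no requirements.txt next to the script')
    return problems


# An empty list means the layout looks as expected
print(check_custom_script('/full/path/to/my/directory/my_custom_script.py'))
```

Running a check like this locally is cheap compared with discovering a missing file after a training job has been launched.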
from sageworks.api import ModelType\nfrom sageworks.core.transforms.features_to_model.features_to_model import FeaturesToModel\n\n# Note this directory should also have a requirements.txt in it\nmy_custom_script = \"/full/path/to/my/directory/my_custom_script.py\"\ninput_uuid = \"wine_features\" # FeatureSet you want to use\noutput_uuid = \"my-custom-model\" # change to whatever\ntarget_column = \"wine_class\" # change to whatever\nto_model = FeaturesToModel(input_uuid, output_uuid,\n model_type=ModelType.CLASSIFIER,\n custom_script=my_custom_script)\nto_model.set_output_tags([\"your\", \"tags\"])\nto_model.transform(target_column=target_column, description=\"Custom Model\")\n
"},{"location":"core_classes/transforms/features_to_model/#custom-models-create-an-endpointrun-inference","title":"Custom Models: Create an Endpoint/Run Inference","text":"from sageworks.api import FeatureSet, Model, Endpoint\n\nmodel = Model(\"my-custom-model\")\nend = model.to_endpoint() # Note: This takes a while\n\n# Now run inference on my custom model :)\nend.auto_inference(capture=True)\n\n# Run inference with my own dataframe\nfs = FeatureSet(\"wine_features\") # FeatureSet used to train the model\ndf = fs.pull_dataframe() # Or whatever dataframe\nend.inference(df)\n
"},{"location":"core_classes/transforms/features_to_model/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"core_classes/transforms/model_to_endpoint/","title":"Model to Endpoint","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
ModelToEndpoint: Deploy an Endpoint for a Model
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint","title":"ModelToEndpoint
","text":" Bases: Transform
ModelToEndpoint: Deploy an Endpoint for a Model
Common Usage: to_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\nto_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\nto_endpoint.transform()\n
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
class ModelToEndpoint(Transform):\n \"\"\"ModelToEndpoint: Deploy an Endpoint for a Model\n\n Common Usage:\n ```python\n to_endpoint = ModelToEndpoint(model_uuid, endpoint_uuid)\n to_endpoint.set_output_tags([\"aqsol\", \"public\", \"whatever\"])\n to_endpoint.transform()\n ```\n \"\"\"\n\n def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n # Make sure the endpoint_uuid is a valid name\n Artifact.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.serverless = serverless\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n\n def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n EndpointCore.managed_delete(self.output_uuid)\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Deploy the model\n self._deploy_model(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n\n def _deploy_model(self, model_package_arn: str):\n \"\"\"Internal Method: Deploy the Model\n\n Args:\n model_package_arn(str): The Model Package ARN used to deploy the Endpoint\n \"\"\"\n # Grab the specified Model Package\n model_package = ModelPackage(\n 
role=self.sageworks_role_arn,\n model_package_arn=model_package_arn,\n sagemaker_session=self.sm_session,\n )\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Is this a serverless deployment?\n serverless_config = None\n if self.serverless:\n serverless_config = ServerlessInferenceConfig(\n memory_size_in_mb=2048,\n max_concurrency=5,\n )\n\n # Deploy the Endpoint\n self.log.important(f\"Deploying the Endpoint {self.output_uuid}...\")\n model_package.deploy(\n initial_instance_count=1,\n instance_type=self.instance_type,\n serverless_inference_config=serverless_config,\n endpoint_name=self.output_uuid,\n serializer=CSVSerializer(),\n deserializer=CSVDeserializer(),\n tags=aws_tags,\n )\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid)\n output_endpoint.onboard_with_args(input_model=self.input_uuid)\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.__init__","title":"__init__(model_uuid, endpoint_uuid, serverless=True)
","text":"ModelToEndpoint Initialization Args: model_uuid(str): The UUID of the input Model endpoint_uuid(str): The UUID of the output Endpoint serverless(bool): Deploy the Endpoint in serverless mode (default: True)
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def __init__(self, model_uuid: str, endpoint_uuid: str, serverless: bool = True):\n \"\"\"ModelToEndpoint Initialization\n Args:\n model_uuid(str): The UUID of the input Model\n endpoint_uuid(str): The UUID of the output Endpoint\n serverless(bool): Deploy the Endpoint in serverless mode (default: True)\n \"\"\"\n # Make sure the endpoint_uuid is a valid name\n Artifact.is_name_valid(endpoint_uuid, delimiter=\"-\", lower_case=False)\n\n # Call superclass init\n super().__init__(model_uuid, endpoint_uuid)\n\n # Set up all my instance attributes\n self.serverless = serverless\n self.instance_type = \"serverless\" if serverless else \"ml.t2.medium\"\n self.input_type = TransformInput.MODEL\n self.output_type = TransformOutput.ENDPOINT\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the Endpoint
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the Endpoint\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the Endpoint...\")\n\n # Onboard the Endpoint\n output_endpoint = EndpointCore(self.output_uuid)\n output_endpoint.onboard_with_args(input_model=self.input_uuid)\n
"},{"location":"core_classes/transforms/model_to_endpoint/#sageworks.core.transforms.model_to_endpoint.model_to_endpoint.ModelToEndpoint.transform_impl","title":"transform_impl()
","text":"Deploy an Endpoint for a Model
Source code in src/sageworks/core/transforms/model_to_endpoint/model_to_endpoint.py
def transform_impl(self):\n \"\"\"Deploy an Endpoint for a Model\"\"\"\n\n # Delete endpoint (if it already exists)\n EndpointCore.managed_delete(self.output_uuid)\n\n # Get the Model Package ARN for our input model\n input_model = ModelCore(self.input_uuid)\n model_package_arn = input_model.model_package_arn()\n\n # Deploy the model\n self._deploy_model(model_package_arn)\n\n # Add this endpoint to the set of registered endpoints for the model\n input_model.register_endpoint(self.output_uuid)\n\n # This ensures that the endpoint is ready for use\n time.sleep(5) # We wait for AWS Lag\n end = EndpointCore(self.output_uuid)\n self.log.important(f\"Endpoint {end.uuid} is ready for use\")\n
"},{"location":"core_classes/transforms/overview/","title":"Transforms","text":"API Classes
For most users the API Classes will provide all the general functionality to create a full AWS ML Pipeline
SageWorks currently has a large set of Transforms that go from one Artifact type to another (e.g. DataSource to FeatureSet). The Transforms will often have light and heavy versions depending on the scale of data that needs to be transformed.
"},{"location":"core_classes/transforms/overview/#transform-details","title":"Transform Details","text":"API Classes
The API Classes will often provide helpful methods that give you a DataFrame (data_source.query() for instance), so always check out the API Classes first.
These Transforms give you the ultimate in customization and flexibility when creating AWS Machine Learning Pipelines. Grab a Pandas DataFrame from a DataSource or FeatureSet, process it in whatever way suits your use case, and simply create another SageWorks DataSource or FeatureSet from the resulting DataFrame.
Lots of Options:
Not for Large Data
Pandas Transforms can't handle large datasets (> 4 gigabytes). For transforms on large data, see our Heavy Transforms.
Welcome to the SageWorks Pandas Transform Classes
These classes provide low-level APIs for using Pandas DataFrames
DataToPandas
","text":" Bases: Transform
DataToPandas: Class to transform a Data Source into a Pandas DataFrame
Common Usage: data_to_df = DataToPandas(data_source_uuid)\ndata_to_df.transform(query=<optional SQL query to filter/process data>)\ndata_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = data_to_df.get_output()\n\nNote: query is the best way to use this class, so use it :)\n
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
class DataToPandas(Transform):\n \"\"\"DataToPandas: Class to transform a Data Source into a Pandas DataFrame\n\n Common Usage:\n ```python\n data_to_df = DataToPandas(data_source_uuid)\n data_to_df.transform(query=<optional SQL query to filter/process data>)\n data_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = data_to_df.get_output()\n\n Note: query is the best way to use this class, so use it :)\n ```\n \"\"\"\n\n def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n\n def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. 
sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.__init__","title":"__init__(input_uuid)
","text":"DataToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def __init__(self, input_uuid: str):\n \"\"\"DataToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid, \"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.DATA_SOURCE\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.DataToPandas.transform_impl","title":"transform_impl(query=None, max_rows=100000)
","text":"Convert the DataSource into a Pandas DataFrame Args: query(str): The query to run against the DataSource (default: None) max_rows(int): The maximum number of rows to return (default: 100000)
Source code in src/sageworks/core/transforms/pandas_transforms/data_to_pandas.py
def transform_impl(self, query: str = None, max_rows=100000):\n \"\"\"Convert the DataSource into a Pandas DataFrame\n Args:\n query(str): The query to run against the DataSource (default: None)\n max_rows(int): The maximum number of rows to return (default: 100000)\n \"\"\"\n\n # Grab the Input (Data Source)\n input_data = DataSourceFactory(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Data Check on {self.input_uuid} failed!\")\n return\n\n # If a query is provided, that overrides the queries below\n if query:\n self.log.info(f\"Querying {self.input_uuid} with {query}...\")\n self.output_df = input_data.query(query)\n return\n\n # If the data source has more rows than max_rows, do a sample query\n num_rows = input_data.num_rows()\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f\"SELECT * FROM {self.input_uuid} TABLESAMPLE BERNOULLI({percentage})\"\n else:\n query = f\"SELECT * FROM {self.input_uuid}\"\n\n # Mark the transform as complete and set the output DataFrame\n self.output_df = input_data.query(query)\n
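The sampling branch above can be tried in isolation. The sketch below reproduces only the percentage arithmetic and the query string; sample_query and the table names are hypothetical, and no query is actually run.

```python
def sample_query(table, num_rows, max_rows=100000):
    # Small enough: select everything
    if num_rows <= max_rows:
        return f'SELECT * FROM {table}'
    # Otherwise sample roughly max_rows rows; the percentage is rounded to a whole number
    percentage = round(max_rows * 100.0 / num_rows)
    return f'SELECT * FROM {table} TABLESAMPLE BERNOULLI({percentage})'


print(sample_query('abalone_data', 4177))        # no sampling needed
print(sample_query('big_table', 400000))         # BERNOULLI(25)
print(sample_query('huge_table', 50000000))      # BERNOULLI(0)
```

Because the percentage is rounded to an integer, it becomes 0 once the table is roughly 200 times larger than max_rows or more, and a zero-percent sample would be expected to return no rows. For tables of that size, passing an explicit query is the safer route.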
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas","title":"FeaturesToPandas
","text":" Bases: Transform
FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame
Common Usage: feature_to_df = FeaturesToPandas(feature_set_uuid)\nfeature_to_df.transform(max_rows=<optional max rows to sample>)\nmy_df = feature_to_df.get_output()\n
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
class FeaturesToPandas(Transform):\n \"\"\"FeaturesToPandas: Class to transform a FeatureSet into a Pandas DataFrame\n\n Common Usage:\n ```python\n feature_to_df = FeaturesToPandas(feature_set_uuid)\n feature_to_df.transform(max_rows=<optional max rows to sample>)\n my_df = feature_to_df.get_output()\n ```\n \"\"\"\n\n def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n\n def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. 
sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n\n def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.__init__","title":"__init__(feature_set_name)
","text":"FeaturesToPandas Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def __init__(self, feature_set_name: str):\n \"\"\"FeaturesToPandas Initialization\"\"\"\n\n # Call superclass init\n super().__init__(input_uuid=feature_set_name, output_uuid=\"DataFrame\")\n\n # Set up all my instance attributes\n self.input_type = TransformInput.FEATURE_SET\n self.output_type = TransformOutput.PANDAS_DF\n self.output_df = None\n self.transform_run = False\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.get_output","title":"get_output()
","text":"Get the DataFrame Output from this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def get_output(self) -> pd.DataFrame:\n \"\"\"Get the DataFrame Output from this Transform\"\"\"\n if not self.transform_run:\n self.transform()\n return self.output_df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any checks on the Pandas DataFrame that need to be done
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any checks on the Pandas DataFrame that need to be done\"\"\"\n self.log.info(\"Post-Transform: Checking Pandas DataFrame...\")\n self.log.info(f\"DataFrame Shape: {self.output_df.shape}\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.FeaturesToPandas.transform_impl","title":"transform_impl(max_rows=100000)
","text":"Convert the FeatureSet into a Pandas DataFrame
Source code in src/sageworks/core/transforms/pandas_transforms/features_to_pandas.py
def transform_impl(self, max_rows=100000):\n \"\"\"Convert the FeatureSet into a Pandas DataFrame\"\"\"\n\n # Grab the Input (Feature Set)\n input_data = FeatureSetCore(self.input_uuid)\n if not input_data.exists():\n self.log.critical(f\"Feature Set Check on {self.input_uuid} failed!\")\n return\n\n # Grab the table for this Feature Set\n table = input_data.athena_table\n\n # Get the list of columns (and subtract metadata columns that might get added)\n columns = input_data.columns\n filter_columns = [\"write_time\", \"api_invocation_time\", \"is_deleted\"]\n columns = \", \".join([x for x in columns if x not in filter_columns])\n\n # Get the number of rows in the Feature Set\n num_rows = input_data.num_rows()\n\n # If the data source has more rows than max_rows, do a sample query\n if num_rows > max_rows:\n percentage = round(max_rows * 100.0 / num_rows)\n self.log.important(f\"DataSource has {num_rows} rows.. sampling down to {max_rows}...\")\n query = f'SELECT {columns} FROM \"{table}\" TABLESAMPLE BERNOULLI({percentage})'\n else:\n query = f'SELECT {columns} FROM \"{table}\"'\n\n # Mark the transform as complete and set the output DataFrame\n self.transform_run = True\n self.output_df = input_data.query(query)\n
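The sampling arithmetic above can be looked at in isolation. The following is a minimal sketch, not SageWorks code: the helper name sample_percentage is hypothetical, and it only reproduces the round(max_rows * 100.0 / num_rows) step whose result is passed to TABLESAMPLE BERNOULLI. Bernoulli sampling keeps each row independently with the given probability, so the returned DataFrame holds roughly, not exactly, max_rows rows.

```python
def sample_percentage(num_rows: int, max_rows: int = 100000) -> int:
    '''Hypothetical helper: whole-number percentage used for Bernoulli sampling.'''
    return round(max_rows * 100.0 / num_rows)


# A table twice the size of max_rows is sampled at 50 percent
half = sample_percentage(200_000)

# A table ten times the size of max_rows is sampled at 10 percent
tenth = sample_percentage(1_000_000)

# For very large tables the ratio drops below 0.5 and rounds down to 0
tiny = sample_percentage(30_000_000)

print(half, tenth, tiny)
```

Under these assumptions, a Feature Set with roughly 200 times max_rows rows or more yields a percentage of 0, so callers working at that scale may want to pass a larger max_rows.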
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData","title":"PandasToData
","text":" Bases: Transform
PandasToData: Class to publish a Pandas DataFrame as a DataSource
Common Usage: df_to_data = PandasToData(output_uuid)\ndf_to_data.set_output_tags([\"test\", \"small\"])\ndf_to_data.set_input(test_df)\ndf_to_data.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
class PandasToData(Transform):\n \"\"\"PandasToData: Class to publish a Pandas DataFrame as a DataSource\n\n Common Usage:\n ```python\n df_to_data = PandasToData(output_uuid)\n df_to_data.set_output_tags([\"test\", \"small\"])\n df_to_data.set_input(test_df)\n df_to_data.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n\n def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n\n def delete_existing(self):\n # Delete the existing FeatureSet if it exists\n self.log.info(f\"Deleting the {self.output_uuid} DataSource...\")\n AthenaSource.managed_delete(self.output_uuid)\n time.sleep(1)\n\n def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n\n def convert_object_to_datetime(self, df: pd.DataFrame) -> 
pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n\n @staticmethod\n def convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete the existing DataSource if it exists\"\"\"\n self.delete_existing()\n\n def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = 
self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works well for most use cases. 
We recommend using Parquet\n # You can use JSON_EXTRACT on a Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.__init__","title":"__init__(output_uuid, output_format='parquet')
","text":"PandasToData Initialization Args: output_uuid (str): The UUID of the DataSource to create output_format (str): The file format to store the S3 object data in (default: \"parquet\")
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def __init__(self, output_uuid: str, output_format: str = \"parquet\"):\n \"\"\"PandasToData Initialization\n Args:\n output_uuid (str): The UUID of the DataSource to create\n output_format (str): The file format to store the S3 object data in (default: \"parquet\")\n \"\"\"\n\n # Make sure the output_uuid is a valid name/id\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.DATA_SOURCE\n self.output_df = None\n\n # Give a message that Parquet is best in most cases\n if output_format != \"parquet\":\n self.log.warning(\"Parquet format works the best in most cases please consider using it\")\n self.output_format = output_format\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_datetime_columns","title":"convert_datetime_columns(df)
staticmethod
","text":"Convert datetime columns to ISO-8601 string
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
@staticmethod\ndef convert_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert datetime columns to ISO-8601 string\"\"\"\n datetime_type = [\"datetime\", \"datetime64\", \"datetime64[ns]\", \"datetimetz\"]\n for c in df.select_dtypes(include=datetime_type).columns:\n df[c] = df[c].map(datetime_to_iso8601)\n df[c] = df[c].astype(pd.StringDtype())\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_datetime","title":"convert_object_to_datetime(df)
","text":"Try to automatically convert object columns to datetime or string columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to datetime or string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = pd.to_datetime(df[c])\n except (ParserError, ValueError, TypeError):\n self.log.debug(f\"Column {c} could not be converted to datetime...\")\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.convert_object_to_string","title":"convert_object_to_string(df)
","text":"Try to automatically convert object columns to string columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def convert_object_to_string(self, df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Try to automatically convert object columns to string columns\"\"\"\n for c in df.columns[df.dtypes == \"object\"]: # Look at the object columns\n try:\n df[c] = df[c].astype(\"string\")\n df[c] = df[c].str.replace(\"'\", '\"') # This is for nested JSON\n except (ParserError, ValueError, TypeError):\n self.log.info(f\"Column {c} could not be converted to string...\")\n return df\n
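The quote replacement is aimed at nested data: once a dict-like object column is cast to a string, its text typically uses Python-style single quotes, which JSON parsers reject. Here is a minimal standalone sketch of that idea using only the standard library. The record is made up for illustration, and chr(39) and chr(34) stand for the single and double quote characters.

```python
import json

SINGLE_QUOTE = chr(39)
DOUBLE_QUOTE = chr(34)

# A nested value of the kind that ends up in an object column
record = {'a': 1, 'b': {'c': 'x'}}

# The Python string form uses single quotes and is not valid JSON
as_text = str(record)

# Swapping the quote characters makes this simple case parseable as JSON
as_json = as_text.replace(SINGLE_QUOTE, DOUBLE_QUOTE)
parsed = json.loads(as_json)

print(parsed == record)
```

The same replacement also rewrites apostrophes inside values, and Python literals such as True or None are not valid JSON, so this sketch only covers simple cases.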
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Calling onboard() for the DataSource
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Calling onboard() for the DataSource\"\"\"\n self.log.info(\"Post-Transform: Calling onboard() for the DataSource...\")\n\n # Onboard the DataSource\n output_data_source = DataSourceFactory(self.output_uuid)\n output_data_source.onboard()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Delete the existing DataSource if it exists
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete the existing DataSource if it exists\"\"\"\n self.delete_existing()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.set_input","title":"set_input(input_df)
","text":"Set the DataFrame Input for this Transform
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def set_input(self, input_df: pd.DataFrame):\n \"\"\"Set the DataFrame Input for this Transform\"\"\"\n self.output_df = input_df.copy()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToData.transform_impl","title":"transform_impl(overwrite=True, **kwargs)
","text":"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and store the information about the data to the AWS Data Catalog sageworks database
Parameters: overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket (default: True)
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_data.py
def transform_impl(self, overwrite: bool = True, **kwargs):\n \"\"\"Convert the Pandas DataFrame into Parquet Format in the SageWorks S3 Bucket, and\n store the information about the data to the AWS Data Catalog sageworks database\n\n Args:\n overwrite (bool): Overwrite the existing data in the SageWorks S3 Bucket\n \"\"\"\n self.log.info(f\"DataFrame to SageWorks DataSource: {self.output_uuid}...\")\n\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n sageworks_meta.update(self.output_meta)\n\n # Create the Output Parquet file S3 Storage Path\n s3_storage_path = f\"{self.data_sources_s3_path}/{self.output_uuid}\"\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Convert Object Columns to String\n self.output_df = self.convert_object_to_string(self.output_df)\n\n # Note: Both of these conversions may not be necessary, so we're leaving them commented out\n \"\"\"\n # Convert Object Columns to Datetime\n self.output_df = self.convert_object_to_datetime(self.output_df)\n\n # Now convert datetime columns to ISO-8601 string\n # self.output_df = self.convert_datetime_columns(self.output_df)\n \"\"\"\n\n # Write out the DataFrame to AWS Data Catalog in either Parquet or JSONL format\n description = f\"SageWorks data source: {self.output_uuid}\"\n glue_table_settings = {\"description\": description, \"parameters\": sageworks_meta}\n if self.output_format == \"parquet\":\n wr.s3.to_parquet(\n self.output_df,\n path=s3_storage_path,\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n 
partition_cols=None,\n glue_table_settings=glue_table_settings,\n sanitize_columns=False,\n ) # FIXME: Have some logic around partition columns\n\n # Note: In general Parquet works well for most use cases. We recommend using Parquet\n # You can use JSON_EXTRACT on a Parquet string field, and it works great.\n elif self.output_format == \"jsonl\":\n self.log.warning(\"We recommend using Parquet format for most use cases\")\n self.log.warning(\"If you have a use case that requires JSONL please contact SageWorks support\")\n self.log.warning(\"We'd like to understand what functionality JSONL is providing that isn't already\")\n self.log.warning(\"provided with Parquet and JSON_EXTRACT() for your Athena Queries\")\n wr.s3.to_json(\n self.output_df,\n path=s3_storage_path,\n orient=\"records\",\n lines=True,\n date_format=\"iso\",\n dataset=True,\n mode=\"overwrite\",\n database=self.data_catalog_db,\n table=self.output_uuid,\n filename_prefix=f\"{self.output_uuid}_\",\n boto3_session=self.boto3_session,\n partition_cols=None,\n glue_table_settings=glue_table_settings,\n )\n else:\n raise ValueError(f\"Unsupported file format: {self.output_format}\")\n
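The column-lowercasing step in this method can be tried on its own. The snippet below is a small sketch with a made-up DataFrame. It records which names would change, as the original does when it logs them, and then lowercases all column names in one assignment.

```python
import pandas as pd

# Made-up input with mixed-case column names
df = pd.DataFrame({'Id': [1, 2], 'Name': ['a', 'b'], 'value': [0.5, 1.5]})

# Record which names will change (the original logs each of these)
renamed = {c: c.lower() for c in df.columns if c != c.lower()}

# Lowercase every column name in one step
df.columns = df.columns.str.lower()

print(renamed)
print(list(df.columns))
```

Names that differ only by case, such as ID and id, collapse to the same lowercase name and leave the DataFrame with duplicate columns, so it may be worth checking for that before publishing.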
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures","title":"PandasToFeatures
","text":" Bases: Transform
PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet
Common Usage: to_features = PandasToFeatures(output_uuid)\nto_features.set_output_tags([\"my\", \"awesome\", \"data\"])\nto_features.set_input(df, id_column=\"my_id\")\nto_features.transform()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
class PandasToFeatures(Transform):\n \"\"\"PandasToFeatures: Class to publish a Pandas DataFrame into a FeatureSet\n\n Common Usage:\n ```python\n to_features = PandasToFeatures(output_uuid)\n to_features.set_output_tags([\"my\", \"awesome\", \"data\"])\n to_features.set_input(df, id_column=\"my_id\")\n to_features.transform()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str):\n \"\"\"PandasToFeatures Initialization\n\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.id_column = None\n self.event_time_column = None\n self.one_hot_columns = []\n self.categorical_dtypes = {} # Used for streaming/chunking\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n self.incoming_hold_out_ids = None\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n\n def set_input(self, input_df: pd.DataFrame, id_column, event_time_column=None, one_hot_columns=None):\n \"\"\"Set the Input DataFrame for this Transform\n\n Args:\n input_df (pd.DataFrame): The input DataFrame.\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n self.one_hot_columns = one_hot_columns or []\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n\n def delete_existing(self):\n # Delete the existing 
FeatureSet if it exists\n self.log.info(f\"Deleting the {self.output_uuid} FeatureSet...\")\n FeatureSetCore.managed_delete(self.output_uuid)\n time.sleep(1)\n\n def _ensure_id_column(self):\n \"\"\"Internal: AWS Feature Store requires an Id field\"\"\"\n if self.id_column in [\"auto\", \"index\"]:\n self.log.info(\"Generating an 'auto_id' column from the dataframe index..\")\n self.output_df[\"auto_id\"] = self.output_df.index\n return\n if self.id_column not in self.output_df.columns:\n error_msg = f\"Id column {self.id_column} not found in the DataFrame\"\n self.log.critical(error_msg)\n raise ValueError(error_msg)\n\n def _ensure_event_time(self):\n \"\"\"Internal: AWS Feature Store requires an event_time field for all data stored\"\"\"\n if self.event_time_column is None or self.event_time_column not in self.output_df.columns:\n self.log.info(\"Generating an event_time column before FeatureSet Creation...\")\n self.event_time_column = \"event_time\"\n self.output_df[self.event_time_column] = pd.Timestamp(\"now\", tz=\"UTC\")\n\n # The event_time_column is defined, so we need to make sure it's in ISO-8601 string format\n # Note: AWS Feature Store only a particular ISO-8601 format not ALL ISO-8601 formats\n time_column = self.output_df[self.event_time_column]\n\n # Check if the event_time_column is of type object or string convert it to DateTime\n if time_column.dtypes == \"object\" or time_column.dtypes.name == \"string\":\n self.log.info(f\"Converting {self.event_time_column} to DateTime...\")\n time_column = pd.to_datetime(time_column)\n\n # Let's make sure it the right type for Feature Store\n if pd.api.types.is_datetime64_any_dtype(time_column):\n self.log.info(f\"Converting {self.event_time_column} to ISOFormat Date String before FeatureSet Creation...\")\n\n # Convert the datetime DType to ISO-8601 string\n # TableFormat=ICEBERG does not support alternate formats for event_time field, it only supports String type.\n time_column = 
time_column.map(datetime_to_iso8601)\n self.output_df[self.event_time_column] = time_column.astype(\"string\")\n\n def _convert_objs_to_string(self):\n \"\"\"Internal: AWS Feature Store doesn't know how to store object dtypes, so convert to String\"\"\"\n for col in self.output_df:\n if pd.api.types.is_object_dtype(self.output_df[col].dtype):\n self.output_df[col] = self.output_df[col].astype(pd.StringDtype())\n\n def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n\n def shorten_column_name(self, name, max_length=20):\n if len(name) <= max_length:\n return name\n\n # Start building the new name from the end\n parts = name.split(\"_\")[::-1]\n new_name = \"\"\n for part in parts:\n if len(new_name) + len(part) + 1 <= max_length: # +1 for the underscore\n new_name = f\"{part}_{new_name}\" if new_name else part\n else:\n break\n\n # If new_name is empty, just use the last part of the original name\n if not new_name:\n new_name = parts[0]\n\n self.log.info(f\"Shortening {name} to {new_name}\")\n return new_name\n\n def sanitize_column_name(self, name):\n # Remove all invalid characters\n sanitized = re.sub(\"[^a-zA-Z0-9-_]\", \"_\", name)\n sanitized = re.sub(\"_+\", \"_\", sanitized)\n sanitized = sanitized.strip(\"_\")\n\n # Log the change if the name was altered\n if sanitized != name:\n self.log.info(f\"Sanitizing {name} to {sanitized}\")\n\n return sanitized\n\n def one_hot_encode(self, df, one_hot_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns 
with additional column name management\n\n Args:\n df (pd.DataFrame): The DataFrame to process\n one_hot_columns (list): The list of columns to one-hot encode\n\n Returns:\n pd.DataFrame: The DataFrame with one-hot encoded columns\n \"\"\"\n\n # Grab the current list of columns\n current_columns = list(df.columns)\n\n # Now convert the list of columns into Categorical and then One-Hot Encode\n self.convert_columns_to_categorical(one_hot_columns)\n self.log.important(f\"One-Hot encoding columns: {one_hot_columns}\")\n df = pd.get_dummies(df, columns=one_hot_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n self.log.important(f\"New columns generated: {new_columns}\")\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n\n # Helper Methods\n def convert_columns_to_categorical(self, columns: list):\n \"\"\"Convert column to Categorical type\"\"\"\n for feature in columns:\n if feature not in [self.event_time_column, self.id_column]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 10:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n else:\n self.log.warning(f\"Column {feature} too many unique values {unique_values} skipping...\")\n\n def manual_categorical_converter(self):\n \"\"\"Used for Streaming: Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. 
You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n\n @staticmethod\n def convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Select all columns that are of datetime dtype and convert them to ISO-8601 strings\n for column in [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n\n def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Remove any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n self.output_df = self.output_df.drop(columns=aws_cols, errors=\"ignore\")\n\n # If one-hot columns are provided then one-hot encode them\n if self.one_hot_columns:\n self.output_df = self.one_hot_encode(self.output_df, self.one_hot_columns)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if 
str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Check for a training column (SageWorks uses dynamic training columns)\n if \"training\" in self.output_df.columns:\n self.log.important(\n \"\"\"Training column detected: Since FeatureSets are read-only, SageWorks creates a training view\n that can be dynamically changed. We'll use this training column to create a training view.\"\"\"\n )\n self.incoming_hold_out_ids = self.output_df[~self.output_df[\"training\"]][self.id_column].tolist()\n self.output_df = self.output_df.drop(columns=[\"training\"])\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n\n def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n 
table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group\"\"\"\n self.delete_existing()\n self.output_feature_group = self.create_feature_group()\n\n def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(f\"Ingesting rows into Feature Group {self.output_uuid}...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_workers=8, max_processes=2, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater then the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows ingested: {self.expected_rows}\")\n\n # We often need to wait a bit for AWS to fully register the new Feature Group\n 
self.log.important(f\"Waiting for AWS to register the new Feature Group {self.output_uuid}...\")\n time.sleep(30)\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n # Set Hold Out Ids (if we got them during creation)\n if self.incoming_hold_out_ids:\n self.output_feature_set.set_training_holdouts(self.id_column, self.incoming_hold_out_ids)\n\n def ensure_feature_group_created(self, feature_group):\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n while status == \"Creating\":\n self.log.debug(\"FeatureSet being Created...\")\n time.sleep(5)\n status = feature_group.describe().get(\"FeatureGroupStatus\")\n if status == \"Created\":\n self.log.info(f\"FeatureSet {feature_group.name} successfully created\")\n else:\n self.log.critical(f\"FeatureSet {feature_group.name} creation failed with status: {status}\")\n\n def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n max_retry = 20\n num_retry = 0\n sleep_time = 30\n while rows < expected_rows and num_retry < max_retry:\n num_retry += 1\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n 
self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n msg = f\"Did not reach expected rows ({rows}/{expected_rows})...(probably AWS lag)\"\n self.log.warning(msg)\n self.log.monitor(msg)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.__init__","title":"__init__(output_uuid)
","text":"PandasToFeatures Initialization
Parameters:
Name Type Description Default
output_uuid
str
The UUID of the FeatureSet to create
required
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def __init__(self, output_uuid: str):\n \"\"\"PandasToFeatures Initialization\n\n Args:\n output_uuid (str): The UUID of the FeatureSet to create\n \"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.input_type = TransformInput.PANDAS_DF\n self.output_type = TransformOutput.FEATURE_SET\n self.id_column = None\n self.event_time_column = None\n self.one_hot_columns = []\n self.categorical_dtypes = {} # Used for streaming/chunking\n self.output_df = None\n self.table_format = TableFormatEnum.ICEBERG\n self.incoming_hold_out_ids = None\n\n # These will be set in the transform method\n self.output_feature_group = None\n self.output_feature_set = None\n self.expected_rows = 0\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_column_types","title":"convert_column_types(df)
staticmethod
","text":"Convert the types of the DataFrame to the correct types for the Feature Store
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
@staticmethod\ndef convert_column_types(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Convert the types of the DataFrame to the correct types for the Feature Store\"\"\"\n for column in list(df.select_dtypes(include=\"bool\").columns):\n df[column] = df[column].astype(\"int32\")\n for column in list(df.select_dtypes(include=\"category\").columns):\n df[column] = df[column].astype(\"str\")\n\n # Select all columns that are of datetime dtype and convert them to ISO-8601 strings\n for column in [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]:\n df[column] = df[column].map(datetime_to_iso8601).astype(\"string\")\n\n \"\"\"FIXME Not sure we need these conversions\n for column in list(df.select_dtypes(include=\"object\").columns):\n df[column] = df[column].astype(\"string\")\n for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):\n df[column] = df[column].astype(\"int64\")\n for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):\n df[column] = df[column].astype(\"float64\")\n \"\"\"\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.convert_columns_to_categorical","title":"convert_columns_to_categorical(columns)
","text":"Convert column to Categorical type
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def convert_columns_to_categorical(self, columns: list):\n \"\"\"Convert column to Categorical type\"\"\"\n for feature in columns:\n if feature not in [self.event_time_column, self.id_column]:\n unique_values = self.output_df[feature].nunique()\n if 1 < unique_values < 10:\n self.log.important(f\"Converting column {feature} to categorical (unique {unique_values})\")\n self.output_df[feature] = self.output_df[feature].astype(\"category\")\n else:\n self.log.warning(f\"Column {feature} too many unique values {unique_values} skipping...\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.create_feature_group","title":"create_feature_group()
","text":"Create a Feature Group, load our Feature Definitions, and wait for it to be ready
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def create_feature_group(self):\n \"\"\"Create a Feature Group, load our Feature Definitions, and wait for it to be ready\"\"\"\n\n # Create a Feature Group and load our Feature Definitions\n my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)\n my_feature_group.load_feature_definitions(data_frame=self.output_df)\n\n # Create the Output S3 Storage Path for this Feature Set\n s3_storage_path = f\"{self.feature_sets_s3_path}/{self.output_uuid}\"\n\n # Get the metadata/tags to push into AWS\n aws_tags = self.get_aws_tags()\n\n # Create the Feature Group\n my_feature_group.create(\n s3_uri=s3_storage_path,\n record_identifier_name=self.id_column,\n event_time_feature_name=self.event_time_column,\n role_arn=self.sageworks_role_arn,\n enable_online_store=True,\n table_format=self.table_format,\n tags=aws_tags,\n )\n\n # Ensure/wait for the feature group to be created\n self.ensure_feature_group_created(my_feature_group)\n return my_feature_group\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.manual_categorical_converter","title":"manual_categorical_converter()
","text":"Used for Streaming: Convert object and string types to Categorical
Note
This method is used for streaming/chunking. You can set the categorical_dtypes attribute to a dictionary of column names and their respective categorical types.
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def manual_categorical_converter(self):\n \"\"\"Used for Streaming: Convert object and string types to Categorical\n\n Note:\n This method is used for streaming/chunking. You can set the\n categorical_dtypes attribute to a dictionary of column names and\n their respective categorical types.\n \"\"\"\n for column, cat_d_type in self.categorical_dtypes.items():\n self.output_df[column] = self.output_df[column].astype(cat_d_type)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.one_hot_encode","title":"one_hot_encode(df, one_hot_columns)
","text":"One Hot Encoding for Categorical Columns with additional column name management
Parameters:
Name Type Description Default
df
DataFrame
The DataFrame to process
required
one_hot_columns
list
The list of columns to one-hot encode
required
Returns:
Type Description
DataFrame
pd.DataFrame: The DataFrame with one-hot encoded columns
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def one_hot_encode(self, df, one_hot_columns: list) -> pd.DataFrame:\n \"\"\"One Hot Encoding for Categorical Columns with additional column name management\n\n Args:\n df (pd.DataFrame): The DataFrame to process\n one_hot_columns (list): The list of columns to one-hot encode\n\n Returns:\n pd.DataFrame: The DataFrame with one-hot encoded columns\n \"\"\"\n\n # Grab the current list of columns\n current_columns = list(df.columns)\n\n # Now convert the list of columns into Categorical and then One-Hot Encode\n self.convert_columns_to_categorical(one_hot_columns)\n self.log.important(f\"One-Hot encoding columns: {one_hot_columns}\")\n df = pd.get_dummies(df, columns=one_hot_columns)\n\n # Compute the new columns generated by get_dummies\n new_columns = list(set(df.columns) - set(current_columns))\n self.log.important(f\"New columns generated: {new_columns}\")\n\n # Convert new columns to int32\n df[new_columns] = df[new_columns].astype(\"int32\")\n\n # For the new columns we're going to shorten the names\n renamed_columns = {col: self.process_column_name(col) for col in new_columns}\n\n # Rename the columns in the DataFrame\n df.rename(columns=renamed_columns, inplace=True)\n\n return df\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Populating Offline Storage and onboard()
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Populating Offline Storage and onboard()\"\"\"\n self.log.info(\"Post-Transform: Populating Offline Storage and onboard()...\")\n\n # Feature Group Ingestion takes a while, so we need to wait for it to finish\n self.output_feature_set = FeatureSetCore(self.output_uuid)\n self.log.important(\"Waiting for AWS Feature Group Offline storage to be ready...\")\n self.log.important(\"This will often take 10-20 minutes...go have coffee or lunch :)\")\n self.output_feature_set.set_status(\"initializing\")\n self.wait_for_rows(self.expected_rows)\n\n # Call the FeatureSet onboard method to compute a bunch of EDA stuff\n self.output_feature_set.onboard()\n\n # Set Hold Out Ids (if we got them during creation)\n if self.incoming_hold_out_ids:\n self.output_feature_set.set_training_holdouts(self.id_column, self.incoming_hold_out_ids)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Delete any existing FeatureSet and Create the Feature Group\"\"\"\n self.delete_existing()\n self.output_feature_group = self.create_feature_group()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.prep_dataframe","title":"prep_dataframe()
","text":"Prep the DataFrame for Feature Store Creation
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def prep_dataframe(self):\n \"\"\"Prep the DataFrame for Feature Store Creation\"\"\"\n self.log.info(\"Prep the output_df (cat_convert, convert types, and lowercase columns)...\")\n\n # Remove any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n self.output_df = self.output_df.drop(columns=aws_cols, errors=\"ignore\")\n\n # If one-hot columns are provided then one-hot encode them\n if self.one_hot_columns:\n self.output_df = self.one_hot_encode(self.output_df, self.one_hot_columns)\n\n # Convert columns names to lowercase, Athena will not work with uppercase column names\n if str(self.output_df.columns) != str(self.output_df.columns.str.lower()):\n for c in self.output_df.columns:\n if c != c.lower():\n self.log.important(f\"Column name {c} converted to lowercase: {c.lower()}\")\n self.output_df.columns = self.output_df.columns.str.lower()\n\n # Make sure we have the required id and event_time columns\n self._ensure_id_column()\n self._ensure_event_time()\n\n # Check for a training column (SageWorks uses dynamic training columns)\n if \"training\" in self.output_df.columns:\n self.log.important(\n \"\"\"Training column detected: Since FeatureSets are read-only, SageWorks creates a training view\n that can be dynamically changed. We'll use this training column to create a training view.\"\"\"\n )\n self.incoming_hold_out_ids = self.output_df[~self.output_df[\"training\"]][self.id_column].tolist()\n self.output_df = self.output_df.drop(columns=[\"training\"])\n\n # We need to convert some of our column types to the correct types\n # Feature Store only supports these data types:\n # - Integral\n # - Fractional\n # - String (timestamp/datetime types need to be converted to string)\n self.output_df = self.convert_column_types(self.output_df)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.process_column_name","title":"process_column_name(column, shorten=False)
","text":"Call various methods to make sure the column name is ready for the Feature Store.
Args:
column (str): The column name to process.
shorten (bool): Whether to shorten the column name (default: False).
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def process_column_name(self, column: str, shorten: bool = False) -> str:\n \"\"\"Call various methods to make sure the column is ready for Feature Store\n Args:\n column (str): The column name to process\n shorten (bool): Should we shorten the column name? (default: False)\n \"\"\"\n self.log.debug(f\"Processing column {column}...\")\n\n # Make sure the column name is valid\n column = self.sanitize_column_name(column)\n\n # Make sure the column name isn't too long\n if shorten:\n column = self.shorten_column_name(column)\n\n return column\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.set_input","title":"set_input(input_df, id_column, event_time_column=None, one_hot_columns=None)
","text":"Set the Input DataFrame for this Transform
Parameters:
Name Type Description Default
input_df
DataFrame
The input DataFrame.
required
id_column
str
The ID column (must be specified, use \"auto\" for auto-generated IDs).
required
event_time_column
str
The name of the event time column (default: None).
None
one_hot_columns
list
The list of columns to one-hot encode (default: None).
None
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def set_input(self, input_df: pd.DataFrame, id_column, event_time_column=None, one_hot_columns=None):\n \"\"\"Set the Input DataFrame for this Transform\n\n Args:\n input_df (pd.DataFrame): The input DataFrame.\n id_column (str): The ID column (must be specified, use \"auto\" for auto-generated IDs).\n event_time_column (str, optional): The name of the event time column (default: None).\n one_hot_columns (list, optional): The list of columns to one-hot encode (default: None).\n \"\"\"\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.output_df = input_df.copy()\n self.one_hot_columns = one_hot_columns or []\n\n # Now Prepare the DataFrame for its journey into an AWS FeatureGroup\n self.prep_dataframe()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.transform_impl","title":"transform_impl()
","text":"Transform Implementation: Ingest the data into the Feature Group
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def transform_impl(self):\n \"\"\"Transform Implementation: Ingest the data into the Feature Group\"\"\"\n\n # Now we actually push the data into the Feature Group (called ingestion)\n self.log.important(f\"Ingesting rows into Feature Group {self.output_uuid}...\")\n ingest_manager = self.output_feature_group.ingest(self.output_df, max_workers=8, max_processes=2, wait=False)\n try:\n ingest_manager.wait()\n except IngestionError as exc:\n self.log.warning(f\"Some rows had an ingesting error: {exc}\")\n\n # Report on any rows that failed to ingest\n if ingest_manager.failed_rows:\n self.log.warning(f\"Number of Failed Rows: {len(ingest_manager.failed_rows)}\")\n\n # FIXME: This may or may not give us the correct rows\n # If any index is greater than or equal to the number of rows, then the index needs\n # to be converted to a relative index in our current output_df\n df_rows = len(self.output_df)\n relative_indexes = [idx - df_rows if idx >= df_rows else idx for idx in ingest_manager.failed_rows]\n failed_data = self.output_df.iloc[relative_indexes]\n for idx, row in failed_data.iterrows():\n self.log.warning(f\"Failed Row {idx}: {row.to_dict()}\")\n\n # Keep track of the number of rows we expect to be ingested\n self.expected_rows += len(self.output_df) - len(ingest_manager.failed_rows)\n self.log.info(f\"Added rows: {len(self.output_df)}\")\n self.log.info(f\"Failed rows: {len(ingest_manager.failed_rows)}\")\n self.log.info(f\"Total rows ingested: {self.expected_rows}\")\n\n # We often need to wait a bit for AWS to fully register the new Feature Group\n self.log.important(f\"Waiting for AWS to register the new Feature Group {self.output_uuid}...\")\n time.sleep(30)\n
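The FIXME in transform_impl() is about how the failed row indexes reported by the ingest manager map back onto the current chunk. The snippet below is a minimal sketch of that comprehension on plain Python integers; the helper name to_relative_indexes is hypothetical and is not part of SageWorks. As the FIXME warns, the shift is only right when an index overshoots by less than one chunk length.

```python
def to_relative_indexes(failed_rows, df_rows):
    # Hypothetical helper mirroring the comprehension in transform_impl():
    # any index at or beyond the chunk length is shifted down by one chunk length.
    return [idx - df_rows if idx >= df_rows else idx for idx in failed_rows]


# A chunk of 100 rows: 3 is already relative, 105 and 199 are shifted down
relative = to_relative_indexes([3, 105, 199], df_rows=100)
print(relative)

# An index two chunk lengths out is still out of range after the shift
still_out_of_range = to_relative_indexes([250], df_rows=100)
print(still_out_of_range)
```

The second call shows why the source marks this as a FIXME: a single subtraction cannot bring every absolute index back into the current chunk.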
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeatures.wait_for_rows","title":"wait_for_rows(expected_rows)
","text":"Wait for AWS Feature Group to fully populate the Offline Storage
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features.py
def wait_for_rows(self, expected_rows: int):\n \"\"\"Wait for AWS Feature Group to fully populate the Offline Storage\"\"\"\n rows = self.output_feature_set.num_rows()\n\n # Wait for the rows to be populated\n self.log.info(f\"Waiting for AWS Feature Group {self.output_uuid} Offline Storage...\")\n max_retry = 20\n num_retry = 0\n sleep_time = 30\n while rows < expected_rows and num_retry < max_retry:\n num_retry += 1\n time.sleep(sleep_time)\n rows = self.output_feature_set.num_rows()\n self.log.info(f\"Offline Storage {self.output_uuid}: {rows} rows out of {expected_rows}\")\n if rows == expected_rows:\n self.log.important(f\"Success: Reached Expected Rows ({rows} rows)...\")\n else:\n msg = f\"Did not reach expected rows ({rows}/{expected_rows})...(probably AWS lag)\"\n self.log.warning(msg)\n self.log.monitor(msg)\n
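The bounded polling loop in wait_for_rows() can be tried without AWS by injecting the row counter and the sleep function. This is a sketch under stated assumptions: wait_for_count, get_count, and the fake counter are hypothetical names, and only the loop structure (20 retries, 30 second sleeps by default) comes from the source above.

```python
import time


def wait_for_count(get_count, expected, max_retry=20, sleep_time=30, sleep=time.sleep):
    # Poll get_count() until it reaches expected or the retries run out.
    # get_count and sleep are injected so the loop can run without AWS.
    count = get_count()
    num_retry = 0
    while count < expected and num_retry < max_retry:
        num_retry += 1
        sleep(sleep_time)
        count = get_count()
    return count, num_retry


# Simulated offline store that gains 40 rows per poll, capped at 100
state = {'rows': 0}


def fake_num_rows():
    state['rows'] = min(state['rows'] + 40, 100)
    return state['rows']


rows, retries = wait_for_count(fake_num_rows, expected=100, sleep=lambda seconds: None)
print(rows, retries)

# A store that never fills up: the loop gives up after max_retry polls
timed_out = wait_for_count(lambda: 0, expected=5, max_retry=3, sleep=lambda seconds: None)
print(timed_out)
```

Note that wait_for_rows() reports success only on an exact match, so a row count above the expected value is also logged as a miss.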
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked","title":"PandasToFeaturesChunked
","text":" Bases: Transform
PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet
Common Usage
to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\nto_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\ncat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\nto_features.set_categorical_info(cat_column_info)\nto_features.add_chunk(df)\nto_features.add_chunk(df)\n...\nto_features.finalize()\n
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
class PandasToFeaturesChunked(Transform):\n \"\"\"PandasToFeaturesChunked: Class to manage a bunch of chunked Pandas DataFrames into a FeatureSet\n\n Common Usage:\n ```python\n to_features = PandasToFeaturesChunked(output_uuid, id_column=\"id\"/None, event_time_column=\"date\"/None)\n to_features.set_output_tags([\"abalone\", \"public\", \"whatever\"])\n cat_column_info = {\"sex\": [\"M\", \"F\", \"I\"]}\n to_features.set_categorical_info(cat_column_info)\n to_features.add_chunk(df)\n to_features.add_chunk(df)\n ...\n to_features.finalize()\n ```\n \"\"\"\n\n def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid)\n\n def set_categorical_info(self, cat_column_info: dict[list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n\n def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? 
If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n\n def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n\n def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.__init__","title":"__init__(output_uuid, id_column=None, event_time_column=None)
","text":"PandasToFeaturesChunked Initialization
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def __init__(self, output_uuid: str, id_column=None, event_time_column=None):\n \"\"\"PandasToFeaturesChunked Initialization\"\"\"\n\n # Make sure the output_uuid is a valid name\n Artifact.is_name_valid(output_uuid)\n\n # Call superclass init\n super().__init__(\"DataFrame\", output_uuid)\n\n # Set up all my instance attributes\n self.id_column = id_column\n self.event_time_column = event_time_column\n self.first_chunk = None\n self.pandas_to_features = PandasToFeatures(output_uuid)\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.add_chunk","title":"add_chunk(chunk_df)
","text":"Add a Chunk of Data to the FeatureSet
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def add_chunk(self, chunk_df: pd.DataFrame):\n \"\"\"Add a Chunk of Data to the FeatureSet\"\"\"\n\n # Is this the first chunk? If so we need to run the pre_transform\n if self.first_chunk is None:\n self.log.info(f\"Adding first chunk {chunk_df.shape}...\")\n self.first_chunk = chunk_df\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.pre_transform()\n self.pandas_to_features.transform_impl()\n else:\n self.log.info(f\"Adding chunk {chunk_df.shape}...\")\n self.pandas_to_features.set_input(chunk_df, self.id_column, self.event_time_column)\n self.pandas_to_features.transform_impl()\n
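The first-chunk branch in add_chunk() means the Feature Group is created once and every chunk, including the first, is then ingested. A minimal sketch of that control flow follows; ChunkedSketch is a hypothetical stand-in that records call names instead of touching AWS.

```python
class ChunkedSketch:
    # Hypothetical stand-in for PandasToFeaturesChunked: records calls only
    def __init__(self):
        self.first_chunk = None
        self.calls = []

    def add_chunk(self, chunk):
        # Only the first chunk triggers pre_transform (delete and create the group)
        if self.first_chunk is None:
            self.first_chunk = chunk
            self.calls.append('pre_transform')
        # Every chunk is ingested
        self.calls.append('transform_impl')


sketch = ChunkedSketch()
for chunk in ([1, 2], [3, 4], [5, 6]):
    sketch.add_chunk(chunk)
print(sketch.calls)
```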
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.post_transform","title":"post_transform(**kwargs)
","text":"Post-Transform: Any Post Transform Steps
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def post_transform(self, **kwargs):\n \"\"\"Post-Transform: Any Post Transform Steps\"\"\"\n self.pandas_to_features.post_transform()\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.pre_transform","title":"pre_transform(**kwargs)
","text":"Pre-Transform: Create the Feature Group with Chunked Data
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def pre_transform(self, **kwargs):\n \"\"\"Pre-Transform: Create the Feature Group with Chunked Data\"\"\"\n\n # Loading data into a Feature Group takes a while, so set status to loading\n FeatureSetCore(self.output_uuid).set_status(\"loading\")\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.set_categorical_info","title":"set_categorical_info(cat_column_info)
","text":"Set the Categorical Columns
Args:
cat_column_info (dict[str, list[str]]): Dictionary of categorical columns and their possible values
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def set_categorical_info(self, cat_column_info: dict[str, list[str]]):\n \"\"\"Set the Categorical Columns\n Args:\n cat_column_info (dict[str, list[str]]): Dictionary of categorical columns and their possible values\n \"\"\"\n\n # Create the CategoricalDtypes\n cat_d_types = {}\n for col, vals in cat_column_info.items():\n cat_d_types[col] = CategoricalDtype(categories=vals)\n\n # Now set the CategoricalDtypes on our underlying PandasToFeatures\n self.pandas_to_features.categorical_dtypes = cat_d_types\n
"},{"location":"core_classes/transforms/pandas_transforms/#sageworks.core.transforms.pandas_transforms.PandasToFeaturesChunked.transform_impl","title":"transform_impl()
","text":"Required implementation of the Transform interface
Source code in src/sageworks/core/transforms/pandas_transforms/pandas_to_features_chunked.py
def transform_impl(self):\n \"\"\"Required implementation of the Transform interface\"\"\"\n self.log.warning(\"PandasToFeaturesChunked.transform_impl() called. This is a no-op.\")\n
"},{"location":"core_classes/transforms/transform/","title":"Transform","text":"API Classes
The API Classes use Transforms internally; for example, model.to_endpoint() uses the ModelToEndpoint() transform. If you need more control over the Transform, you can use the Core Classes directly.
The SageWorks Transform class is a base/abstract class that defines the API implemented by all the child classes (DataLoaders, DataSourceToFeatureSet, ModelToEndpoint, etc).
Transform: Base Class for all transforms within SageWorks. Inherited classes must implement the abstract transform_impl() method.
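The base class follows a template-method layout: transform() fixes the call order and subclasses fill in the steps. The sketch below illustrates that layout only; SketchTransform and UpperCase are hypothetical classes with no AWS setup, not the SageWorks implementation.

```python
from abc import ABC, abstractmethod


class SketchTransform(ABC):
    # Hypothetical stand-in for the Transform base class (no AWS setup)
    def __init__(self):
        self.calls = []

    def pre_transform(self, **kwargs):
        # Optional hook with a default implementation
        self.calls.append('pre')

    @abstractmethod
    def transform_impl(self, **kwargs):
        ...

    @abstractmethod
    def post_transform(self, **kwargs):
        ...

    def transform(self, **kwargs):
        # Fixed order: pre_transform, transform_impl, post_transform
        self.pre_transform(**kwargs)
        self.transform_impl(**kwargs)
        self.post_transform(**kwargs)


class UpperCase(SketchTransform):
    # Toy subclass: the whole transformation is upper-casing a string
    def __init__(self, text):
        super().__init__()
        self.text = text
        self.result = None

    def transform_impl(self, **kwargs):
        self.calls.append('impl')
        self.result = self.text.upper()

    def post_transform(self, **kwargs):
        self.calls.append('post')


t = UpperCase('abalone')
t.transform()
print(t.calls, t.result)
```

Callers only invoke transform(); a subclass that leaves an abstract method unimplemented cannot be instantiated.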
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform","title":"Transform
","text":" Bases: ABC
Transform: Abstract Base Class for all transforms within SageWorks. Inherited Classes must implement the abstract transform_impl() method
Source code in src/sageworks/core/transforms/transform.py
class Transform(ABC):\n \"\"\"Transform: Abstract Base Class for all transforms within SageWorks. Inherited Classes\n must implement the abstract transform_impl() method\"\"\"\n\n def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.aws_session.get_sageworks_execution_role_arn()\n self.boto3_session = self.aws_account_clamp.boto3_session\n self.sm_session = self.aws_account_clamp.sagemaker_session()\n self.sm_client = self.aws_account_clamp.sagemaker_client()\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n\n @abstractmethod\n def transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n\n def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform 
operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n\n @abstractmethod\n def post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n\n def set_output_tags(self, tags: Union[list, str]):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (Union[list, str]): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n\n def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n\n @staticmethod\n def convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n\n def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n\n @final\n def transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n\n def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n\n def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n\n def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = 
input_uuid\n\n def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.__init__","title":"__init__(input_uuid, output_uuid)
","text":"Transform Initialization
Source code in src/sageworks/core/transforms/transform.py
def __init__(self, input_uuid: str, output_uuid: str):\n \"\"\"Transform Initialization\"\"\"\n\n self.log = logging.getLogger(\"sageworks\")\n self.input_type = None\n self.output_type = None\n self.output_tags = \"\"\n self.input_uuid = str(input_uuid) # Occasionally we get a pathlib.Path object\n self.output_uuid = str(output_uuid) # Occasionally we get a pathlib.Path object\n self.output_meta = {\"sageworks_input\": self.input_uuid}\n self.data_catalog_db = \"sageworks\"\n\n # Grab our SageWorks Bucket\n cm = ConfigManager()\n if not cm.config_okay():\n self.log.error(\"SageWorks Configuration Incomplete...\")\n self.log.error(\"Run the 'sageworks' command and follow the prompts...\")\n raise FatalConfigError()\n self.sageworks_bucket = cm.get_config(\"SAGEWORKS_BUCKET\")\n self.data_sources_s3_path = \"s3://\" + self.sageworks_bucket + \"/data-sources\"\n self.feature_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/feature-sets\"\n self.models_s3_path = \"s3://\" + self.sageworks_bucket + \"/models\"\n self.endpoints_sets_s3_path = \"s3://\" + self.sageworks_bucket + \"/endpoints\"\n\n # Grab a SageWorks Role ARN, Boto3, SageMaker Session, and SageMaker Client\n self.aws_account_clamp = AWSAccountClamp()\n self.sageworks_role_arn = self.aws_account_clamp.aws_session.get_sageworks_execution_role_arn()\n self.boto3_session = self.aws_account_clamp.boto3_session\n self.sm_session = self.aws_account_clamp.sagemaker_session()\n self.sm_client = self.aws_account_clamp.sagemaker_client()\n\n # Delimiter for storing lists in AWS Tags\n self.tag_delimiter = \"::\"\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.add_output_meta","title":"add_output_meta(meta)
","text":"Add additional metadata that will be associated with the output artifact Args: meta (dict): A dictionary of metadata
Source code in src/sageworks/core/transforms/transform.py
def add_output_meta(self, meta: dict):\n \"\"\"Add additional metadata that will be associated with the output artifact\n Args:\n meta (dict): A dictionary of metadata\"\"\"\n self.output_meta = self.output_meta | meta\n
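As a minimal sketch of the merge semantics used by add_output_meta(), assuming Python 3.9 or later (where the dict union operator | is available): keys in the new metadata override existing keys with the same name. The key names and values below are hypothetical and only for illustration.

```python
# Hypothetical metadata values, for illustration only
output_meta = {'sageworks_input': 'my_input', 'owner': 'alice'}
new_meta = {'owner': 'bob', 'stage': 'test'}

# The dict union operator returns a new dict; on duplicate keys the right-hand side wins
merged = output_meta | new_meta
print(merged)
```

The original dictionary is left unchanged by the union; add_output_meta() rebinds self.output_meta to the merged result.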
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.convert_to_aws_tags","title":"convert_to_aws_tags(metadata)
staticmethod
","text":"Convert a dictionary to the AWS tag format (list of dicts) [ {Key: key_name, Value: value}, {..}, ...]
Source code in src/sageworks/core/transforms/transform.py
@staticmethod\ndef convert_to_aws_tags(metadata: dict):\n \"\"\"Convert a dictionary to the AWS tag format (list of dicts)\n [ {Key: key_name, Value: value}, {..}, ...]\"\"\"\n return [{\"Key\": key, \"Value\": value} for key, value in metadata.items()]\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.get_aws_tags","title":"get_aws_tags()
","text":"Get the metadata/tags and convert them into AWS Tag Format
Source code in src/sageworks/core/transforms/transform.py
def get_aws_tags(self):\n \"\"\"Get the metadata/tags and convert them into AWS Tag Format\"\"\"\n # Set up our metadata storage\n sageworks_meta = {\"sageworks_tags\": self.output_tags}\n for key, value in self.output_meta.items():\n sageworks_meta[key] = value\n aws_tags = self.convert_to_aws_tags(sageworks_meta)\n return aws_tags\n
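The tag handling above can be sketched outside of the class. This is a standalone illustration, not the SageWorks API itself: join_tags is a hypothetical helper that mirrors set_output_tags(), and the tag and input names are made up.

```python
TAG_DELIMITER = '::'  # mirrors self.tag_delimiter in Transform


def join_tags(tags):
    # Accept a list of tags or an already-delimited string (as set_output_tags does)
    return TAG_DELIMITER.join(tags) if isinstance(tags, list) else tags


def convert_to_aws_tags(metadata):
    # AWS tag format: a list of {'Key': ..., 'Value': ...} dicts
    return [{'Key': key, 'Value': value} for key, value in metadata.items()]


# Hypothetical tag and input names, for illustration only
output_tags = join_tags(['abalone', 'public'])
sageworks_meta = {'sageworks_tags': output_tags, 'sageworks_input': 'abalone_data'}
aws_tags = convert_to_aws_tags(sageworks_meta)
print(aws_tags)
```

Because AWS tag values are plain strings, a list of tags is stored as a single delimited string under the sageworks_tags key.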
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.input_type","title":"input_type()
","text":"What Input Type does this Transform Consume
Source code in src/sageworks/core/transforms/transform.py
def input_type(self) -> TransformInput:\n \"\"\"What Input Type does this Transform Consume\"\"\"\n return self.input_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.output_type","title":"output_type()
","text":"What Output Type does this Transform Produce
Source code in src/sageworks/core/transforms/transform.py
def output_type(self) -> TransformOutput:\n \"\"\"What Output Type does this Transform Produce\"\"\"\n return self.output_type\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.post_transform","title":"post_transform(**kwargs)
abstractmethod
","text":"Post-Transform ensures that the output Artifact is ready for use
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef post_transform(self, **kwargs):\n \"\"\"Post-Transform ensures that the output Artifact is ready for use\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.pre_transform","title":"pre_transform(**kwargs)
","text":"Perform any Pre-Transform operations
Source code in src/sageworks/core/transforms/transform.py
def pre_transform(self, **kwargs):\n \"\"\"Perform any Pre-Transform operations\"\"\"\n self.log.debug(\"Pre-Transform...\")\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_input_uuid","title":"set_input_uuid(input_uuid)
","text":"Set the Input UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_input_uuid(self, input_uuid: str):\n \"\"\"Set the Input UUID (Name) for this Transform\"\"\"\n self.input_uuid = input_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_tags","title":"set_output_tags(tags)
","text":"Set the tags that will be associated with the output object Args: tags (Union[list, str]): The list of tags or a '::' separated string of tags
Source code in src/sageworks/core/transforms/transform.py
def set_output_tags(self, tags: Union[list, str]):\n \"\"\"Set the tags that will be associated with the output object\n Args:\n tags (Union[list, str]): The list of tags or a '::' separated string of tags\"\"\"\n if isinstance(tags, list):\n self.output_tags = self.tag_delimiter.join(tags)\n else:\n self.output_tags = tags\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.set_output_uuid","title":"set_output_uuid(output_uuid)
","text":"Set the Output UUID (Name) for this Transform
Source code in src/sageworks/core/transforms/transform.py
def set_output_uuid(self, output_uuid: str):\n \"\"\"Set the Output UUID (Name) for this Transform\"\"\"\n self.output_uuid = output_uuid\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform","title":"transform(**kwargs)
","text":"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations
Source code in src/sageworks/core/transforms/transform.py
@final\ndef transform(self, **kwargs):\n \"\"\"Perform the Transformation from Input to Output with pre_transform() and post_transform() invocations\"\"\"\n self.pre_transform(**kwargs)\n self.transform_impl(**kwargs)\n self.post_transform(**kwargs)\n
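The transform() method follows a template-method pattern: subclasses supply transform_impl() and post_transform(), while transform() fixes the call order. The sketch below illustrates this with a stand-in base class that leaves out all AWS setup; SketchTransform and UpperCase are hypothetical names, not SageWorks classes.

```python
from abc import ABC, abstractmethod
from typing import final


class SketchTransform(ABC):
    # Stand-in for Transform with the AWS setup removed (hypothetical class)
    def __init__(self):
        self.calls = []

    def pre_transform(self, **kwargs):
        self.calls.append('pre_transform')

    @abstractmethod
    def transform_impl(self, **kwargs):
        pass

    @abstractmethod
    def post_transform(self, **kwargs):
        pass

    @final
    def transform(self, **kwargs):
        # Fixed call order: pre, impl, post
        self.pre_transform(**kwargs)
        self.transform_impl(**kwargs)
        self.post_transform(**kwargs)


class UpperCase(SketchTransform):
    # Hypothetical subclass that implements both abstract methods
    def transform_impl(self, text='', **kwargs):
        self.calls.append('transform_impl')
        self.result = text.upper()

    def post_transform(self, **kwargs):
        self.calls.append('post_transform')


t = UpperCase()
t.transform(text='hello')
print(t.calls, t.result)
```

Note that typing.final is a hint for static type checkers and is not enforced at runtime, whereas the abstractmethod decorators do prevent instantiating a subclass that omits either method.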
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.Transform.transform_impl","title":"transform_impl(**kwargs)
abstractmethod
","text":"Abstract Method: Implement the Transformation from Input to Output
Source code in src/sageworks/core/transforms/transform.py
@abstractmethod\ndef transform_impl(self, **kwargs):\n \"\"\"Abstract Method: Implement the Transformation from Input to Output\"\"\"\n pass\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformInput","title":"TransformInput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Inputs
Source code in src/sageworks/core/transforms/transform.py
class TransformInput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Inputs\"\"\"\n\n LOCAL_FILE = auto()\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n
"},{"location":"core_classes/transforms/transform/#sageworks.core.transforms.transform.TransformOutput","title":"TransformOutput
","text":" Bases: Enum
Enumerated Types for SageWorks Transform Outputs
Source code in src/sageworks/core/transforms/transform.py
class TransformOutput(Enum):\n \"\"\"Enumerated Types for SageWorks Transform Outputs\"\"\"\n\n PANDAS_DF = auto()\n SPARK_DF = auto()\n S3_OBJECT = auto()\n DATA_SOURCE = auto()\n FEATURE_SET = auto()\n MODEL = auto()\n ENDPOINT = auto()\n
"},{"location":"core_classes/views/computation_view/","title":"Computation View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
Note: This class can be invoked automatically via set_computation_columns() on a DataSource or FeatureSet. If you need more control, you can use this class directly.
ComputationView Class: Create a View with a subset of columns for computation purposes
"},{"location":"core_classes/views/computation_view/#sageworks.core.views.computation_view.ComputationView","title":"ComputationView
","text":" Bases: ColumnSubsetView
ComputationView Class: Create a View with a subset of columns for computation purposes
Common Usage:\n# Create a default ComputationView\nfs = FeatureSet(\"test_features\")\ncomp_view = ComputationView.create(fs)\ndf = comp_view.pull_dataframe()\n\n# Create a ComputationView with a specific set of columns\ncomp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n
Source code in src/sageworks/core/views/computation_view.py
class ComputationView(ColumnSubsetView):\n \"\"\"ComputationView Class: Create a View with a subset of columns for computation purposes\n\n Common Usage:\n ```python\n # Create a default ComputationView\n fs = FeatureSet(\"test_features\")\n comp_view = ComputationView.create(fs)\n df = comp_view.pull_dataframe()\n\n # Create a ComputationView with a specific set of columns\n comp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/computation_view/#sageworks.core.views.computation_view.ComputationView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a ComputationView instance.
Parameters:
Name | Type | Description | Default
artifact | Union[DataSource, FeatureSet] | The DataSource or FeatureSet object | required
source_table | str | The table/view to create the view from. Defaults to None. | None
column_list | Union[list[str], None] | A list of columns to include. Defaults to None. | None
column_limit | int | The max number of columns to include. Defaults to 30. | 30
Returns:
Type | Description
Union[View, None] | The created View object (or None if failed to create the view)
Source code in src/sageworks/core/views/computation_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/computation_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/display_view/","title":"Display View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
Note: This class will be used in the future to fine-tune which columns get displayed. For now, just use set_computation_columns() on a DataSource or FeatureSet.
DisplayView Class: Create a View with a subset of columns for display purposes
"},{"location":"core_classes/views/display_view/#sageworks.core.views.display_view.DisplayView","title":"DisplayView
","text":" Bases: ColumnSubsetView
DisplayView Class: Create a View with a subset of columns for display purposes
Common Usage:\n# Create a default DisplayView\nfs = FeatureSet(\"test_features\")\ndisplay_view = DisplayView.create(fs)\ndf = display_view.pull_dataframe()\n\n# Create a DisplayView with a specific set of columns\ndisplay_view = DisplayView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = display_view.query(f\"SELECT * FROM {display_view.table} where awesome = 'yes'\")\n
Source code in src/sageworks/core/views/display_view.py
class DisplayView(ColumnSubsetView):\n \"\"\"DisplayView Class: Create a View with a subset of columns for display purposes\n\n Common Usage:\n ```python\n # Create a default DisplayView\n fs = FeatureSet(\"test_features\")\n display_view = DisplayView.create(fs)\n df = display_view.pull_dataframe()\n\n # Create a DisplayView with a specific set of columns\n display_view = DisplayView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = display_view.query(f\"SELECT * FROM {display_view.table} where awesome = 'yes'\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a DisplayView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"display\" view name\n return ColumnSubsetView.create(\"display\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/display_view/#sageworks.core.views.display_view.DisplayView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a DisplayView instance.
Parameters:
Name | Type | Description | Default
artifact | Union[DataSource, FeatureSet] | The DataSource or FeatureSet object | required
source_table | str | The table/view to create the view from. Defaults to None. | None
column_list | Union[list[str], None] | A list of columns to include. Defaults to None. | None
column_limit | int | The max number of columns to include. Defaults to 30. | 30
Returns:
Type | Description
Union[View, None] | The created View object (or None if failed to create the view)
Source code in src/sageworks/core/views/display_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a DisplayView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"display\" view name\n return ColumnSubsetView.create(\"display\", artifact, source_table, column_list, column_limit)\n
"},{"location":"core_classes/views/display_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/mdq_view/","title":"ModelDataQuality View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
MDQView Class: A View that computes various endpoint data quality metrics
"},{"location":"core_classes/views/mdq_view/#sageworks.core.views.mdq_view.MDQView","title":"MDQView
","text":"MDQView Class: A View that computes various endpoint data quality metrics
Common Usage:\n# Grab a FeatureSet and an Endpoint\nfs = FeatureSet(\"abalone_features\")\nendpoint = Endpoint(\"abalone-regression-end\")\n\n# Create a ModelDataQuality View\nmdq_view = MDQView.create(fs, endpoint=endpoint, id_column=\"id\")\nmy_df = mdq_view.pull_dataframe(head=True)\n\n# Query the view\ndf = mdq_view.query(f\"SELECT * FROM {mdq_view.table} where residuals > 0.5\")\n
Source code in src/sageworks/core/views/mdq_view.py
class MDQView:\n \"\"\"MDQView Class: A View that computes various endpoint data quality metrics\n\n Common Usage:\n ```python\n # Grab a FeatureSet and an Endpoint\n fs = FeatureSet(\"abalone_features\")\n endpoint = Endpoint(\"abalone-regression-end\")\n\n # Create a ModelDataQuality View\n mdq_view = MDQView.create(fs, endpoint=endpoint, id_column=\"id\")\n my_df = mdq_view.pull_dataframe(head=True)\n\n # Query the view\n df = mdq_view.query(f\"SELECT * FROM {mdq_view.table} where residuals > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n fs: FeatureSet,\n endpoint: Endpoint,\n id_column: str,\n use_reference_model: bool = False,\n ) -> Union[View, None]:\n \"\"\"Create a Model Data Quality View with metrics\n\n Args:\n fs (FeatureSet): The FeatureSet object\n endpoint (Endpoint): The Endpoint object to use for the target and features\n id_column (str): The name of the id column (must be defined for join logic)\n use_reference_model (bool): Use the reference model for inference (default: False)\n\n Returns:\n Union[View, None]: The created View object (or None if failed)\n \"\"\"\n # Log view creation\n fs.log.important(\"Creating Model Data Quality View...\")\n\n # Get the target and feature columns from the endpoints model input\n model_input = Model(endpoint.get_input())\n target = model_input.target()\n features = model_input.features()\n\n # Pull in data from the source table\n df = fs.data_source.query(f\"SELECT * FROM {fs.data_source.uuid}\")\n\n # Check if the target and features are available in the data source\n missing_columns = [col for col in [target] + features if col not in df.columns]\n if missing_columns:\n fs.log.error(f\"Missing columns in data source: {missing_columns}\")\n return None\n\n # Check if the target is categorical\n categorical_target = not pd.api.types.is_numeric_dtype(df[target])\n\n # Compute row tags with RowTagger\n row_tagger = RowTagger(\n df,\n features=features,\n id_column=id_column,\n 
target_column=target,\n within_dist=0.25,\n min_target_diff=1.0,\n outlier_df=fs.data_source.outliers(),\n categorical_target=categorical_target,\n )\n mdq_df = row_tagger.tag_rows()\n\n # Rename and compute data quality scores based on tags\n mdq_df.rename(columns={\"tags\": \"data_quality_tags\"}, inplace=True)\n\n # We're going to compute a data_quality score based on the tags.\n mdq_df[\"data_quality\"] = mdq_df[\"data_quality_tags\"].apply(cls.calculate_data_quality)\n\n # Compute residuals using ResidualsCalculator\n if use_reference_model:\n residuals_calculator = ResidualsCalculator()\n else:\n residuals_calculator = ResidualsCalculator(endpoint=endpoint)\n residuals_df = residuals_calculator.fit_transform(df[features], df[target])\n\n # Add id_column to the residuals dataframe and merge with mdq_df\n residuals_df[id_column] = df[id_column]\n\n # Drop overlapping columns in mdq_df (except for the id_column) to avoid _x and _y suffixes\n overlap_columns = [col for col in residuals_df.columns if col in mdq_df.columns and col != id_column]\n mdq_df = mdq_df.drop(columns=overlap_columns)\n\n # Merge the DataFrames, with the id_column as the join key\n mdq_df = mdq_df.merge(residuals_df, on=id_column, how=\"left\")\n\n # Delegate view creation to PandasToView\n view_name = \"mdq_ref\" if use_reference_model else \"mdq\"\n return PandasToView.create(view_name, fs, df=mdq_df, id_column=id_column)\n\n @staticmethod\n def calculate_data_quality(tags):\n score = 1.0 # Start with the default score\n if \"coincident\" in tags:\n score -= 1.0\n if \"htg\" in tags:\n score -= 0.5\n if \"outlier\" in tags:\n score -= 0.25\n score = max(0.0, score)\n return score\n
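To make the scoring concrete, here is a standalone copy of the calculate_data_quality logic applied to a few tag combinations. It is a sketch that assumes the tags for each row arrive as a list (or set) of strings; the tag combinations are chosen for illustration.

```python
def calculate_data_quality(tags):
    # Same penalties as MDQView.calculate_data_quality
    score = 1.0  # Start with the default score
    if 'coincident' in tags:
        score -= 1.0
    if 'htg' in tags:
        score -= 0.5
    if 'outlier' in tags:
        score -= 0.25
    return max(0.0, score)  # Never drop below zero


examples = [[], ['outlier'], ['htg'], ['htg', 'outlier'], ['coincident', 'htg']]
scores = [calculate_data_quality(tags) for tags in examples]
for tags, score in zip(examples, scores):
    print(tags, score)
```

Penalties are additive and the result is clamped at 0.0, so any row tagged coincident scores 0.0 regardless of its other tags.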
"},{"location":"core_classes/views/mdq_view/#sageworks.core.views.mdq_view.MDQView.create","title":"create(fs, endpoint, id_column, use_reference_model=False)
classmethod
","text":"Create a Model Data Quality View with metrics
Parameters:
Name | Type | Description | Default
fs | FeatureSet | The FeatureSet object | required
endpoint | Endpoint | The Endpoint object to use for the target and features | required
id_column | str | The name of the id column (must be defined for join logic) | required
use_reference_model | bool | Use the reference model for inference (default: False) | False
Returns:
Type | Description
Union[View, None] | The created View object (or None if failed)
Source code in src/sageworks/core/views/mdq_view.py
@classmethod\ndef create(\n cls,\n fs: FeatureSet,\n endpoint: Endpoint,\n id_column: str,\n use_reference_model: bool = False,\n) -> Union[View, None]:\n \"\"\"Create a Model Data Quality View with metrics\n\n Args:\n fs (FeatureSet): The FeatureSet object\n endpoint (Endpoint): The Endpoint object to use for the target and features\n id_column (str): The name of the id column (must be defined for join logic)\n use_reference_model (bool): Use the reference model for inference (default: False)\n\n Returns:\n Union[View, None]: The created View object (or None if failed)\n \"\"\"\n # Log view creation\n fs.log.important(\"Creating Model Data Quality View...\")\n\n # Get the target and feature columns from the endpoints model input\n model_input = Model(endpoint.get_input())\n target = model_input.target()\n features = model_input.features()\n\n # Pull in data from the source table\n df = fs.data_source.query(f\"SELECT * FROM {fs.data_source.uuid}\")\n\n # Check if the target and features are available in the data source\n missing_columns = [col for col in [target] + features if col not in df.columns]\n if missing_columns:\n fs.log.error(f\"Missing columns in data source: {missing_columns}\")\n return None\n\n # Check if the target is categorical\n categorical_target = not pd.api.types.is_numeric_dtype(df[target])\n\n # Compute row tags with RowTagger\n row_tagger = RowTagger(\n df,\n features=features,\n id_column=id_column,\n target_column=target,\n within_dist=0.25,\n min_target_diff=1.0,\n outlier_df=fs.data_source.outliers(),\n categorical_target=categorical_target,\n )\n mdq_df = row_tagger.tag_rows()\n\n # Rename and compute data quality scores based on tags\n mdq_df.rename(columns={\"tags\": \"data_quality_tags\"}, inplace=True)\n\n # We're going to compute a data_quality score based on the tags.\n mdq_df[\"data_quality\"] = mdq_df[\"data_quality_tags\"].apply(cls.calculate_data_quality)\n\n # Compute residuals using ResidualsCalculator\n if 
use_reference_model:\n residuals_calculator = ResidualsCalculator()\n else:\n residuals_calculator = ResidualsCalculator(endpoint=endpoint)\n residuals_df = residuals_calculator.fit_transform(df[features], df[target])\n\n # Add id_column to the residuals dataframe and merge with mdq_df\n residuals_df[id_column] = df[id_column]\n\n # Drop overlapping columns in mdq_df (except for the id_column) to avoid _x and _y suffixes\n overlap_columns = [col for col in residuals_df.columns if col in mdq_df.columns and col != id_column]\n mdq_df = mdq_df.drop(columns=overlap_columns)\n\n # Merge the DataFrames, with the id_column as the join key\n mdq_df = mdq_df.merge(residuals_df, on=id_column, how=\"left\")\n\n # Delegate view creation to PandasToView\n view_name = \"mdq_ref\" if use_reference_model else \"mdq\"\n return PandasToView.create(view_name, fs, df=mdq_df, id_column=id_column)\n
"},{"location":"core_classes/views/mdq_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"core_classes/views/overview/","title":"Views","text":"View Examples
Examples of using the Views classes to extend the functionality of SageWorks Artifacts are in the Examples section at the bottom of this page.
Views are a powerful way to filter and augment your DataSources and FeatureSets. With Views you can subset columns, rows, and even add data to existing SageWorks Artifacts. If you want to compute outliers, run some statistics, or engineer some new features, Views are an easy way to change, modify, and add to DataSources and FeatureSets.
If you're looking to read and pull data from a view please see the Views documentation.
"},{"location":"core_classes/views/overview/#view-constructor-classes","title":"View Constructor Classes","text":"These classes provide APIs for creating Views for DataSources and FeatureSets.
All of the SageWorks Examples are in the SageWorks Repository under the examples/ directory. For a full code listing of any example, please visit our SageWorks Examples.
Listing Views
views.py\nfrom sageworks.api.data_source import DataSource\n\n# List the views available on a DataSource\ntest_data = DataSource('test_data')\ntest_data.views()\n[\"display\", \"training\", \"computation\"]\n
Getting a Particular View
views.py\nfrom sageworks.api.feature_set import FeatureSet\n\nfs = FeatureSet('test_features')\n\n# Grab the columns for the display view\ndisplay_view = fs.view(\"display\")\ndisplay_view.columns\n['id', 'name', 'height', 'weight', 'salary', ...]\n\n# Pull the dataframe for this view\ndf = display_view.pull_dataframe()\n id name height weight salary ...\n0 58 Person 58 71.781227 275.088196 162053.140625 \n
View Queries
All SageWorks Views are stored in AWS Athena, so any query that you can make with Athena is accessible through the View Query API.
view_query.py\nfrom sageworks.api.feature_set import FeatureSet\n\n# Grab a FeatureSet View\nfs = FeatureSet(\"abalone_features\")\nt_view = fs.view(\"training\")\n\n# Make some queries using the Athena backend\ndf = t_view.query(f\"select * from {t_view.table} where height > .3\")\nprint(df.head())\n\ndf = t_view.query(\"select * from abalone_features where class_number_of_rings < 3\")\nprint(df.head())\n
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10\n1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8\n\n sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings\n0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1\n1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2\n
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord.
"},{"location":"core_classes/views/training_view/","title":"Training View","text":"Experimental
The SageWorks View classes are currently in experimental mode so have fun but expect issues and API changes going forward.
TrainingView Class: A View with an additional training column that marks holdout ids
"},{"location":"core_classes/views/training_view/#sageworks.core.views.training_view.TrainingView","title":"TrainingView
","text":" Bases: CreateView
TrainingView Class: A View with an additional training column that marks holdout ids
Common Usage:\n# Create a default TrainingView\nfs = FeatureSet(\"test_features\")\ntraining_view = TrainingView.create(fs)\ndf = training_view.pull_dataframe()\n\n# Create a TrainingView with a specific set of holdout ids\ntraining_view = TrainingView.create(fs, id_column=\"my_id\", holdout_ids=[1, 2, 3])\n\n# Query the view\ndf = training_view.query(f\"SELECT * FROM {training_view.table} where training = TRUE\")\n
Source code in src/sageworks/core/views/training_view.py
class TrainingView(CreateView):\n \"\"\"TrainingView Class: A View with an additional training column that marks holdout ids\n\n Common Usage:\n ```python\n # Create a default TrainingView\n fs = FeatureSet(\"test_features\")\n training_view = TrainingView.create(fs)\n df = training_view.pull_dataframe()\n\n # Create a TrainingView with a specific set of holdout ids\n training_view = TrainingView.create(fs, id_column=\"my_id\", holdout_ids=[1, 2, 3])\n\n # Query the view\n df = training_view.query(f\"SELECT * FROM {training_view.table} where training = TRUE\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n feature_set: FeatureSet,\n source_table: str = None,\n id_column: str = None,\n holdout_ids: Union[list[str], list[int], None] = None,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a TrainingView instance.\n\n Args:\n feature_set (FeatureSet): A FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None.\n id_column (str, optional): The name of the id column. Defaults to None.\n holdout_ids (Union[list[str], list[int], None], optional): A list of holdout ids. 
Defaults to None.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Instantiate the TrainingView with \"training\" as the view name\n instance = cls(\"training\", feature_set, source_table)\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n source_table_columns = get_column_list(instance.data_source, instance.source_table)\n column_list = [col for col in source_table_columns if col not in aws_cols]\n\n # Sanity check on the id column\n if not id_column:\n instance.log.important(\"No id column specified, we'll try the auto_id_column ..\")\n if not instance.auto_id_column:\n instance.log.error(\"No id column specified and no auto_id_column found, aborting ..\")\n return None\n else:\n if instance.auto_id_column not in column_list:\n instance.log.error(\n f\"Auto id column {instance.auto_id_column} not found in column list, aborting ..\"\n )\n return None\n else:\n id_column = instance.auto_id_column\n\n # If we don't have holdout ids, create a default training view\n if not holdout_ids:\n instance._default_training_view(instance.data_source, id_column)\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # Format the list of holdout ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {instance.table} AS\n SELECT {sql_columns}, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN False\n ELSE True\n END AS training\n FROM {instance.source_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n 
instance.data_source.execute_statement(create_view_query)\n\n # Return the View\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # This is an internal method that's used to create a default training view\n def _default_training_view(self, data_source: DataSource, id_column: str):\n \"\"\"Create a default view in Athena that assigns roughly 80% of the data to training\n\n Args:\n data_source (DataSource): The SageWorks DataSource object\n id_column (str): The name of the id column\n \"\"\"\n self.log.important(f\"Creating default Training View {self.table}...\")\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n column_list = [col for col in data_source.columns if col not in aws_cols]\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query with a simple modulo operation for the 80/20 split\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW \"{self.table}\" AS\n SELECT {sql_columns}, CASE\n WHEN MOD(ROW_NUMBER() OVER (ORDER BY {id_column}), 10) < 8 THEN True -- Assign 80% to training\n ELSE False -- Assign roughly 20% to validation/test\n END AS training\n FROM {self.base_table_name}\n \"\"\"\n\n # Execute the CREATE VIEW query\n data_source.execute_statement(create_view_query)\n
"},{"location":"core_classes/views/training_view/#sageworks.core.views.training_view.TrainingView.create","title":"create(feature_set, source_table=None, id_column=None, holdout_ids=None)
classmethod
","text":"Factory method to create and return a TrainingView instance.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| feature_set | FeatureSet | A FeatureSet object | required |
| source_table | str | The table/view to create the view from. Defaults to None. | None |
| id_column | str | The name of the id column. Defaults to None. | None |
| holdout_ids | Union[list[str], list[int], None] | A list of holdout ids. Defaults to None. | None |
Returns:
| Type | Description |
| --- | --- |
| Union[View, None] | The created View object (or None if failed to create the view) |
Source code in src/sageworks/core/views/training_view.py
@classmethod\ndef create(\n cls,\n feature_set: FeatureSet,\n source_table: str = None,\n id_column: str = None,\n holdout_ids: Union[list[str], list[int], None] = None,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a TrainingView instance.\n\n Args:\n feature_set (FeatureSet): A FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None.\n id_column (str, optional): The name of the id column. Defaults to None.\n holdout_ids (Union[list[str], list[int], None], optional): A list of holdout ids. Defaults to None.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Instantiate the TrainingView with \"training\" as the view name\n instance = cls(\"training\", feature_set, source_table)\n\n # Drop any columns generated from AWS\n aws_cols = [\"write_time\", \"api_invocation_time\", \"is_deleted\", \"event_time\"]\n source_table_columns = get_column_list(instance.data_source, instance.source_table)\n column_list = [col for col in source_table_columns if col not in aws_cols]\n\n # Sanity check on the id column\n if not id_column:\n instance.log.important(\"No id column specified, we'll try the auto_id_column ..\")\n if not instance.auto_id_column:\n instance.log.error(\"No id column specified and no auto_id_column found, aborting ..\")\n return None\n else:\n if instance.auto_id_column not in column_list:\n instance.log.error(\n f\"Auto id column {instance.auto_id_column} not found in column list, aborting ..\"\n )\n return None\n else:\n id_column = instance.auto_id_column\n\n # If we don't have holdout ids, create a default training view\n if not holdout_ids:\n instance._default_training_view(instance.data_source, id_column)\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n\n # Format the list of holdout ids for SQL IN clause\n if holdout_ids and all(isinstance(id, str) for id in holdout_ids):\n 
formatted_holdout_ids = \", \".join(f\"'{id}'\" for id in holdout_ids)\n else:\n formatted_holdout_ids = \", \".join(map(str, holdout_ids))\n\n # Enclose each column name in double quotes\n sql_columns = \", \".join([f'\"{column}\"' for column in column_list])\n\n # Construct the CREATE VIEW query\n create_view_query = f\"\"\"\n CREATE OR REPLACE VIEW {instance.table} AS\n SELECT {sql_columns}, CASE\n WHEN {id_column} IN ({formatted_holdout_ids}) THEN False\n ELSE True\n END AS training\n FROM {instance.source_table}\n \"\"\"\n\n # Execute the CREATE VIEW query\n instance.data_source.execute_statement(create_view_query)\n\n # Return the View\n return View(instance.data_source, instance.view_name, auto_create_view=False)\n
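As a rough illustration of the two code paths above (explicit holdout ids versus the default 80/20 split), the sketch below reproduces the logic in plain Python without touching Athena. The helper names `format_holdout_ids` and `default_training_flags` are hypothetical and are not part of SageWorks; only the quoting rule and the `MOD(ROW_NUMBER(), 10) < 8` rule are taken from the source above.

```python
from typing import Union


def format_holdout_ids(holdout_ids: Union[list[str], list[int]]) -> str:
    # Hypothetical helper (not a SageWorks function): build the body of a SQL IN clause
    if all(isinstance(hid, str) for hid in holdout_ids):
        # repr() wraps a plain string in single quotes, e.g. id_1 becomes 'id_1'
        return ', '.join(repr(hid) for hid in holdout_ids)
    # Anything else is rendered with str(), so integer ids stay unquoted
    return ', '.join(map(str, holdout_ids))


def default_training_flags(n_rows: int) -> list[bool]:
    # Mirrors MOD(ROW_NUMBER() OVER (...), 10) < 8, with row numbers starting at 1
    return [(row_number % 10) < 8 for row_number in range(1, n_rows + 1)]


print(format_holdout_ids(['id_1', 'id_2']))  # 'id_1', 'id_2'
print(format_holdout_ids([3, 7]))            # 3, 7
print(sum(default_training_flags(100)))      # 80
```

Because the ids are interpolated directly into the SQL text, this approach assumes they come from a trusted source.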
"},{"location":"core_classes/views/training_view/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/overview/","title":"Data Algorithms","text":"Data Algorithms
WIP: These classes are under active development and are subject to change in both API and functionality over time. They provide a set of data algorithms for various types of data storage. We currently have subdirectories for:
SQL: SQL queries that provide a wide range of functionality:
Welcome to the SageWorks Data Algorithms
Docs TBD
"},{"location":"data_algorithms/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/dataframes/overview/","title":"Pandas Dataframe Algorithms","text":"Pandas Dataframes
Pandas dataframes will not scale as well as our Spark and SQL Algorithms, but for moderately sized data these algorithms provide useful functionality.
Pandas Dataframe Algorithms
SageWorks has a growing set of algorithms and data processing tools for Pandas Dataframes. In general these algorithms take a dataframe as input and return a dataframe with additional columns.
FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.
DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity","title":"FeatureSpaceProximity
","text":"Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
class FeatureSpaceProximity:\n def __init__(self, df: pd.DataFrame, features: list, id_column: str, target: str = None, neighbors: int = 10):\n \"\"\"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.\n\n Args:\n df: Pandas DataFrame\n features: List of feature column names\n id_column: Name of the ID column\n target: Optional name of the target column to include target-based functionality (default: None)\n neighbors: Number of neighbors to use in the KNN model (default: 10)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.df = df\n self.features = features\n self.id_column = id_column\n self.target = target\n self.knn_neighbors = neighbors\n\n # Standardize the feature values and build the KNN model\n self.log.info(\"Building KNN model for FeatureSpaceProximity...\")\n self.scaler = StandardScaler().fit(df[features])\n scaled_features = self.scaler.transform(df[features])\n self.knn_model = NearestNeighbors(n_neighbors=neighbors, algorithm=\"auto\").fit(scaled_features)\n\n # Compute Z-Scores or Consistency Scores for the target values\n if self.target and is_numeric_dtype(self.df[self.target]):\n self.log.info(\"Computing Z-Scores for target values...\")\n self.target_z_scores()\n else:\n self.log.info(\"Computing target consistency scores...\")\n self.target_consistency()\n\n # Now compute the outlier scores\n self.log.info(\"Computing outlier scores...\")\n self.outliers()\n\n @classmethod\n def from_model(cls, model) -> \"FeatureSpaceProximity\":\n \"\"\"Create a FeatureSpaceProximity instance from a SageWorks model object.\n\n Args:\n model (Model): A SageWorks model object.\n\n Returns:\n FeatureSpaceProximity: A new instance of the FeatureSpaceProximity class.\n \"\"\"\n from sageworks.api import FeatureSet\n\n # Extract necessary attributes from the SageWorks model\n fs = FeatureSet(model.get_input())\n features = model.features()\n target = model.target()\n\n # Retrieve the training DataFrame from 
the feature set\n df = fs.view(\"training\").pull_dataframe()\n\n # Create and return a new instance of FeatureSpaceProximity\n return cls(df=df, features=features, id_column=fs.id_column, target=target)\n\n def neighbors(self, query_id: Union[str, int], radius: float = None, include_self: bool = True) -> pd.DataFrame:\n \"\"\"Return neighbors of the given query ID, either by fixed neighbors or within a radius.\n\n Args:\n query_id (Union[str, int]): The ID of the query point.\n radius (float): Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self (bool): Whether to include the query ID itself in the neighbor results.\n\n Returns:\n pd.DataFrame: Filtered DataFrame that includes the query ID, its neighbors, and optionally target values.\n \"\"\"\n if query_id not in self.df[self.id_column].values:\n self.log.warning(f\"Query ID '{query_id}' not found in the DataFrame. Returning an empty DataFrame.\")\n return pd.DataFrame()\n\n # Get a single-row DataFrame for the query ID\n query_df = self.df[self.df[self.id_column] == query_id]\n\n # Use the neighbors_bulk method with the appropriate radius\n neighbors_info_df = self.neighbors_bulk(query_df, radius=radius, include_self=include_self)\n\n # Extract the neighbor IDs and distances from the results\n neighbor_ids = neighbors_info_df[\"neighbor_ids\"].iloc[0]\n neighbor_distances = neighbors_info_df[\"neighbor_distances\"].iloc[0]\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids, neighbor_distances), key=lambda x: x[1])\n sorted_ids, sorted_distances = zip(*sorted_neighbors)\n\n # Filter the internal DataFrame to include only the sorted neighbors\n neighbors_df = self.df[self.df[self.id_column].isin(sorted_ids)]\n neighbors_df = neighbors_df.set_index(self.id_column).reindex(sorted_ids).reset_index()\n neighbors_df[\"knn_distance\"] = sorted_distances\n return neighbors_df\n\n def neighbors_bulk(self, query_df: pd.DataFrame, 
radius: float = None, include_self: bool = False) -> pd.DataFrame:\n \"\"\"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.\n\n Args:\n query_df: Pandas DataFrame with the same features as the training data.\n radius: Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self: Boolean indicating whether to include the query ID in the neighbor results.\n\n Returns:\n pd.DataFrame: DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances.\n \"\"\"\n # Scale the query data using the same scaler as the training data\n query_scaled = self.scaler.transform(query_df[self.features])\n\n # Retrieve neighbors based on radius or standard neighbors\n if radius is not None:\n distances, indices = self.knn_model.radius_neighbors(query_scaled, radius=radius)\n else:\n distances, indices = self.knn_model.kneighbors(query_scaled)\n\n # Collect neighbor information (IDs, target values, and distances)\n query_ids = query_df[self.id_column].values\n neighbor_ids = [[self.df.iloc[idx][self.id_column] for idx in index_list] for index_list in indices]\n neighbor_targets = (\n [\n [self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0] for neighbor in index_list]\n for index_list in neighbor_ids\n ]\n if self.target\n else None\n )\n neighbor_distances = [list(dist_list) for dist_list in distances]\n\n # Automatically remove the query ID itself from the neighbor results if include_self is False\n for i, query_id in enumerate(query_ids):\n if query_id in neighbor_ids[i] and not include_self:\n idx_to_remove = neighbor_ids[i].index(query_id)\n neighbor_ids[i].pop(idx_to_remove)\n neighbor_distances[i].pop(idx_to_remove)\n if neighbor_targets:\n neighbor_targets[i].pop(idx_to_remove)\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids[i], neighbor_distances[i]), key=lambda x: x[1])\n neighbor_ids[i], 
neighbor_distances[i] = list(zip(*sorted_neighbors)) if sorted_neighbors else ([], [])\n if neighbor_targets:\n neighbor_targets[i] = [\n self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0]\n for neighbor in neighbor_ids[i]\n ]\n\n # Create and return a results DataFrame with the updated neighbor information\n result_df = pd.DataFrame(\n {\n \"query_id\": query_ids,\n \"neighbor_ids\": neighbor_ids,\n \"neighbor_distances\": neighbor_distances,\n }\n )\n\n if neighbor_targets:\n result_df[\"neighbor_targets\"] = neighbor_targets\n\n return result_df\n\n def outliers(self) -> None:\n \"\"\"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.\"\"\"\n if \"target_z\" in self.df.columns:\n # Normalize Z-Scores to a 0-1 range\n self.df[\"outlier\"] = (self.df[\"target_z\"].abs() / (self.df[\"target_z\"].abs().max() + 1e-6)).clip(0, 1)\n\n elif \"target_consistency\" in self.df.columns:\n # Calculate outlier score as 1 - consistency\n self.df[\"outlier\"] = 1 - self.df[\"target_consistency\"]\n\n else:\n self.log.warning(\"No 'target_z' or 'target_consistency' column found to compute outlier scores.\")\n\n def target_z_scores(self) -> None:\n \"\"\"Compute Z-Scores for NUMERIC target values.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for Z-Score computation.\")\n return\n\n # Get the neighbors and distances for each internal observation\n distances, indices = self.knn_model.kneighbors()\n\n # Retrieve all neighbor target values in a single operation\n neighbor_targets = self.df[self.target].values[indices] # Shape will be (n_samples, n_neighbors)\n\n # Compute the mean and std along the neighbors axis (axis=1)\n neighbor_means = neighbor_targets.mean(axis=1)\n neighbor_stds = neighbor_targets.std(axis=1, ddof=0)\n\n # Vectorized Z-score calculation\n current_targets = self.df[self.target].values\n z_scores = np.where(neighbor_stds == 0, 0.0, (current_targets - neighbor_means) / 
neighbor_stds)\n\n # Assign the computed Z-Scores back to the DataFrame\n self.df[\"target_z\"] = z_scores\n\n def target_consistency(self) -> None:\n \"\"\"Compute a Neighborhood Consistency Score for CATEGORICAL targets.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for neighborhood consistency computation.\")\n return\n\n # Get the neighbors and distances for each internal observation (already excludes the query)\n distances, indices = self.knn_model.kneighbors()\n\n # Calculate the Neighborhood Consistency Score for each observation\n consistency_scores = []\n for idx, idx_list in enumerate(indices):\n query_target = self.df.iloc[idx][self.target] # Get current observation's target value\n\n # Get the neighbors' target values\n neighbor_targets = self.df.iloc[idx_list][self.target]\n\n # Calculate the proportion of neighbors that have the same category as the query observation\n consistency_score = (neighbor_targets == query_target).mean()\n consistency_scores.append(consistency_score)\n\n # Add the 'target_consistency' column to the internal dataframe\n self.df[\"target_consistency\"] = consistency_scores\n\n def get_neighbor_indices_and_distances(self):\n \"\"\"Retrieve neighbor indices and distances for all points in the dataset.\"\"\"\n distances, indices = self.knn_model.kneighbors()\n return indices, distances\n\n def target_summary(self, query_id: Union[str, int]) -> pd.DataFrame:\n \"\"\"WIP: Provide a summary of target values in the neighborhood of the given query ID\"\"\"\n neighbors_df = self.neighbors(query_id, include_self=False)\n if self.target and not neighbors_df.empty:\n summary_stats = neighbors_df[self.target].describe()\n return pd.DataFrame(summary_stats).transpose()\n else:\n self.log.warning(f\"No target values found for neighbors of Query ID '{query_id}'.\")\n return pd.DataFrame()\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.__init__","title":"__init__(df, features, id_column, target=None, neighbors=10)
","text":"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Pandas DataFrame | required |
| features | list | List of feature column names | required |
| id_column | str | Name of the ID column | required |
| target | str | Optional name of the target column to include target-based functionality (default: None) | None |
| neighbors | int | Number of neighbors to use in the KNN model (default: 10) | 10 |
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def __init__(self, df: pd.DataFrame, features: list, id_column: str, target: str = None, neighbors: int = 10):\n \"\"\"FeatureSpaceProximity: A class for neighbor lookups using KNN with optional target information.\n\n Args:\n df: Pandas DataFrame\n features: List of feature column names\n id_column: Name of the ID column\n target: Optional name of the target column to include target-based functionality (default: None)\n neighbors: Number of neighbors to use in the KNN model (default: 10)\n \"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.df = df\n self.features = features\n self.id_column = id_column\n self.target = target\n self.knn_neighbors = neighbors\n\n # Standardize the feature values and build the KNN model\n self.log.info(\"Building KNN model for FeatureSpaceProximity...\")\n self.scaler = StandardScaler().fit(df[features])\n scaled_features = self.scaler.transform(df[features])\n self.knn_model = NearestNeighbors(n_neighbors=neighbors, algorithm=\"auto\").fit(scaled_features)\n\n # Compute Z-Scores or Consistency Scores for the target values\n if self.target and is_numeric_dtype(self.df[self.target]):\n self.log.info(\"Computing Z-Scores for target values...\")\n self.target_z_scores()\n else:\n self.log.info(\"Computing target consistency scores...\")\n self.target_consistency()\n\n # Now compute the outlier scores\n self.log.info(\"Computing outlier scores...\")\n self.outliers()\n
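To make the constructor's first two steps concrete (standardize the features, then build a nearest-neighbor lookup), here is a dependency-free sketch. It swaps scikit-learn's StandardScaler and NearestNeighbors for a brute-force Euclidean search, so it only illustrates the idea and is not how SageWorks implements it. The names `standardize` and `k_nearest` are hypothetical.

```python
import math
from statistics import fmean, pstdev


def standardize(rows: list[list[float]]) -> list[list[float]]:
    # Column-wise zero mean / unit variance, using the population std (ddof=0)
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def k_nearest(scaled: list[list[float]], query_idx: int, k: int) -> list[tuple[int, float]]:
    # Brute-force Euclidean distance from one row to every other row
    query = scaled[query_idx]
    dists = [(i, math.dist(query, row)) for i, row in enumerate(scaled) if i != query_idx]
    # Closest first, keep the k nearest (row index, distance) pairs
    return sorted(dists, key=lambda pair: pair[1])[:k]


rows = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [5.2, 49.0]]
scaled = standardize(rows)
print(k_nearest(scaled, query_idx=0, k=2))
```

Scaling first matters because the second column is an order of magnitude larger than the first; without it, that column would dominate every distance.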
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.from_model","title":"from_model(model)
classmethod
","text":"Create a FeatureSpaceProximity instance from a SageWorks model object.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Model | A SageWorks model object. | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| FeatureSpaceProximity | FeatureSpaceProximity | A new instance of the FeatureSpaceProximity class. |
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
@classmethod\ndef from_model(cls, model) -> \"FeatureSpaceProximity\":\n \"\"\"Create a FeatureSpaceProximity instance from a SageWorks model object.\n\n Args:\n model (Model): A SageWorks model object.\n\n Returns:\n FeatureSpaceProximity: A new instance of the FeatureSpaceProximity class.\n \"\"\"\n from sageworks.api import FeatureSet\n\n # Extract necessary attributes from the SageWorks model\n fs = FeatureSet(model.get_input())\n features = model.features()\n target = model.target()\n\n # Retrieve the training DataFrame from the feature set\n df = fs.view(\"training\").pull_dataframe()\n\n # Create and return a new instance of FeatureSpaceProximity\n return cls(df=df, features=features, id_column=fs.id_column, target=target)\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.get_neighbor_indices_and_distances","title":"get_neighbor_indices_and_distances()
","text":"Retrieve neighbor indices and distances for all points in the dataset.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def get_neighbor_indices_and_distances(self):\n \"\"\"Retrieve neighbor indices and distances for all points in the dataset.\"\"\"\n distances, indices = self.knn_model.kneighbors()\n return indices, distances\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.neighbors","title":"neighbors(query_id, radius=None, include_self=True)
","text":"Return neighbors of the given query ID, either by fixed neighbors or within a radius.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query_id | Union[str, int] | The ID of the query point. | required |
| radius | float | Optional radius within which neighbors are to be searched, else use fixed neighbors. | None |
| include_self | bool | Whether to include the query ID itself in the neighbor results. | True |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | Filtered DataFrame that includes the query ID, its neighbors, and optionally target values. |
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def neighbors(self, query_id: Union[str, int], radius: float = None, include_self: bool = True) -> pd.DataFrame:\n \"\"\"Return neighbors of the given query ID, either by fixed neighbors or within a radius.\n\n Args:\n query_id (Union[str, int]): The ID of the query point.\n radius (float): Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self (bool): Whether to include the query ID itself in the neighbor results.\n\n Returns:\n pd.DataFrame: Filtered DataFrame that includes the query ID, its neighbors, and optionally target values.\n \"\"\"\n if query_id not in self.df[self.id_column].values:\n self.log.warning(f\"Query ID '{query_id}' not found in the DataFrame. Returning an empty DataFrame.\")\n return pd.DataFrame()\n\n # Get a single-row DataFrame for the query ID\n query_df = self.df[self.df[self.id_column] == query_id]\n\n # Use the neighbors_bulk method with the appropriate radius\n neighbors_info_df = self.neighbors_bulk(query_df, radius=radius, include_self=include_self)\n\n # Extract the neighbor IDs and distances from the results\n neighbor_ids = neighbors_info_df[\"neighbor_ids\"].iloc[0]\n neighbor_distances = neighbors_info_df[\"neighbor_distances\"].iloc[0]\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids, neighbor_distances), key=lambda x: x[1])\n sorted_ids, sorted_distances = zip(*sorted_neighbors)\n\n # Filter the internal DataFrame to include only the sorted neighbors\n neighbors_df = self.df[self.df[self.id_column].isin(sorted_ids)]\n neighbors_df = neighbors_df.set_index(self.id_column).reindex(sorted_ids).reset_index()\n neighbors_df[\"knn_distance\"] = sorted_distances\n return neighbors_df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.neighbors_bulk","title":"neighbors_bulk(query_df, radius=None, include_self=False)
","text":"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query_df | DataFrame | Pandas DataFrame with the same features as the training data. | required |
| radius | float | Optional radius within which neighbors are to be searched, else use fixed neighbors. | None |
| include_self | bool | Boolean indicating whether to include the query ID in the neighbor results. | False |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances. |
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def neighbors_bulk(self, query_df: pd.DataFrame, radius: float = None, include_self: bool = False) -> pd.DataFrame:\n \"\"\"Return neighbors for each row in the given query dataframe, either by fixed neighbors or within a radius.\n\n Args:\n query_df: Pandas DataFrame with the same features as the training data.\n radius: Optional radius within which neighbors are to be searched, else use fixed neighbors.\n include_self: Boolean indicating whether to include the query ID in the neighbor results.\n\n Returns:\n pd.DataFrame: DataFrame with query ID, neighbor IDs, neighbor targets, and neighbor distances.\n \"\"\"\n # Scale the query data using the same scaler as the training data\n query_scaled = self.scaler.transform(query_df[self.features])\n\n # Retrieve neighbors based on radius or standard neighbors\n if radius is not None:\n distances, indices = self.knn_model.radius_neighbors(query_scaled, radius=radius)\n else:\n distances, indices = self.knn_model.kneighbors(query_scaled)\n\n # Collect neighbor information (IDs, target values, and distances)\n query_ids = query_df[self.id_column].values\n neighbor_ids = [[self.df.iloc[idx][self.id_column] for idx in index_list] for index_list in indices]\n neighbor_targets = (\n [\n [self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0] for neighbor in index_list]\n for index_list in neighbor_ids\n ]\n if self.target\n else None\n )\n neighbor_distances = [list(dist_list) for dist_list in distances]\n\n # Automatically remove the query ID itself from the neighbor results if include_self is False\n for i, query_id in enumerate(query_ids):\n if query_id in neighbor_ids[i] and not include_self:\n idx_to_remove = neighbor_ids[i].index(query_id)\n neighbor_ids[i].pop(idx_to_remove)\n neighbor_distances[i].pop(idx_to_remove)\n if neighbor_targets:\n neighbor_targets[i].pop(idx_to_remove)\n\n # Sort neighbors by distance (ascending order)\n sorted_neighbors = sorted(zip(neighbor_ids[i], neighbor_distances[i]), 
key=lambda x: x[1])\n neighbor_ids[i], neighbor_distances[i] = list(zip(*sorted_neighbors)) if sorted_neighbors else ([], [])\n if neighbor_targets:\n neighbor_targets[i] = [\n self.df.loc[self.df[self.id_column] == neighbor, self.target].values[0]\n for neighbor in neighbor_ids[i]\n ]\n\n # Create and return a results DataFrame with the updated neighbor information\n result_df = pd.DataFrame(\n {\n \"query_id\": query_ids,\n \"neighbor_ids\": neighbor_ids,\n \"neighbor_distances\": neighbor_distances,\n }\n )\n\n if neighbor_targets:\n result_df[\"neighbor_targets\"] = neighbor_targets\n\n return result_df\n
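The post-processing step in `neighbors_bulk` (drop the query row from its own neighbor list, then order by distance) can be sketched in isolation as follows. The function name `drop_self_and_sort` is hypothetical. Unlike the source, which pops the first matching id, this sketch filters every match; the two behave the same when ids are unique.

```python
def drop_self_and_sort(query_id, neighbor_ids, neighbor_distances, include_self=False):
    # Pair each neighbor id with its distance so the two stay aligned
    pairs = list(zip(neighbor_ids, neighbor_distances))
    if not include_self:
        # Remove the query point from its own neighbor list
        pairs = [(nid, dist) for nid, dist in pairs if nid != query_id]
    # Ascending distance: closest neighbor first
    pairs.sort(key=lambda pair: pair[1])
    ids = [nid for nid, _ in pairs]
    dists = [dist for _, dist in pairs]
    return ids, dists


print(drop_self_and_sort('a', ['b', 'a', 'c'], [0.7, 0.0, 0.2]))
# (['c', 'b'], [0.2, 0.7])
```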
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.outliers","title":"outliers()
","text":"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def outliers(self) -> None:\n \"\"\"Compute a unified 'outlier' score based on either 'target_z' or 'target_consistency'.\"\"\"\n if \"target_z\" in self.df.columns:\n # Normalize Z-Scores to a 0-1 range\n self.df[\"outlier\"] = (self.df[\"target_z\"].abs() / (self.df[\"target_z\"].abs().max() + 1e-6)).clip(0, 1)\n\n elif \"target_consistency\" in self.df.columns:\n # Calculate outlier score as 1 - consistency\n self.df[\"outlier\"] = 1 - self.df[\"target_consistency\"]\n\n else:\n self.log.warning(\"No 'target_z' or 'target_consistency' column found to compute outlier scores.\")\n
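A minimal sketch of the Z-Score normalization used above, written with plain lists instead of pandas Series. The function name `outlier_scores_from_z` is hypothetical; the `1e-6` term and the clip to the 0-1 range follow the source.

```python
def outlier_scores_from_z(z_scores: list[float]) -> list[float]:
    # Scale absolute Z-Scores by the largest one; 1e-6 avoids dividing by zero
    max_abs = max(abs(z) for z in z_scores)
    # Clip to the 0-1 range, as the pandas .clip(0, 1) call does
    return [min(max(abs(z) / (max_abs + 1e-6), 0.0), 1.0) for z in z_scores]


scores = outlier_scores_from_z([0.5, -2.0, 1.0])
print(scores)  # the second entry is the largest, just under 1.0
```

Note that the score is relative to the largest absolute Z-Score in the DataFrame, so the most extreme row always scores close to 1 even when no row is very unusual.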
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_consistency","title":"target_consistency()
","text":"Compute a Neighborhood Consistency Score for CATEGORICAL targets.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_consistency(self) -> None:\n \"\"\"Compute a Neighborhood Consistency Score for CATEGORICAL targets.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for neighborhood consistency computation.\")\n return\n\n # Get the neighbors and distances for each internal observation (already excludes the query)\n distances, indices = self.knn_model.kneighbors()\n\n # Calculate the Neighborhood Consistency Score for each observation\n consistency_scores = []\n for idx, idx_list in enumerate(indices):\n query_target = self.df.iloc[idx][self.target] # Get current observation's target value\n\n # Get the neighbors' target values\n neighbor_targets = self.df.iloc[idx_list][self.target]\n\n # Calculate the proportion of neighbors that have the same category as the query observation\n consistency_score = (neighbor_targets == query_target).mean()\n consistency_scores.append(consistency_score)\n\n # Add the 'target_consistency' column to the internal dataframe\n self.df[\"target_consistency\"] = consistency_scores\n
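For a single observation, the consistency score above reduces to the fraction of neighbors that share the observation's category. The sketch below shows that calculation on its own; `consistency_score` is a hypothetical name, not a SageWorks function.

```python
def consistency_score(query_target: str, neighbor_targets: list[str]) -> float:
    # Fraction of neighbors whose category matches the query observation
    matches = sum(1 for target in neighbor_targets if target == query_target)
    return matches / len(neighbor_targets)


print(consistency_score('high', ['high', 'high', 'low', 'high']))  # 0.75
```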
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_summary","title":"target_summary(query_id)
","text":"WIP: Provide a summary of target values in the neighborhood of the given query ID
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_summary(self, query_id: Union[str, int]) -> pd.DataFrame:\n \"\"\"WIP: Provide a summary of target values in the neighborhood of the given query ID\"\"\"\n neighbors_df = self.neighbors(query_id, include_self=False)\n if self.target and not neighbors_df.empty:\n summary_stats = neighbors_df[self.target].describe()\n return pd.DataFrame(summary_stats).transpose()\n else:\n self.log.warning(f\"No target values found for neighbors of Query ID '{query_id}'.\")\n return pd.DataFrame()\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.feature_space_proximity.FeatureSpaceProximity.target_z_scores","title":"target_z_scores()
","text":"Compute Z-Scores for NUMERIC target values.
Source code in src/sageworks/algorithms/dataframe/feature_space_proximity.py
def target_z_scores(self) -> None:\n \"\"\"Compute Z-Scores for NUMERIC target values.\"\"\"\n if not self.target:\n self.log.warning(\"No target column defined for Z-Score computation.\")\n return\n\n # Get the neighbors and distances for each internal observation\n distances, indices = self.knn_model.kneighbors()\n\n # Retrieve all neighbor target values in a single operation\n neighbor_targets = self.df[self.target].values[indices] # Shape will be (n_samples, n_neighbors)\n\n # Compute the mean and std along the neighbors axis (axis=1)\n neighbor_means = neighbor_targets.mean(axis=1)\n neighbor_stds = neighbor_targets.std(axis=1, ddof=0)\n\n # Vectorized Z-score calculation\n current_targets = self.df[self.target].values\n z_scores = np.where(neighbor_stds == 0, 0.0, (current_targets - neighbor_means) / neighbor_stds)\n\n # Assign the computed Z-Scores back to the DataFrame\n self.df[\"target_z\"] = z_scores\n
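The per-row arithmetic behind the vectorized code above can be sketched with the standard library. The function name `neighbor_z_score` is hypothetical; the population standard deviation (ddof=0) and the zero-spread guard follow the source.

```python
from statistics import fmean, pstdev


def neighbor_z_score(target: float, neighbor_targets: list[float]) -> float:
    # Mean and population std (ddof=0) of the neighbors' target values
    mean = fmean(neighbor_targets)
    std = pstdev(neighbor_targets)
    # Same guard as the source: a zero spread yields a Z-Score of 0.0
    return 0.0 if std == 0 else (target - mean) / std


print(neighbor_z_score(4.0, [1.0, 3.0]))       # 2.0
print(neighbor_z_score(5.0, [2.0, 2.0, 2.0]))  # 0.0
```

A large absolute value means the observation's target differs sharply from the targets of its closest neighbors in feature space.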
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator","title":"ResidualsCalculator
","text":"Bases: BaseEstimator, TransformerMixin
A custom transformer for calculating residuals using cross-validation or an endpoint.
This transformer performs K-Fold cross-validation (if no endpoint is provided), or it uses the endpoint to generate predictions and compute residuals. It adds 'prediction', 'residuals', 'residuals_abs', 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns to the input DataFrame.
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| model_class | Union[RegressorMixin, XGBRegressor] | The machine learning model class used for predictions. |
| n_splits | int | Number of splits for cross-validation. |
| random_state | int | Random state for reproducibility. |
| endpoint | Optional | The SageWorks endpoint object for running inference, if provided. |
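The cross-validation path described above can be illustrated with a small, dependency-free sketch: each row's prediction comes from a model that never saw that row, and the residual is computed against it. This is only an outline under stated assumptions. The stand-in model simply predicts the training mean (the real class uses the reference model, XGBRegressor by default), the residual sign convention of target minus prediction is an assumption, and `out_of_fold_residuals` is a hypothetical name.

```python
import random
from statistics import fmean


def out_of_fold_residuals(y: list[float], n_splits: int = 5, seed: int = 42) -> list[float]:
    # Shuffle the row indices once, then deal them into n_splits folds
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]

    residuals = [0.0] * len(y)
    for fold in folds:
        held_out = set(fold)
        # Stand-in model (assumption): predict the mean of the training rows only
        train_mean = fmean(y[i] for i in indices if i not in held_out)
        for i in fold:
            # Assumed sign convention: target minus prediction
            residuals[i] = y[i] - train_mean
    return residuals


targets = [float(v) for v in range(1, 11)]
print(out_of_fold_residuals(targets))
```

The defaults of 5 splits and a seed of 42 match the values set in the class constructor.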
Source code insrc/sageworks/algorithms/dataframe/residuals_calculator.py
class ResidualsCalculator(BaseEstimator, TransformerMixin):\n \"\"\"\n A custom transformer for calculating residuals using cross-validation or an endpoint.\n\n This transformer performs K-Fold cross-validation (if no endpoint is provided), or it uses the endpoint\n to generate predictions and compute residuals. It adds 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns to the input DataFrame.\n\n Attributes:\n model_class (Union[RegressorMixin, XGBRegressor]): The machine learning model class used for predictions.\n n_splits (int): Number of splits for cross-validation.\n random_state (int): Random state for reproducibility.\n endpoint (Optional): The SageWorks endpoint object for running inference, if provided.\n \"\"\"\n\n def __init__(\n self,\n endpoint: Optional[object] = None,\n reference_model_class: Union[RegressorMixin, XGBRegressor] = XGBRegressor,\n ):\n \"\"\"\n Initializes the ResidualsCalculator with the specified parameters.\n\n Args:\n endpoint (Optional): A SageWorks endpoint object to run inference, if available.\n reference_model_class (Union[RegressorMixin, XGBRegressor]): The reference model class for predictions.\n \"\"\"\n self.n_splits = 5\n self.random_state = 42\n self.reference_model_class = reference_model_class # Store the class, instantiate the model later\n self.reference_model = None # Lazy model initialization\n self.endpoint = endpoint # Use this endpoint for inference if provided\n self.X = None\n self.y = None\n\n def fit(self, X: pd.DataFrame, y: pd.Series) -> BaseEstimator:\n \"\"\"\n Fits the model. 
If no endpoint is provided, fitting involves storing the input data\n and initializing a reference model.\n\n Args:\n X (pd.DataFrame): The input features.\n y (pd.Series): The target variable.\n\n Returns:\n self: Returns an instance of self.\n \"\"\"\n self.X = X\n self.y = y\n\n if self.endpoint is None:\n # Only initialize the reference model if no endpoint is provided\n self.reference_model = self.reference_model_class()\n return self\n\n def transform(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: The transformed DataFrame with additional columns.\n \"\"\"\n check_is_fitted(self, [\"X\", \"y\"]) # Ensure fit has been called\n\n if self.endpoint:\n # If an endpoint is provided, run inference on the full data\n result_df = self._run_inference_via_endpoint(X)\n else:\n # If no endpoint, perform cross-validation and full model fitting\n result_df = self._run_cross_validation(X)\n\n return result_df\n\n def _run_cross_validation(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Handles the cross-validation process when no endpoint is provided.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: DataFrame with predictions and residuals from cross-validation and full model fit.\n \"\"\"\n kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)\n\n # Initialize pandas Series to store predictions and residuals, aligned by index\n predictions = pd.Series(index=self.y.index, dtype=np.float64)\n residuals = pd.Series(index=self.y.index, dtype=np.float64)\n residuals_abs = pd.Series(index=self.y.index, dtype=np.float64)\n\n # Perform cross-validation and collect predictions and residuals\n for train_index, test_index in kf.split(self.X):\n X_train, X_test = self.X.iloc[train_index], 
self.X.iloc[test_index]\n y_train, y_test = self.y.iloc[train_index], self.y.iloc[test_index]\n\n # Fit the model on the training data\n self.reference_model.fit(X_train, y_train)\n\n # Predict on the test data\n y_pred = self.reference_model.predict(X_test)\n\n # Compute residuals and absolute residuals\n residuals_fold = y_test - y_pred\n residuals_abs_fold = np.abs(residuals_fold)\n\n # Place the predictions and residuals in the correct positions based on index\n predictions.iloc[test_index] = y_pred\n residuals.iloc[test_index] = residuals_fold\n residuals_abs.iloc[test_index] = residuals_abs_fold\n\n # Train on all data and compute residuals for 100% training\n self.reference_model.fit(self.X, self.y)\n y_pred_100 = self.reference_model.predict(self.X)\n residuals_100 = self.y - y_pred_100\n residuals_100_abs = np.abs(residuals_100)\n\n # Create a copy of the provided DataFrame and add the new columns\n result_df = X.copy()\n result_df[\"prediction\"] = predictions\n result_df[\"residuals\"] = residuals\n result_df[\"residuals_abs\"] = residuals_abs\n result_df[\"prediction_100\"] = y_pred_100\n result_df[\"residuals_100\"] = residuals_100\n result_df[\"residuals_100_abs\"] = residuals_100_abs\n result_df[self.y.name] = self.y # Add the target column back\n\n return result_df\n\n def _run_inference_via_endpoint(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Handles the inference process when an endpoint is provided.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: DataFrame with predictions and residuals from the endpoint.\n \"\"\"\n # Run inference on all data using the endpoint (include the target column)\n X = X.copy()\n X.loc[:, self.y.name] = self.y\n results_df = self.endpoint.inference(X)\n predictions = results_df[\"prediction\"]\n\n # Compute residuals and residuals_abs based on the endpoint's predictions\n residuals = self.y - predictions\n residuals_abs = np.abs(residuals)\n\n # To maintain consistency, populate 
both 'prediction' and 'prediction_100' with the same values\n result_df = X.copy()\n result_df[\"prediction\"] = predictions\n result_df[\"residuals\"] = residuals\n result_df[\"residuals_abs\"] = residuals_abs\n result_df[\"prediction_100\"] = predictions\n result_df[\"residuals_100\"] = residuals\n result_df[\"residuals_100_abs\"] = residuals_abs\n\n return result_df\n
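The difference between the cross-validated columns ('prediction', 'residuals') and the full-fit columns ('prediction_100', 'residuals_100') can be shown with a small sketch. It assumes a straight-line fit via `np.polyfit` as a stand-in for the reference model; `out_of_fold_residuals` and its arguments are hypothetical names, not part of SageWorks.

```python
import numpy as np


def out_of_fold_residuals(x, y, n_splits=5, seed=42):
    # Hypothetical helper: a straight-line fit stands in for the reference model
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_splits)

    predictions = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
        # Each row is predicted by a model that never saw it
        predictions[test_idx] = slope * x[test_idx] + intercept

    # In-sample counterpart of 'prediction_100': fit on all rows
    slope, intercept = np.polyfit(x, y, deg=1)
    predictions_100 = slope * x + intercept
    return y - predictions, y - predictions_100


x = np.arange(20, dtype=float)
y = 3.0 * x + 1.0
y[7] += 10.0  # one deliberately odd row
residuals, residuals_100 = out_of_fold_residuals(x, y)
```

The odd row keeps its full out-of-fold residual because the model that predicts it was trained without it, while the full fit is pulled toward the row and reports a smaller residual. That gap is why both sets of columns can be useful side by side.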
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.__init__","title":"__init__(endpoint=None, reference_model_class=XGBRegressor)
","text":"Initializes the ResidualsCalculator with the specified parameters.
Parameters:
endpoint (Optional): A SageWorks endpoint object to run inference, if available. Default: None.
reference_model_class (Union[RegressorMixin, XGBRegressor]): The reference model class for predictions. Default: XGBRegressor.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def __init__(\n self,\n endpoint: Optional[object] = None,\n reference_model_class: Union[RegressorMixin, XGBRegressor] = XGBRegressor,\n):\n \"\"\"\n Initializes the ResidualsCalculator with the specified parameters.\n\n Args:\n endpoint (Optional): A SageWorks endpoint object to run inference, if available.\n reference_model_class (Union[RegressorMixin, XGBRegressor]): The reference model class for predictions.\n \"\"\"\n self.n_splits = 5\n self.random_state = 42\n self.reference_model_class = reference_model_class # Store the class, instantiate the model later\n self.reference_model = None # Lazy model initialization\n self.endpoint = endpoint # Use this endpoint for inference if provided\n self.X = None\n self.y = None\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.fit","title":"fit(X, y)
","text":"Fits the model. If no endpoint is provided, fitting involves storing the input data and initializing a reference model.
Parameters:
X (DataFrame): The input features. Required.
y (Series): The target variable. Required.
Returns:
self (BaseEstimator): Returns an instance of self.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def fit(self, X: pd.DataFrame, y: pd.Series) -> BaseEstimator:\n \"\"\"\n Fits the model. If no endpoint is provided, fitting involves storing the input data\n and initializing a reference model.\n\n Args:\n X (pd.DataFrame): The input features.\n y (pd.Series): The target variable.\n\n Returns:\n self: Returns an instance of self.\n \"\"\"\n self.X = X\n self.y = y\n\n if self.endpoint is None:\n # Only initialize the reference model if no endpoint is provided\n self.reference_model = self.reference_model_class()\n return self\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.residuals_calculator.ResidualsCalculator.transform","title":"transform(X)
","text":"Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs', 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.
Parameters:
X (DataFrame): The input features. Required.
Returns:
DataFrame: The transformed DataFrame with additional columns.
Source code in src/sageworks/algorithms/dataframe/residuals_calculator.py
def transform(self, X: pd.DataFrame) -> pd.DataFrame:\n \"\"\"\n Transforms the input DataFrame by adding 'prediction', 'residuals', 'residuals_abs',\n 'prediction_100', 'residuals_100', and 'residuals_100_abs' columns.\n\n Args:\n X (pd.DataFrame): The input features.\n\n Returns:\n pd.DataFrame: The transformed DataFrame with additional columns.\n \"\"\"\n check_is_fitted(self, [\"X\", \"y\"]) # Ensure fit has been called\n\n if self.endpoint:\n # If an endpoint is provided, run inference on the full data\n result_df = self._run_inference_via_endpoint(X)\n else:\n # If no endpoint, perform cross-validation and full model fitting\n result_df = self._run_cross_validation(X)\n\n return result_df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction","title":"DimensionalityReduction
","text":"Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
class DimensionalityReduction:\n def __init__(self):\n \"\"\"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.projection_model = None\n self.features = None\n\n def fit_transform(self, df: pd.DataFrame, features: list = None, projection: str = \"TSNE\") -> pd.DataFrame:\n \"\"\"Fit and Transform the DataFrame\n Args:\n df: Pandas DataFrame\n features: List of feature column names (default: None)\n projection: The projection model to use (TSNE, MDS or PCA, default: TSNE)\n Returns:\n Pandas DataFrame with new columns x and y\n \"\"\"\n\n # If no features are given, identify all numeric columns\n if features is None:\n features = [x for x in df.select_dtypes(include=\"number\").columns.tolist() if not x.endswith(\"id\")]\n # Also drop group_count if it exists\n features = [x for x in features if x != \"group_count\"]\n self.log.info(\"No features given, auto identifying numeric columns...\")\n self.log.info(f\"{features}\")\n self.features = features\n\n # Sanity checks\n if not all(column in df.columns for column in self.features):\n self.log.critical(\"Some features are missing in the DataFrame\")\n return df\n if len(self.features) < 2:\n self.log.critical(\"At least two features are required\")\n return df\n if df.empty:\n self.log.critical(\"DataFrame is empty\")\n return df\n\n # Most projection models will fail if there are any NaNs in the data\n # So we'll fill NaNs with the mean value for that column\n # (assign the result back; chained inplace fillna is unreliable in recent pandas)\n for col in df[self.features].columns:\n df[col] = df[col].fillna(df[col].mean())\n\n # Normalize the features\n scaler = StandardScaler()\n normalized_data = scaler.fit_transform(df[self.features])\n df[self.features] = normalized_data\n\n # Project the multidimensional features onto an x,y plane\n self.log.info(\"Projecting features onto an x,y plane...\")\n\n # Perform the projection\n if projection == \"TSNE\":\n # Perplexity is a hyperparameter that controls the number 
of neighbors used to compute the manifold\n # The number of neighbors should be less than the number of samples\n perplexity = min(40, len(df) - 1)\n self.log.info(f\"Perplexity: {perplexity}\")\n self.projection_model = TSNE(perplexity=perplexity)\n elif projection == \"MDS\":\n self.projection_model = MDS(n_components=2, random_state=0)\n elif projection == \"PCA\":\n self.projection_model = PCA(n_components=2)\n\n # Fit the projection model\n # Hack PCA + TSNE to work together\n projection = self.projection_model.fit_transform(df[self.features])\n\n # Put the projection results back into the given DataFrame\n df[\"x\"] = projection[:, 0] # Projection X Column\n df[\"y\"] = projection[:, 1] # Projection Y Column\n\n # Jitter the data to resolve coincident points\n # df = self.resolve_coincident_points(df)\n\n # Return the DataFrame with the new columns\n return df\n\n @staticmethod\n def resolve_coincident_points(df: pd.DataFrame):\n \"\"\"Resolve coincident points in a DataFrame\n Args:\n df(pd.DataFrame): The DataFrame to resolve coincident points in\n Returns:\n pd.DataFrame: The DataFrame with resolved coincident points\n \"\"\"\n # Adding zero-mean Gaussian jitter to the projection\n # (np.random.normal takes loc, scale, size; 10% of the axis range is used as the scale)\n x_scale = (df[\"x\"].max() - df[\"x\"].min()) * 0.1\n y_scale = (df[\"y\"].max() - df[\"y\"].min()) * 0.1\n df[\"x\"] += np.random.normal(0.0, x_scale, len(df))\n df[\"y\"] += np.random.normal(0.0, y_scale, len(df))\n return df\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.__init__","title":"__init__()
","text":"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
def __init__(self):\n \"\"\"DimensionalityReduction: Perform Dimensionality Reduction on a DataFrame\"\"\"\n self.log = logging.getLogger(\"sageworks\")\n self.projection_model = None\n self.features = None\n
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.fit_transform","title":"fit_transform(df, features=None, projection='TSNE')
","text":"Fit and Transform the DataFrame Args: df: Pandas DataFrame features: List of feature column names (default: None) projection: The projection model to use (TSNE, MDS or PCA, default: TSNE) Returns: Pandas DataFrame with new columns x and y
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
def fit_transform(self, df: pd.DataFrame, features: list = None, projection: str = \"TSNE\") -> pd.DataFrame:\n \"\"\"Fit and Transform the DataFrame\n Args:\n df: Pandas DataFrame\n features: List of feature column names (default: None)\n projection: The projection model to use (TSNE, MDS or PCA, default: TSNE)\n Returns:\n Pandas DataFrame with new columns x and y\n \"\"\"\n\n # If no features are given, identify all numeric columns\n if features is None:\n features = [x for x in df.select_dtypes(include=\"number\").columns.tolist() if not x.endswith(\"id\")]\n # Also drop group_count if it exists\n features = [x for x in features if x != \"group_count\"]\n self.log.info(\"No features given, auto identifying numeric columns...\")\n self.log.info(f\"{features}\")\n self.features = features\n\n # Sanity checks\n if not all(column in df.columns for column in self.features):\n self.log.critical(\"Some features are missing in the DataFrame\")\n return df\n if len(self.features) < 2:\n self.log.critical(\"At least two features are required\")\n return df\n if df.empty:\n self.log.critical(\"DataFrame is empty\")\n return df\n\n # Most projection models will fail if there are any NaNs in the data\n # So we'll fill NaNs with the mean value for that column\n # (assign the result back; chained inplace fillna is unreliable in recent pandas)\n for col in df[self.features].columns:\n df[col] = df[col].fillna(df[col].mean())\n\n # Normalize the features\n scaler = StandardScaler()\n normalized_data = scaler.fit_transform(df[self.features])\n df[self.features] = normalized_data\n\n # Project the multidimensional features onto an x,y plane\n self.log.info(\"Projecting features onto an x,y plane...\")\n\n # Perform the projection\n if projection == \"TSNE\":\n # Perplexity is a hyperparameter that controls the number of neighbors used to compute the manifold\n # The number of neighbors should be less than the number of samples\n perplexity = min(40, len(df) - 1)\n self.log.info(f\"Perplexity: {perplexity}\")\n self.projection_model = 
TSNE(perplexity=perplexity)\n elif projection == \"MDS\":\n self.projection_model = MDS(n_components=2, random_state=0)\n elif projection == \"PCA\":\n self.projection_model = PCA(n_components=2)\n\n # Fit the projection model\n # Hack PCA + TSNE to work together\n projection = self.projection_model.fit_transform(df[self.features])\n\n # Put the projection results back into the given DataFrame\n df[\"x\"] = projection[:, 0] # Projection X Column\n df[\"y\"] = projection[:, 1] # Projection Y Column\n\n # Jitter the data to resolve coincident points\n # df = self.resolve_coincident_points(df)\n\n # Return the DataFrame with the new columns\n return df\n
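The preprocessing steps in fit_transform (fill NaNs with the column mean, standardize, project to two dimensions) can be sketched with NumPy alone. This is an illustrative stand-in, assuming a PCA-style projection computed via SVD rather than the scikit-learn models used above; `project_to_xy` is a hypothetical helper name.

```python
import numpy as np


def project_to_xy(features):
    # Hypothetical helper; features is an (n_samples, n_features) array-like
    data = np.array(features, dtype=float)

    # Fill NaNs with the column mean (computed while ignoring NaNs)
    col_means = np.nanmean(data, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(data))
    data[nan_rows, nan_cols] = col_means[nan_cols]

    # Standardize each column; constant columns are left at zero
    std = data.std(axis=0)
    std[std == 0] = 1.0
    data = (data - data.mean(axis=0)) / std

    # PCA via SVD: project onto the first two principal directions
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return data @ vt[:2].T


features = [
    [1.0, 1.0, 0.1],
    [1.1, np.nan, 0.2],
    [3.0, 3.0, 1.6],
    [4.0, 4.0, 2.5],
]
xy = project_to_xy(features)  # column 0 plays the role of x, column 1 of y
```

Filling NaNs before scaling matters because the projection step cannot handle missing values, and scaling keeps a single wide-ranged feature from dominating the result.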
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.DimensionalityReduction.resolve_coincident_points","title":"resolve_coincident_points(df)
staticmethod
","text":"Resolve coincident points in a DataFrame Args: df(pd.DataFrame): The DataFrame to resolve coincident points in Returns: pd.DataFrame: The DataFrame with resolved coincident points
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
@staticmethod\ndef resolve_coincident_points(df: pd.DataFrame):\n \"\"\"Resolve coincident points in a DataFrame\n Args:\n df(pd.DataFrame): The DataFrame to resolve coincident points in\n Returns:\n pd.DataFrame: The DataFrame with resolved coincident points\n \"\"\"\n # Adding zero-mean Gaussian jitter to the projection\n # (np.random.normal takes loc, scale, size; 10% of the axis range is used as the scale)\n x_scale = (df[\"x\"].max() - df[\"x\"].min()) * 0.1\n y_scale = (df[\"y\"].max() - df[\"y\"].min()) * 0.1\n df[\"x\"] += np.random.normal(0.0, x_scale, len(df))\n df[\"y\"] += np.random.normal(0.0, y_scale, len(df))\n return df\n
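Jitter separates points that land on exactly the same coordinates. Note that `np.random.normal` takes (loc, scale, size), so its first argument is the mean of the noise rather than a lower bound. A minimal sketch of zero-centered jitter follows, assuming the intent is to nudge points without shifting the whole cloud; `jitter`, `fraction`, and `seed` are hypothetical names.

```python
import numpy as np


def jitter(values, fraction=0.1, seed=None):
    # Hypothetical helper: zero-mean Gaussian noise scaled to a fraction of the range
    values = np.asarray(values, dtype=float)
    scale = (values.max() - values.min()) * fraction
    rng = np.random.default_rng(seed)
    return values + rng.normal(loc=0.0, scale=scale, size=len(values))


x = np.array([0.0, 0.0, 1.0, 1.0, 2.0])
x_jittered = jitter(x, seed=0)  # the two coincident pairs no longer overlap
```

When every value is identical the range is zero, so the scale is zero and the values come back unchanged.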
"},{"location":"data_algorithms/dataframes/overview/#sageworks.algorithms.dataframe.dimensionality_reduction.test","title":"test()
","text":"Test for the Dimensionality Reduction Class
Source code in src/sageworks/algorithms/dataframe/dimensionality_reduction.py
def test():\n \"\"\"Test for the Dimensionality Reduction Class\"\"\"\n # Set some pandas options\n pd.set_option(\"display.max_columns\", None)\n pd.set_option(\"display.width\", 1000)\n\n # Make some fake data\n data = {\n \"ID\": [\n \"id_0\",\n \"id_0\",\n \"id_2\",\n \"id_3\",\n \"id_4\",\n \"id_5\",\n \"id_6\",\n \"id_7\",\n \"id_8\",\n \"id_9\",\n ],\n \"feat1\": [1.0, 1.0, 1.1, 3.0, 4.0, 1.0, 1.0, 1.1, 3.0, 4.0],\n \"feat2\": [1.0, 1.0, 1.1, 3.0, 4.0, 1.0, 1.0, 1.1, 3.0, 4.0],\n \"feat3\": [0.1, 0.1, 0.2, 1.6, 2.5, 0.1, 0.1, 0.2, 1.6, 2.5],\n \"price\": [31, 60, 62, 40, 20, 31, 61, 60, 40, 20],\n }\n data_df = pd.DataFrame(data)\n features = [\"feat1\", \"feat2\", \"feat3\"]\n\n # Create the class and run the dimensionality reduction\n projection = DimensionalityReduction()\n new_df = projection.fit_transform(data_df, features=features, projection=\"TSNE\")\n\n # Check that the x and y columns were added\n assert \"x\" in new_df.columns\n assert \"y\" in new_df.columns\n\n # Output the DataFrame\n print(new_df)\n
"},{"location":"data_algorithms/dataframes/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/graphs/overview/","title":"Graph Algorithms","text":"Graph Algorithms
WIP: These classes are currently actively being developed and are subject to change in both API and functionality over time.
Graph Algorithms
Docs TBD
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph","title":"ProximityGraph
","text":"Build a proximity graph of the nearest neighbors based on feature space.
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
class ProximityGraph:\n \"\"\"\n Build a proximity graph of the nearest neighbors based on feature space.\n \"\"\"\n\n def __init__(self, n_neighbors: int = 5):\n \"\"\"Initialize the ProximityGraph with the specified parameters.\n\n Args:\n n_neighbors (int): Number of neighbors to consider (default: 5)\n \"\"\"\n self.n_neighbors = n_neighbors\n self.nx_graph = nx.Graph()\n\n def build_graph(\n self,\n df: pd.DataFrame,\n features: list,\n id_column: str,\n target: str,\n store_features=True,\n ) -> nx.Graph:\n \"\"\"\n Processes the input DataFrame and builds a proximity graph.\n\n Args:\n df (pd.DataFrame): The input DataFrame containing feature columns.\n features (list): List of feature column names to be used for building the proximity graph.\n id_column (str): Name of the ID column in the DataFrame.\n target (str): Name of the target column in the DataFrame.\n store_features (bool): Whether to store the features as node attributes (default: True).\n\n Returns:\n nx.Graph: The proximity graph as a NetworkX graph.\n \"\"\"\n # Drop NaNs from the DataFrame using the provided utility\n df = drop_nans(df)\n\n # Initialize FeatureSpaceProximity with the input DataFrame and the specified features\n knn_spider = FeatureSpaceProximity(\n df,\n features=features,\n id_column=id_column,\n target=target,\n neighbors=self.n_neighbors,\n )\n\n # Use FeatureSpaceProximity to get all neighbor indices and distances\n indices, distances = knn_spider.get_neighbor_indices_and_distances()\n\n # Compute max distance for scaling (to [0, 1])\n max_distance = distances.max()\n\n # Initialize an empty graph\n self.nx_graph = nx.Graph()\n\n # Use the ID column for node IDs instead of relying on the DataFrame index\n node_ids = df[id_column].values\n\n # Add nodes with their features as attributes using the ID column\n for node_id in node_ids:\n if store_features:\n self.nx_graph.add_node(\n node_id, **df[df[id_column] == node_id].iloc[0].to_dict()\n ) # Use .iloc[0] for correct node 
attributes\n else:\n self.nx_graph.add_node(node_id)\n\n # Add edges with weights based on inverse distance\n for i, neighbors in enumerate(indices):\n one_edge_added = False\n for j, neighbor_idx in enumerate(neighbors):\n if i != neighbor_idx:\n # Compute the weight of the edge (inverse of distance)\n weight = 1.0 - (distances[i][j] / max_distance) # Scale to [0, 1]\n\n # Map back to the ID column instead of the DataFrame index\n src_node = node_ids[i]\n dst_node = node_ids[neighbor_idx]\n\n # Add the edge to the graph (if the weight is greater than 0.1)\n if weight > 0.1 or not one_edge_added:\n self.nx_graph.add_edge(src_node, dst_node, weight=weight)\n one_edge_added = True\n\n # Return the NetworkX graph\n return self.nx_graph\n\n def get_neighborhood(self, node_id: Union[str, int], radius: int = 1) -> nx.Graph:\n \"\"\"\n Get a subgraph containing nodes within a given number of hops around a specific node.\n\n Args:\n node_id: The ID of the node to center the neighborhood around.\n radius: The number of hops to consider around the node (default: 1).\n\n Returns:\n nx.Graph: A subgraph containing the specified neighborhood.\n \"\"\"\n # Use NetworkX's ego_graph to extract the neighborhood within the given radius\n if node_id in self.nx_graph:\n return nx.ego_graph(self.nx_graph, node_id, radius=radius)\n else:\n raise ValueError(f\"Node ID '{node_id}' not found in the graph.\")\n
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.__init__","title":"__init__(n_neighbors=5)
","text":"Initialize the ProximityGraph with the specified parameters.
Parameters:
n_neighbors (int): Number of neighbors to consider. Default: 5.
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
def __init__(self, n_neighbors: int = 5):\n \"\"\"Initialize the ProximityGraph with the specified parameters.\n\n Args:\n n_neighbors (int): Number of neighbors to consider (default: 5)\n \"\"\"\n self.n_neighbors = n_neighbors\n self.nx_graph = nx.Graph()\n
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.build_graph","title":"build_graph(df, features, id_column, target, store_features=True)
","text":"Processes the input DataFrame and builds a proximity graph.
Parameters:
df (DataFrame): The input DataFrame containing feature columns. Required.
features (list): List of feature column names to be used for building the proximity graph. Required.
id_column (str): Name of the ID column in the DataFrame. Required.
target (str): Name of the target column in the DataFrame. Required.
store_features (bool): Whether to store the features as node attributes. Default: True.
Returns:
Graph: The proximity graph as a NetworkX graph.
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
def build_graph(\n self,\n df: pd.DataFrame,\n features: list,\n id_column: str,\n target: str,\n store_features=True,\n) -> nx.Graph:\n \"\"\"\n Processes the input DataFrame and builds a proximity graph.\n\n Args:\n df (pd.DataFrame): The input DataFrame containing feature columns.\n features (list): List of feature column names to be used for building the proximity graph.\n id_column (str): Name of the ID column in the DataFrame.\n target (str): Name of the target column in the DataFrame.\n store_features (bool): Whether to store the features as node attributes (default: True).\n\n Returns:\n nx.Graph: The proximity graph as a NetworkX graph.\n \"\"\"\n # Drop NaNs from the DataFrame using the provided utility\n df = drop_nans(df)\n\n # Initialize FeatureSpaceProximity with the input DataFrame and the specified features\n knn_spider = FeatureSpaceProximity(\n df,\n features=features,\n id_column=id_column,\n target=target,\n neighbors=self.n_neighbors,\n )\n\n # Use FeatureSpaceProximity to get all neighbor indices and distances\n indices, distances = knn_spider.get_neighbor_indices_and_distances()\n\n # Compute max distance for scaling (to [0, 1])\n max_distance = distances.max()\n\n # Initialize an empty graph\n self.nx_graph = nx.Graph()\n\n # Use the ID column for node IDs instead of relying on the DataFrame index\n node_ids = df[id_column].values\n\n # Add nodes with their features as attributes using the ID column\n for node_id in node_ids:\n if store_features:\n self.nx_graph.add_node(\n node_id, **df[df[id_column] == node_id].iloc[0].to_dict()\n ) # Use .iloc[0] for correct node attributes\n else:\n self.nx_graph.add_node(node_id)\n\n # Add edges with weights based on inverse distance\n for i, neighbors in enumerate(indices):\n one_edge_added = False\n for j, neighbor_idx in enumerate(neighbors):\n if i != neighbor_idx:\n # Compute the weight of the edge (inverse of distance)\n weight = 1.0 - (distances[i][j] / max_distance) # Scale to [0, 1]\n\n # Map 
back to the ID column instead of the DataFrame index\n src_node = node_ids[i]\n dst_node = node_ids[neighbor_idx]\n\n # Add the edge to the graph (if the weight is greater than 0.1)\n if weight > 0.1 or not one_edge_added:\n self.nx_graph.add_edge(src_node, dst_node, weight=weight)\n one_edge_added = True\n\n # Return the NetworkX graph\n return self.nx_graph\n
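The edge rule in build_graph (scale distances to [0, 1], invert them into weights, keep edges above 0.1 plus one fallback edge per node) can be traced on a tiny example. This is a sketch with plain arrays instead of a NetworkX graph; `proximity_edges` and `min_weight` are hypothetical names, and the zero-distance guard is an added assumption that the original code does not include.

```python
import numpy as np


def proximity_edges(indices, distances, min_weight=0.1):
    # Hypothetical helper mirroring the edge rule with plain arrays
    distances = np.asarray(distances, dtype=float)
    max_distance = distances.max()
    if max_distance == 0:
        max_distance = 1.0  # assumption: all points coincide, so every weight becomes 1.0

    edges = []
    for i, neighbors in enumerate(indices):
        one_edge_added = False
        for j, neighbor_idx in enumerate(neighbors):
            if i == neighbor_idx:
                continue  # skip self-matches
            weight = 1.0 - distances[i][j] / max_distance
            # Keep strong edges, plus one fallback edge so no node is isolated
            if weight > min_weight or not one_edge_added:
                edges.append((i, int(neighbor_idx), float(weight)))
                one_edge_added = True
    return edges


indices = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]
distances = [[0.0, 1.0, 10.0], [0.0, 1.0, 9.5], [0.0, 9.5, 10.0]]
edges = proximity_edges(indices, distances)
```

Node 2 is far from everything, so its best edge has a weight of only about 0.05, yet it is kept as the fallback edge. In an undirected graph the pairs (0, 1) and (1, 0) would collapse into a single edge.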
"},{"location":"data_algorithms/graphs/overview/#sageworks.algorithms.graph.light.proximity_graph.ProximityGraph.get_neighborhood","title":"get_neighborhood(node_id, radius=1)
","text":"Get a subgraph containing nodes within a given number of hops around a specific node.
Parameters:
node_id (Union[str, int]): The ID of the node to center the neighborhood around. Required.
radius (int): The number of hops to consider around the node. Default: 1.
Returns:
Graph: A subgraph containing the specified neighborhood.
Source code in src/sageworks/algorithms/graph/light/proximity_graph.py
def get_neighborhood(self, node_id: Union[str, int], radius: int = 1) -> nx.Graph:\n \"\"\"\n Get a subgraph containing nodes within a given number of hops around a specific node.\n\n Args:\n node_id: The ID of the node to center the neighborhood around.\n radius: The number of hops to consider around the node (default: 1).\n\n Returns:\n nx.Graph: A subgraph containing the specified neighborhood.\n \"\"\"\n # Use NetworkX's ego_graph to extract the neighborhood within the given radius\n if node_id in self.nx_graph:\n return nx.ego_graph(self.nx_graph, node_id, radius=radius)\n else:\n raise ValueError(f\"Node ID '{node_id}' not found in the graph.\")\n
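The idea of a neighborhood within a given number of hops can be sketched with a breadth-first search over a plain adjacency dictionary. This is an illustrative stand-in for the node set that `nx.ego_graph` selects, not the NetworkX implementation; `neighborhood_nodes` and `adjacency` are hypothetical names.

```python
from collections import deque


def neighborhood_nodes(adjacency, node_id, radius=1):
    # Hypothetical stand-in: returns the nodes within `radius` hops of node_id
    if node_id not in adjacency:
        raise ValueError(f'Node ID {node_id!r} not found in the graph.')

    hops = {node_id: 0}
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        if hops[current] == radius:
            continue  # do not expand past the requested radius
        for neighbor in adjacency[current]:
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return set(hops)


adjacency = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
one_hop = neighborhood_nodes(adjacency, 'a', radius=1)
two_hops = neighborhood_nodes(adjacency, 'a', radius=2)
```

The center node is always part of its own neighborhood, and each extra hop of radius widens the set by one layer of neighbors.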
"},{"location":"data_algorithms/graphs/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"data_algorithms/spark/overview/","title":"Graph Algorithms","text":"Graph Algorithms
WIP: These classes are currently actively being developed and are subject to change in both API and functionality over time.
Graph Algorithms
Docs TBD
ComputationView Class: Create a View with a subset of columns for computation purposes
"},{"location":"data_algorithms/spark/overview/#sageworks.core.views.computation_view.ComputationView","title":"ComputationView
","text":" Bases: ColumnSubsetView
ComputationView Class: Create a View with a subset of columns for computation purposes
Common Usage:
# Create a default ComputationView\nfs = FeatureSet(\"test_features\")\ncomp_view = ComputationView.create(fs)\ndf = comp_view.pull_dataframe()\n\n# Create a ComputationView with a specific set of columns\ncomp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n# Query the view\ndf = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n
Source code in src/sageworks/core/views/computation_view.py
class ComputationView(ColumnSubsetView):\n \"\"\"ComputationView Class: Create a View with a subset of columns for computation purposes\n\n Common Usage:\n ```python\n # Create a default ComputationView\n fs = FeatureSet(\"test_features\")\n comp_view = ComputationView.create(fs)\n df = comp_view.pull_dataframe()\n\n # Create a ComputationView with a specific set of columns\n comp_view = ComputationView.create(fs, column_list=[\"my_col1\", \"my_col2\"])\n\n # Query the view\n df = comp_view.query(f\"SELECT * FROM {comp_view.table} where prediction > 0.5\")\n ```\n \"\"\"\n\n @classmethod\n def create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n ) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"data_algorithms/spark/overview/#sageworks.core.views.computation_view.ComputationView.create","title":"create(artifact, source_table=None, column_list=None, column_limit=30)
classmethod
","text":"Factory method to create and return a ComputationView instance.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| artifact | Union[DataSource, FeatureSet] | The DataSource or FeatureSet object | required |
| source_table | str | The table/view to create the view from. Defaults to None | None |
| column_list | Union[list[str], None] | A list of columns to include. Defaults to None. | None |
| column_limit | int | The max number of columns to include. Defaults to 30. | 30 |
Returns:
| Type | Description |
| --- | --- |
| Union[View, None] | The created View object (or None if failed to create the view) |
Source code in src/sageworks/core/views/computation_view.py
@classmethod\ndef create(\n cls,\n artifact: Union[DataSource, FeatureSet],\n source_table: str = None,\n column_list: Union[list[str], None] = None,\n column_limit: int = 30,\n) -> Union[View, None]:\n \"\"\"Factory method to create and return a ComputationView instance.\n\n Args:\n artifact (Union[DataSource, FeatureSet]): The DataSource or FeatureSet object\n source_table (str, optional): The table/view to create the view from. Defaults to None\n column_list (Union[list[str], None], optional): A list of columns to include. Defaults to None.\n column_limit (int, optional): The max number of columns to include. Defaults to 30.\n\n Returns:\n Union[View, None]: The created View object (or None if failed to create the view)\n \"\"\"\n # Use the create logic directly from ColumnSubsetView with the \"computation\" view name\n return ColumnSubsetView.create(\"computation\", artifact, source_table, column_list, column_limit)\n
"},{"location":"data_algorithms/spark/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"data_algorithms/sql/overview/","title":"SQL Algorithms","text":"SQL Algorithms
One of the main benefits of SQL Algorithms is that the 'heavy lifting' is all done on the SQL Database, so if you have large datasets this is the place for you.
SQL: SQL queries that provide a wide range of functionality:
SQL based Outliers: Compute outliers for all the columns in a DataSource using SQL
SQL based Descriptive Stats: Compute Descriptive Stats for all the numeric columns in a DataSource using SQL
SQL based Correlations: Compute Correlations for all the numeric columns in a DataSource using SQL
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers","title":"Outliers
","text":"Outliers: Class to compute outliers for all the columns in a DataSource using SQL
Source code in src/sageworks/algorithms/sql/outliers.py
class Outliers:\n \"\"\"Outliers: Class to compute outliers for all the columns in a DataSource using SQL\"\"\"\n\n def __init__(self):\n \"\"\"SQLOutliers Initialization\"\"\"\n self.outlier_group = \"unknown\"\n\n def compute_outliers(\n self, data_source: DataSourceAbstract, scale: float = 1.5, use_stddev: bool = False\n ) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5)\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers for this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Note: If use_stddev is True, then the scale parameter needs to be adjusted\n if use_stddev and scale == 1.5: # If the default scale is used, adjust it\n scale = 2.5\n\n # Compute the numeric outliers\n outlier_df = self._numeric_outliers(data_source, scale, use_stddev)\n\n # If there are no outliers, return a DataFrame with the computation columns but no rows\n if outlier_df is None:\n columns = data_source.view(\"computation\").columns\n return pd.DataFrame(columns=columns + [\"outlier_group\"])\n\n # Get the top N outliers for each outlier group\n outlier_df = self.get_top_n_outliers(outlier_df)\n\n # Make sure the dataframe isn't too big, if it's too big sample it down\n if len(outlier_df) > 300:\n log.important(f\"Outliers DataFrame is too large {len(outlier_df)}, sampling down to 300 rows\")\n outlier_df = outlier_df.sample(300)\n\n # Sort by outlier_group and reset the index\n outlier_df = outlier_df.sort_values(\"outlier_group\").reset_index(drop=True)\n\n # Shorten any long string values\n outlier_df = shorten_values(outlier_df)\n return outlier_df\n\n 
def _numeric_outliers(self, data_source: DataSourceAbstract, scale: float, use_stddev=False) -> pd.DataFrame:\n \"\"\"Internal method to compute outliers for all numeric columns\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for the IQR outlier calculation\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of all the outliers combined\n \"\"\"\n\n # Grab the column stats and descriptive stats for this DataSource\n column_stats = data_source.column_stats()\n descriptive_stats = data_source.descriptive_stats()\n\n # If there are no numeric columns, return None\n if not descriptive_stats:\n log.warning(\"No numeric columns found in the current computation view of the DataSource\")\n log.warning(\"If the data source was created from a DataFrame, ensure that the DataFrame was properly typed\")\n log.warning(\"Recommendation: Properly type the DataFrame and recreate the SageWorks artifact\")\n return None\n\n # Get the column names and types from the DataSource\n column_details = data_source.view(\"computation\").column_details()\n\n # For every column in the data_source that is numeric get the outliers\n # This loop computes the columns, lower bounds, and upper bounds for the SQL query\n log.info(\"Computing Outliers for numeric columns...\")\n numeric = [\"tinyint\", \"smallint\", \"int\", \"bigint\", \"float\", \"double\", \"decimal\"]\n columns = []\n lower_bounds = []\n upper_bounds = []\n for column, data_type in column_details.items():\n if data_type in numeric:\n # Skip columns that just have one value (or are all nans)\n if column_stats[column][\"unique\"] <= 1:\n log.info(f\"Skipping unary column {column} with value {descriptive_stats[column]['min']}\")\n continue\n\n # Skip columns that are 'binary' columns\n if column_stats[column][\"unique\"] == 2:\n log.info(f\"Skipping binary column {column}\")\n 
continue\n\n # Do they want to use the stddev instead of IQR?\n if use_stddev:\n mean = descriptive_stats[column][\"mean\"]\n stddev = descriptive_stats[column][\"stddev\"]\n lower_bound = mean - (stddev * scale)\n upper_bound = mean + (stddev * scale)\n\n # Compute the IQR for this column\n else:\n iqr = descriptive_stats[column][\"q3\"] - descriptive_stats[column][\"q1\"]\n lower_bound = descriptive_stats[column][\"q1\"] - (iqr * scale)\n upper_bound = descriptive_stats[column][\"q3\"] + (iqr * scale)\n\n # Add the column, lower bound, and upper bound to the lists\n columns.append(column)\n lower_bounds.append(lower_bound)\n upper_bounds.append(upper_bound)\n\n # Compute the SQL query\n query = self._multi_column_outlier_query(data_source, columns, lower_bounds, upper_bounds)\n outlier_df = data_source.query(query)\n\n # Label the outlier groups\n outlier_df = self._label_outlier_groups(outlier_df, columns, lower_bounds, upper_bounds)\n return outlier_df\n\n @staticmethod\n def _multi_column_outlier_query(\n data_source: DataSourceAbstract, columns: list, lower_bounds: list, upper_bounds: list\n ) -> str:\n \"\"\"Internal method to compute outliers for multiple columns\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n columns(list): The columns to compute outliers on\n lower_bounds(list): The lower bounds for outliers\n upper_bounds(list): The upper bounds for outliers\n Returns:\n str: A SQL query to compute outliers for multiple columns\n \"\"\"\n # Grab the DataSource computation table name\n table = data_source.view(\"computation\").table\n\n # Get the column names and types from the DataSource\n column_details = data_source.view(\"computation\").column_details()\n sql_columns = \", \".join([f'\"{col}\"' for col in column_details.keys()])\n\n query = f'SELECT {sql_columns} FROM \"{table}\" WHERE '\n for col, lb, ub in zip(columns, lower_bounds, upper_bounds):\n query += f\"({col} < {lb} OR {col} > {ub}) OR \"\n query = 
query[:-4]\n\n # Add a limit just in case\n query += \" LIMIT 5000\"\n return query\n\n @staticmethod\n def _label_outlier_groups(\n outlier_df: pd.DataFrame, columns: list, lower_bounds: list, upper_bounds: list\n ) -> pd.DataFrame:\n \"\"\"Internal method to label outliers by group.\n Args:\n outlier_df(pd.DataFrame): The DataFrame of outliers\n columns(list): The columns for which to compute outliers\n lower_bounds(list): The lower bounds for each column\n upper_bounds(list): The upper bounds for each column\n Returns:\n pd.DataFrame: A DataFrame with an added 'outlier_group' column, indicating the type of outlier.\n \"\"\"\n\n column_outlier_dfs = []\n for col, lb, ub in zip(columns, lower_bounds, upper_bounds):\n mask_low = outlier_df[col] < lb\n mask_high = outlier_df[col] > ub\n\n low_df = outlier_df[mask_low].copy()\n low_df[\"outlier_group\"] = f\"{col}_low\"\n\n high_df = outlier_df[mask_high].copy()\n high_df[\"outlier_group\"] = f\"{col}_high\"\n\n column_outlier_dfs.extend([low_df, high_df])\n\n # If there are no outliers, return the original DataFrame with an empty 'outlier_group' column\n if not column_outlier_dfs:\n log.critical(\"No outliers found in the data source.. 
probably something is wrong\")\n return outlier_df.assign(outlier_group=\"\")\n\n # Concatenate the DataFrames and return\n return pd.concat(column_outlier_dfs, ignore_index=True)\n\n @staticmethod\n def get_top_n_outliers(outlier_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:\n \"\"\"Function to retrieve the top N highest and lowest outliers for each outlier group.\n\n Args:\n outlier_df (pd.DataFrame): The DataFrame of outliers with 'outlier_group' column\n n (int): Number of top outliers to retrieve for each group, defaults to 10\n\n Returns:\n pd.DataFrame: A DataFrame containing the top N outliers for each outlier group\n \"\"\"\n\n def get_extreme_values(group: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Helper function to get the top N extreme values from a group.\"\"\"\n col, extreme_type = group.name.rsplit(\"_\", 1)\n if extreme_type == \"low\":\n return group.nsmallest(n, col)\n else:\n return group.nlargest(n, col)\n\n # Group by 'outlier_group' and apply the helper function, explicitly selecting columns\n top_outliers = outlier_df.groupby(\"outlier_group\", group_keys=False).apply(\n get_extreme_values, include_groups=True\n )\n return top_outliers.reset_index(drop=True)\n
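The WHERE clause assembled in _multi_column_outlier_query can be looked at in isolation. The following is a minimal standalone sketch, not the SageWorks implementation: build_outlier_where is a hypothetical helper, and it joins the per-column conditions with ' OR ' instead of trimming a trailing ' OR ' from the string. It also returns None when no columns qualify, since trimming four characters from a query that ends in 'WHERE ' would leave a malformed clause.

```python
def build_outlier_where(columns, lower_bounds, upper_bounds):
    '''Hypothetical helper: one (col < lb OR col > ub) condition per column.'''
    conditions = [
        f'({col} < {lb} OR {col} > {ub})'
        for col, lb, ub in zip(columns, lower_bounds, upper_bounds)
    ]
    # With no qualifying columns there is nothing to filter on
    if not conditions:
        return None
    return ' OR '.join(conditions)


where = build_outlier_where(['length', 'weight'], [0.1, 0.5], [0.9, 2.5])
print(where)
```

A row is selected if it falls outside the bounds of any one column, which is why the labeling step afterwards has to work out which column or columns put each row in the result.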
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.__init__","title":"__init__()
","text":"SQLOutliers Initialization
Source code in src/sageworks/algorithms/sql/outliers.py
def __init__(self):\n \"\"\"SQLOutliers Initialization\"\"\"\n self.outlier_group = \"unknown\"\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.compute_outliers","title":"compute_outliers(data_source, scale=1.5, use_stddev=False)
","text":"Compute outliers for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing outliers on scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5) use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False) Returns: pd.DataFrame: A DataFrame of outliers for this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma) The scale parameter can be adjusted to change the IQR multiplier
Source code in src/sageworks/algorithms/sql/outliers.py
def compute_outliers(\n self, data_source: DataSourceAbstract, scale: float = 1.5, use_stddev: bool = False\n) -> pd.DataFrame:\n \"\"\"Compute outliers for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing outliers on\n scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5)\n use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False)\n Returns:\n pd.DataFrame: A DataFrame of outliers for this DataSource\n Notes:\n Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma)\n The scale parameter can be adjusted to change the IQR multiplier\n \"\"\"\n\n # Note: If use_stddev is True, then the scale parameter needs to be adjusted\n if use_stddev and scale == 1.5: # If the default scale is used, adjust it\n scale = 2.5\n\n # Compute the numeric outliers\n outlier_df = self._numeric_outliers(data_source, scale, use_stddev)\n\n # If there are no outliers, return a DataFrame with the computation columns but no rows\n if outlier_df is None:\n columns = data_source.view(\"computation\").columns\n return pd.DataFrame(columns=columns + [\"outlier_group\"])\n\n # Get the top N outliers for each outlier group\n outlier_df = self.get_top_n_outliers(outlier_df)\n\n # Make sure the dataframe isn't too big, if it's too big sample it down\n if len(outlier_df) > 300:\n log.important(f\"Outliers DataFrame is too large {len(outlier_df)}, sampling down to 300 rows\")\n outlier_df = outlier_df.sample(300)\n\n # Sort by outlier_group and reset the index\n outlier_df = outlier_df.sort_values(\"outlier_group\").reset_index(drop=True)\n\n # Shorten any long string values\n outlier_df = shorten_values(outlier_df)\n return outlier_df\n
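The way scale and use_stddev turn descriptive stats into lower and upper bounds can be sketched without a DataSource. This is a minimal illustration based on the listings in this section; outlier_bounds is a hypothetical helper name and the stats values are made up.

```python
def outlier_bounds(stats, scale=1.5, use_stddev=False):
    '''Return (lower, upper) bounds from a dict of descriptive stats.'''
    # Mirror the default adjustment: stddev mode swaps the default 1.5 for 2.5
    if use_stddev and scale == 1.5:
        scale = 2.5
    if use_stddev:
        spread = stats['stddev'] * scale
        return stats['mean'] - spread, stats['mean'] + spread
    # IQR mode: extend the quartiles by a multiple of the interquartile range
    iqr = stats['q3'] - stats['q1']
    return stats['q1'] - iqr * scale, stats['q3'] + iqr * scale


stats = {'q1': 2.0, 'q3': 6.0, 'mean': 4.0, 'stddev': 2.0}
print(outlier_bounds(stats))                   # (-4.0, 12.0)
print(outlier_bounds(stats, use_stddev=True))  # (-1.0, 9.0)
```

One consequence of the default adjustment is that explicitly passing scale=1.5 together with use_stddev=True still yields 2.5 standard deviations, because the explicit value cannot be told apart from the default.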
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.outliers.Outliers.get_top_n_outliers","title":"get_top_n_outliers(outlier_df, n=10)
staticmethod
","text":"Function to retrieve the top N highest and lowest outliers for each outlier group.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| outlier_df | DataFrame | The DataFrame of outliers with 'outlier_group' column | required |
| n | int | Number of top outliers to retrieve for each group, defaults to 10 | 10 |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | A DataFrame containing the top N outliers for each outlier group |
Source code in src/sageworks/algorithms/sql/outliers.py
@staticmethod\ndef get_top_n_outliers(outlier_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:\n \"\"\"Function to retrieve the top N highest and lowest outliers for each outlier group.\n\n Args:\n outlier_df (pd.DataFrame): The DataFrame of outliers with 'outlier_group' column\n n (int): Number of top outliers to retrieve for each group, defaults to 10\n\n Returns:\n pd.DataFrame: A DataFrame containing the top N outliers for each outlier group\n \"\"\"\n\n def get_extreme_values(group: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Helper function to get the top N extreme values from a group.\"\"\"\n col, extreme_type = group.name.rsplit(\"_\", 1)\n if extreme_type == \"low\":\n return group.nsmallest(n, col)\n else:\n return group.nlargest(n, col)\n\n # Group by 'outlier_group' and apply the helper function, explicitly selecting columns\n top_outliers = outlier_df.groupby(\"outlier_group\", group_keys=False).apply(\n get_extreme_values, include_groups=True\n )\n return top_outliers.reset_index(drop=True)\n
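The per-group selection can also be expressed in plain Python, which makes the role of the group label easier to see. This is a sketch under assumptions: rows are plain dicts rather than a DataFrame, top_n_per_group is a hypothetical name, and heapq stands in for the pandas nsmallest/nlargest calls.

```python
from collections import defaultdict
from heapq import nlargest, nsmallest


def top_n_per_group(rows, n=10):
    '''Keep the n most extreme rows for each outlier_group label.'''
    groups = defaultdict(list)
    for row in rows:
        groups[row['outlier_group']].append(row)

    top_rows = []
    for label, members in groups.items():
        # Labels look like <column>_low or <column>_high; split on the last underscore
        col, extreme_type = label.rsplit('_', 1)
        pick = nsmallest if extreme_type == 'low' else nlargest
        top_rows.extend(pick(n, members, key=lambda r: r[col]))
    return top_rows


rows = [
    {'shell_weight': 0.01, 'outlier_group': 'shell_weight_low'},
    {'shell_weight': 0.02, 'outlier_group': 'shell_weight_low'},
    {'shell_weight': 0.03, 'outlier_group': 'shell_weight_low'},
    {'shell_weight': 9.5, 'outlier_group': 'shell_weight_high'},
]
top = top_n_per_group(rows, n=2)
print([r['shell_weight'] for r in top])  # [0.01, 0.02, 9.5]
```

Splitting on the last underscore only is what lets column names that contain underscores, such as shell_weight, survive the round trip through the group label.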
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.descriptive_stats.descriptive_stats","title":"descriptive_stats(data_source)
","text":"Compute Descriptive Stats for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing descriptive stats on Returns: dict(dict): A dictionary of descriptive stats for each column in this format {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4, 'mean': 2.5, 'stddev': 1.5}, 'col2': ...}
Source code in src/sageworks/algorithms/sql/descriptive_stats.py
def descriptive_stats(data_source: DataSourceAbstract) -> dict[dict]:\n \"\"\"Compute Descriptive Stats for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing descriptive stats on\n Returns:\n dict(dict): A dictionary of descriptive stats for each column in this format\n {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4, 'mean': 2.5, 'stddev': 1.5},\n 'col2': ...}\n \"\"\"\n # Grab the DataSource computation view table name\n table = data_source.view(\"computation\").table\n\n # Figure out which columns are numeric\n num_type = [\"double\", \"float\", \"int\", \"bigint\", \"smallint\", \"tinyint\"]\n details = data_source.view(\"computation\").column_details()\n numeric = [column for column, data_type in details.items() if data_type in num_type]\n\n # Sanity Check for numeric columns\n if len(numeric) == 0:\n log.warning(\"No numeric columns found in the current computation view of the DataSource\")\n log.warning(\"If the data source was created from a DataFrame, ensure that the DataFrame was properly typed\")\n log.warning(\"Recommendation: Properly type the DataFrame and recreate the SageWorks artifact\")\n return {}\n\n # Build the query\n query = descriptive_stats_query(numeric, table)\n\n # Run the query\n log.debug(query)\n result_df = data_source.query(query)\n\n # Process the results\n # Note: The result_df is a DataFrame with a single row and a column for each stat metric\n stats_dict = result_df.to_dict(orient=\"index\")[0]\n\n # Convert the dictionary to a nested dictionary\n # Note: The keys are in the format col1__col2\n nested_descriptive_stats = defaultdict(dict)\n for key, value in stats_dict.items():\n col1, col2 = key.split(\"___\")\n nested_descriptive_stats[col1][col2] = value\n\n # Return the nested dictionary\n return dict(nested_descriptive_stats)\n
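The last step, turning the single-row query result into a nested dictionary, relies on the column aliases having the form column, triple underscore, stat name. Here is a minimal sketch of that step on its own; nest_stats is a hypothetical helper, and it uses rpartition rather than the split call in the listing so that only the last separator is treated as the boundary.

```python
from collections import defaultdict


def nest_stats(flat_stats, sep='___'):
    '''Turn {col___stat: value} into {col: {stat: value}}.'''
    nested = defaultdict(dict)
    for key, value in flat_stats.items():
        # rpartition splits on the last separator only
        column, _, stat = key.rpartition(sep)
        nested[column][stat] = value
    return dict(nested)


flat = {'length___min': 0.1, 'length___max': 0.8, 'shell_weight___mean': 0.24}
print(nest_stats(flat))
```

Unpacking a plain split into exactly two names would raise a ValueError for a column whose own name contains a triple underscore; splitting on the last separator avoids that, at the cost of assuming stat names never contain one.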
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.descriptive_stats.descriptive_stats_query","title":"descriptive_stats_query(columns, table_name)
","text":"Build a query to compute the descriptive stats for all columns in a table Args: columns(list(str)): The columns to compute descriptive stats on table_name(str): The table to compute descriptive stats on Returns: str: The SQL query to compute descriptive stats
Source code in src/sageworks/algorithms/sql/descriptive_stats.py
def descriptive_stats_query(columns: list[str], table_name: str) -> str:\n \"\"\"Build a query to compute the descriptive stats for all columns in a table\n Args:\n columns(list(str)): The columns to compute descriptive stats on\n table_name(str): The table to compute descriptive stats on\n Returns:\n str: The SQL query to compute descriptive stats\n \"\"\"\n query = f'SELECT <<column_descriptive_stats>> FROM \"{table_name}\"'\n column_descriptive_stats = \"\"\n for c in columns:\n column_descriptive_stats += (\n f'min(\"{c}\") AS \"{c}___min\", '\n f'approx_percentile(\"{c}\", 0.25) AS \"{c}___q1\", '\n f'approx_percentile(\"{c}\", 0.5) AS \"{c}___median\", '\n f'approx_percentile(\"{c}\", 0.75) AS \"{c}___q3\", '\n f'max(\"{c}\") AS \"{c}___max\", '\n f'avg(\"{c}\") AS \"{c}___mean\", '\n f'stddev(\"{c}\") AS \"{c}___stddev\", '\n )\n query = query.replace(\"<<column_descriptive_stats>>\", column_descriptive_stats[:-2])\n\n # Return the query\n return query\n
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.correlations.correlation_query","title":"correlation_query(columns, table_name)
","text":"Build a query to compute the correlations between columns in a table
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| columns | list(str) | The columns to compute correlations on | required |
| table_name | str | The table to compute correlations on | required |
Returns:
| Type | Description |
| --- | --- |
| str | The SQL query to compute correlations |
Pearson correlation coefficient ranges from -1 to 1:
+1 indicates a perfect positive linear relationship.
-1 indicates a perfect negative linear relationship.
0 indicates no linear relationship.
Source code in src/sageworks/algorithms/sql/correlations.py
def correlation_query(columns: list[str], table_name: str) -> str:\n \"\"\"Build a query to compute the correlations between columns in a table\n\n Args:\n columns (list(str)): The columns to compute correlations on\n table_name (str): The table to compute correlations on\n\n Returns:\n str: The SQL query to compute correlations\n\n Notes: Pearson correlation coefficient ranges from -1 to 1:\n +1 indicates a perfect positive linear relationship.\n -1 indicates a perfect negative linear relationship.\n 0 indicates no linear relationship.\n \"\"\"\n query = f'SELECT <<cross_correlations>> FROM \"{table_name}\"'\n cross_correlations = \"\"\n for c in columns:\n for d in columns:\n if c != d:\n cross_correlations += f'corr(\"{c}\", \"{d}\") AS \"{c}__{d}\", '\n query = query.replace(\"<<cross_correlations>>\", cross_correlations[:-2])\n\n # Return the query\n return query\n
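The nested loops above emit one corr expression for every ordered pair of distinct columns. The sketch below only generates the alias names, to show how the number of expressions grows and what an unordered-pair variant would look like. The unordered variant is a possible design alternative, not something the library does, and the column names are made up.

```python
from itertools import combinations, permutations

columns = ['length', 'diameter', 'height']

# Ordered pairs, matching the nested loops above: n * (n - 1) aliases
ordered = [f'{c}__{d}' for c, d in permutations(columns, 2)]

# Unordered pairs: half as many, enough because corr(c, d) equals corr(d, c)
unordered = [f'{c}__{d}' for c, d in combinations(columns, 2)]

print(len(ordered), len(unordered))  # 6 3
print(unordered)
```

With the ordered form the query result can be turned into a symmetric nested dictionary directly; with the unordered form the caller would need to mirror each value into both directions after the query returns.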
"},{"location":"data_algorithms/sql/overview/#sageworks.algorithms.sql.correlations.correlations","title":"correlations(data_source)
","text":"Compute Correlations for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing correlations on Returns: dict(dict): A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/algorithms/sql/correlations.py
def correlations(data_source: DataSourceAbstract) -> dict[dict]:\n \"\"\"Compute Correlations for all the numeric columns in a DataSource\n Args:\n data_source(DataSource): The DataSource that we're computing correlations on\n Returns:\n dict(dict): A dictionary of correlations for each column in this format\n {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...},\n 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}\n \"\"\"\n data_source.log.info(\"Computing Correlations for numeric columns...\")\n\n # Figure out which columns are numeric\n num_type = [\"double\", \"float\", \"int\", \"bigint\", \"smallint\", \"tinyint\"]\n details = data_source.view(\"computation\").column_details()\n\n # Get the numeric columns\n numeric = [column for column, data_type in details.items() if data_type in num_type]\n\n # If we have at least two numeric columns, compute the correlations\n if len(numeric) < 2:\n return {}\n\n # Grab the DataSource computation table name\n table = data_source.view(\"computation\").table\n\n # Build the query\n query = correlation_query(numeric, table)\n\n # Run the query\n log.debug(query)\n result_df = data_source.query(query)\n\n # Drop any columns that have NaNs\n result_df = result_df.dropna(axis=1)\n\n # Process the results\n # Note: The result_df is a DataFrame with a single row and a column for each pairwise correlation\n correlation_dict = result_df.to_dict(orient=\"index\")[0]\n\n # Convert the dictionary to a nested dictionary\n # Note: The keys are in the format col1__col2\n nested_corr = defaultdict(dict)\n for key, value in correlation_dict.items():\n col1, col2 = key.split(\"__\")\n nested_corr[col1][col2] = value\n\n # Sort the nested dictionaries\n sorted_dict = {}\n for key, sub_dict in nested_corr.items():\n sorted_dict[key] = {k: v for k, v in sorted(sub_dict.items(), key=lambda item: item[1], reverse=True)}\n return sorted_dict\n
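The post-processing of the query result (nest by first column, then sort each inner dictionary from highest to lowest value) can be tried on a small hand-made input. This is a standalone sketch; nest_and_sort is a hypothetical name and the correlation values are invented.

```python
def nest_and_sort(flat_corr):
    '''Turn {col1__col2: r} into {col1: {col2: r}}, highest value first.'''
    nested = {}
    for key, value in flat_corr.items():
        # Assumes column names contain no double underscore of their own
        col1, col2 = key.split('__')
        nested.setdefault(col1, {})[col2] = value
    return {
        col: dict(sorted(pairs.items(), key=lambda item: item[1], reverse=True))
        for col, pairs in nested.items()
    }


flat = {'a__b': 0.5, 'a__c': 0.9, 'b__a': 0.5, 'b__c': -0.2}
result = nest_and_sort(flat)
print(list(result['a']))  # ['c', 'b']
```

Because the sort uses the signed value, strong negative correlations end up last in each inner dictionary. A caller interested in the strength of the relationship regardless of direction would sort by absolute value instead.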
"},{"location":"data_algorithms/sql/overview/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"enterprise/","title":"SageWorks Enterprise","text":"The SageWorks API and User Interfaces cover a broad set of AWS Machine Learning services and provide easy to use abstractions and visualizations of your AWS ML data. We offer a wide range of options to best fit your companies needs.
Accelerate ML Pipeline development with an Enterprise License!
|  | Free | Enterprise: Lite | Enterprise: Standard | Enterprise: Pro |
| --- | --- | --- | --- | --- |
| Python API | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 |
| SageWorks REPL | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 |
| Dashboard | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 |
| AWS Onboarding | \u2796 | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 |
| Dashboard Plugins | \u2796 | \ud83d\udfe2 | \ud83d\udfe2 | \ud83d\udfe2 |
| Custom Pages | \u2796 | \u2796 | \ud83d\udfe2 | \ud83d\udfe2 |
| Themes | \u2796 | \u2796 | \ud83d\udfe2 | \ud83d\udfe2 |
| ML Pipelines | \u2796 | \u2796 | \u2796 | \ud83d\udfe2 |
| Project Branding | \u2796 | \u2796 | \u2796 | \ud83d\udfe2 |
| Prioritized Feature Requests | \u2796 | \u2796 | \u2796 | \ud83d\udfe2 |
| Pricing | \u2796 | $1500* | $3000* | $4000* |
*USD per month, includes AWS setup, support, and training: Everything needed to accelerate your AWS ML Development team.
Interested in Data Science/Engineering consulting? We have top notch Consultants with a depth and breadth of AWS ML/DS/Engineering expertise.
"},{"location":"enterprise/#try-sageworks","title":"Try SageWorks","text":"We encourage new users to try out the free version, first. We offer support in our Discord channel and our Documentation has instructions for how to get started with SageWorks. So try it out and when you're ready to accelerate your AWS ML Adventure with an Enterprise licence contact us at SageWorks Sales
"},{"location":"enterprise/#data-engineeringscience-consulting","title":"Data Engineering/Science Consulting","text":"Alongside our SageWorks Enterprise offerings, we provide comprehensive consulting services and domain expertise through our Partnerships. We specialize in AWS Machine Learning Systems and our extended team of Data Scientists and Engineers, have Masters and Ph.D. degrees in Computer Science, Chemistry, and Pharmacology. We also have a parntership with Nomic Networks to support our Network Security Clients.
Using AWS and SageWorks, our experts are equipped to deliver tailored solutions that are focused on your project needs and deliverables. For more information, please touch base and we'll set up a free initial consultation: SageWorks Consulting
"},{"location":"enterprise/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales
"},{"location":"enterprise/private_saas/","title":"Benefits of a Private SaaS Architecture","text":""},{"location":"enterprise/private_saas/#self-hosted-vs-private-saas-vs-public-saas","title":"Self Hosted vs Private SaaS vs Public SaaS?","text":"At the top level your team/project is making a decision about how they are going to build, expand, support, and maintain a machine learning pipeline.
Conceptual ML Pipeline
Data \u2b95 Features \u2b95 Models \u2b95 Deployment (end-user application)\n
Concrete/Real World Example
S3 \u2b95 Glue Job \u2b95 Data Catalog \u2b95 FeatureGroups \u2b95 Models \u2b95 Endpoints \u2b95 App\n
When building out a framework to support ML Pipelines there are three main options:
The other choice, that we're not going to cover here, is whether you use AWS, Azure, GCP, or something else. SageWorks is architected and powered by a broad and rich set of AWS ML Pipeline services. We believe that AWS provides the best set of functionality and APIs for flexible, real world ML architectures.
"},{"location":"enterprise/private_saas/#resources","title":"Resources","text":"See our full presentation on the SageWorks Private SaaS Architecture
"},{"location":"enterprise/project_branding/","title":"Project Branding","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Project Branding allows you to change page headers, titles, and logos to match your project. All user interfaces will reflect your project name and company logos.
"},{"location":"enterprise/project_branding/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"enterprise/themes/","title":"SageWorks Themes","text":"The SageWorks Dashboard can be customized extensively. Using SageWorks Themes allows you to customize the User Interfaces to suit your preferences, including completely customized color palettes and fonts. We offer a set of default 'dark' and 'light' themes, but we'll also customize the theme to match your company's color palette and logos.
"},{"location":"enterprise/themes/#contact-us","title":"Contact Us","text":"Contact us on our Discord channel, we're happy to answer any questions that you might have about SageWorks and accelerating your AWS ML Pipelines. You can also send us email at SageWorks Info or SageWorks Sales.
"},{"location":"getting_started/","title":"Getting Started","text":"For the initial setup of SageWorks we'll be using the SageWorks REPL. When you start sageworks
it will recognize that it needs to complete the initial configuration and will guide you through that process.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord.
"},{"location":"getting_started/#initial-setupconfig","title":"Initial Setup/Config","text":"Notes: SageWorks uses your existing AWS account/profile/SSO. So if you don't already have an AWS Profile or SSO Setup you'll need to do that first AWS Setup
Okay so after you've completed your AWS Setup you can now install SageWorks.
> pip install sageworks\n> sageworks <-- This starts the REPL\n\nWelcome to SageWorks!\nLooks like this is your first time using SageWorks...\nLet's get you set up...\nAWS_PROFILE: my_aws_profile\nSAGEWORKS_BUCKET: my-company-sageworks\n[optional] REDIS_HOST(localhost): my-redis.cache.amazon (or leave blank)\n[optional] REDIS_PORT(6379):\n[optional] REDIS_PASSWORD():\n[optional] SAGEWORKS_API_KEY(open_source): my_api_key (or leave blank)\n
That's It: You're now all set. This configuration only needs to be done ONCE :)"},{"location":"getting_started/#data-scientistsengineers","title":"Data Scientists/Engineers","text":"For companies that are setting up SageWorks on an internal AWS Account: Company AWS Setup
"},{"location":"getting_started/#additional-resources","title":"Additional Resources","text":"AWS Glue Simplified
AWS Glue Jobs are a great way to automate ETL and data processing. SageWorks takes all the hassle out of creating and debugging Glue Jobs. Follow this guide and empower your Glue Jobs with SageWorks!
SageWorks makes creating, testing, and debugging AWS Glue Jobs easy. The exact same SageWorks API Classes are used in your Glue Jobs. Also, since SageWorks manages the roles for both API and Glue Jobs, you'll be able to test new Glue Jobs locally and minimize surprises when deploying your Glue Job.
"},{"location":"glue/#glue-job-setup","title":"Glue Job Setup","text":"Setting up a AWS Glue Job that uses SageWorks is straight forward. SageWorks can be 'installed' on AWS Glue via the --additional-python-modules
parameter and then you can use the Sageworks API just like normal.
Here are the settings and a screenshot to guide you. There are several ways to set up and run Glue Jobs, with either the SageWorks-ExecutionRole or using the SageWorksAPIPolicy. Please feel free to contact SageWorks support if you need any help with setting up Glue Jobs.
Glue IAM Role Details
If your Glue Jobs already use an existing IAM Role then you can add the SageWorksAPIPolicy
to that Role to enable the Glue Job to perform SageWorks API Tasks.
Anyone familiar with a typical Glue Job should be pleasantly surprised by how simple the example below is. Also, SageWorks allows you to test Glue Jobs locally using the same code that you use for scripts and Notebooks (see Glue Testing).
Glue Job Arguments
AWS Glue Jobs take arguments in the form of Job Parameters (see screenshot above). There's a SageWorks utility function get_resolved_options
that turns these Job Parameters into a nice dictionary for ease of use.
import sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import get_resolved_options\n\n# Convert Glue Job Args to a Dictionary\nglue_args = get_resolved_options(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"sageworks-bucket\"])\n\n# Create a new Data Source from an S3 Path\nsource_path = \"s3://sageworks-public-data/common/abalone.csv\"\nmy_data = DataSource(source_path, name=\"abalone_glue_test\")\n
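To make the idea of turning Job Parameters into a dictionary concrete, here is a minimal standard-library sketch. The helper name parse_job_parameters is hypothetical and this is not the SageWorks implementation of get_resolved_options; it only assumes that parameters arrive as --key value pairs on the command line.

```python
def parse_job_parameters(argv):
    '''Collect --key value pairs from an argv-style list into a dictionary.

    Hypothetical helper for illustration; not the SageWorks implementation.
    '''
    args = {}
    i = 1  # skip the script name in argv[0]
    while i < len(argv):
        token = argv[i]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith('--')
        if token.startswith('--') and has_value:
            # Strip the leading dashes and store the value that follows
            args[token[2:]] = argv[i + 1]
            i += 2
        else:
            # Bare flags and stray values are skipped in this sketch
            i += 1
    return args


example_argv = ['my_glue_job.py', '--sageworks-bucket', 'my-company-sageworks',
                '--input-s3-path', 's3://my-bucket/incoming/']
glue_args = parse_job_parameters(example_argv)
print(glue_args['sageworks-bucket'])
```

The resulting keys keep their dashes, which matches how the examples here index into glue_args.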
"},{"location":"glue/#glue-example-2","title":"Glue Example 2","text":"This example takes two 'Job Parameters'
The example will convert all CSV files in an S3 bucket/prefix and load them up as DataSources in SageWorks.
# examples/glue_load_s3_bucket.py\nimport sys\n\n# SageWorks Imports\nfrom sageworks.api.data_source import DataSource\nfrom sageworks.utils.config_manager import ConfigManager\nfrom sageworks.utils.glue_utils import get_resolved_options, list_s3_files\n\n# Convert Glue Job Args to a Dictionary\nglue_args = get_resolved_options(sys.argv)\n\n# Set the SAGEWORKS_BUCKET for the ConfigManager\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", glue_args[\"sageworks-bucket\"])\n\n# List all the CSV files in the given S3 Path\ninput_s3_path = glue_args[\"input-s3-path\"]\nfor input_file in list_s3_files(input_s3_path):\n\n    # Note: If we don't specify a name, one will be 'auto-generated'\n    my_data = DataSource(input_file, name=None)\n
"},{"location":"glue/#exception-log-forwarding","title":"Exception Log Forwarding","text":"When a Glue Job crashes (has an exception), the AWS console shows only the last line of the exception, which is mostly useless. If you use SageWorks log forwarding, the exception/stack will be forwarded to CloudWatch.
from sageworks.utils.sageworks_logging import exception_log_forward\n\nwith exception_log_forward():\n <my glue code>\n ...\n <exception happens>\n <more of my code>\n
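For readers curious how a context manager of this kind can work, here is a minimal standard-library sketch. The name forward_exceptions is hypothetical, the sketch logs to a normal Python logger rather than CloudWatch, and it re-raises after logging, which is an assumption: the SageWorks internals may differ.

```python
import logging
from contextlib import contextmanager

log = logging.getLogger('glue_job_sketch')


@contextmanager
def forward_exceptions(logger=log):
    '''Log the full stack trace of any exception, then re-raise it.

    Hypothetical stand-in for illustration only.
    '''
    try:
        yield
    except Exception:
        # logger.exception() attaches the whole traceback, not just the last line
        logger.exception('Unhandled exception in job')
        raise


# Capture log output in a list so the effect is easy to inspect
captured = []


class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(self.format(record))


log.addHandler(ListHandler())
log.propagate = False

try:
    with forward_exceptions():
        result = 1 / 0
except ZeroDivisionError:
    print('exception was logged and re-raised')
```

Swapping the list handler for one that ships records to a log service gives the forwarding behavior described in this section.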
The exception_log_forward
sets up a context manager that will trap exceptions and forward the exception/stack to CloudWatch for diagnosis. "},{"location":"glue/#glue-job-local-testing","title":"Glue Job Local Testing","text":"Glue Power without the Pain. SageWorks manages the AWS Execution Role, so local API and Glue Jobs will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Glue Jobs a breeze.
export SAGEWORKS_CONFIG=<your config> # Only if not already set up\npython my_glue_job.py --sageworks-bucket <your bucket>\n
"},{"location":"glue/#additional-resources","title":"Additional Resources","text":"SageWorks Lambda Layers
AWS Lambda Jobs are a great way to spin up data processing jobs. Follow this guide and empower AWS Lambda with SageWorks!
SageWorks makes creating, testing, and debugging of AWS Lambda Functions easy. The exact same SageWorks API Classes are used in your AWS Lambda Functions. Also, since SageWorks manages the access policies, you'll be able to test new Lambda Jobs locally and minimize surprises when deploying.
Work In Progress
The SageWorks Lambda Layers are a great way to use SageWorks but they are still in 'beta' mode so please let us know if you have any issues.
"},{"location":"lambda_layer/#lambda-job-setup","title":"Lambda Job Setup","text":"Setting up an AWS Lambda Job that uses SageWorks is straightforward. SageWorks can be 'installed' using a Lambda Layer and then you can use the SageWorks API just like normal.
Here are the ARNs for the current SageWorks Lambda Layers. Note that each ARN has the region and Python version in its name, so if your Lambda runs in us-east-1 on Python 3.12, pick the ARN with those values in it.
"},{"location":"lambda_layer/#python-312-if-you-need-another-versionregion-let-us-know","title":"Python 3.12 (if you need another version/region let us know)","text":"us-east-1
us-west-2
Note: If you're using Lambdas in a different region or with a different Python version, just let us know and we'll publish some additional layers.
At the bottom of the Lambda page there's an 'Add Layer' button. You can click that button and specify the layer using the ARN above. Also in the 'General Configuration' set these parameters:
Set the SAGEWORKS_BUCKET ENV: SageWorks will need to know what bucket to work out of, so go into the Configuration...Environment Variables... and add one for the SageWorks bucket that you are using for your AWS Account (dev, prod, etc).
Lambda Role Details
If your Lambda Function already uses an existing IAM Role then you can add the SageWorks policies to that Role to enable the Lambda Job to perform SageWorks API Tasks. See SageWorks Access Controls
"},{"location":"lambda_layer/#sageworks-lambda-example","title":"SageWorks Lambda Example","text":"Here's a simple example of using SageWorks in your Lambda Function.
SageWorks Layer is Compressed
The SageWorks Lambda Layer is compressed (to fit all the awesome). This means that the load_lambda_layer()
method must be called before using any other SageWorks imports, see the example below. If you do not do this you'll probably get a No module named 'numpy'
error or something like that.
import json\nfrom pprint import pprint\nfrom sageworks.utils.lambda_utils import load_lambda_layer\n\n# Load/Decompress the SageWorks Lambda Layer\nload_lambda_layer()\n\n# After 'load_lambda_layer()' we can use other SageWorks imports\nfrom sageworks.api import Meta\nfrom sageworks.api import Model\n\ndef lambda_handler(event, context):\n\n    # Create our Meta Class and get a list of our Models\n    meta = Meta()\n    models = meta.models()\n\n    print(f\"Number of Models: {len(models)}\")\n    print(models)\n\n    # Onboard a model\n    model = Model(\"abalone-regression\")\n    pprint(model.details())\n\n    # Return success\n    return {\n        'statusCode': 200,\n        'body': {\"incoming_event\": event}\n    }\n
"},{"location":"lambda_layer/#exception-log-forwarding","title":"Exception Log Forwarding","text":"When a Lambda Job crashes (has an exception), the AWS console shows only the last line of the exception, which is mostly useless. If you use SageWorks log forwarding, the exception/stack will be forwarded to CloudWatch.
from sageworks.utils.sageworks_logging import exception_log_forward\n\nwith exception_log_forward():\n <my lambda code>\n ...\n <exception happens>\n <more of my code>\n
The exception_log_forward
sets up a context manager that will trap exceptions and forward the exception/stack to CloudWatch for diagnosis. "},{"location":"lambda_layer/#lambda-function-local-testing","title":"Lambda Function Local Testing","text":"Lambda Power without the Pain. SageWorks manages the AWS Execution Role/Policies, so local API and Lambda Functions will have the same permissions/access. Also using the same Code as your notebooks or scripts makes creating and testing Lambda Functions a breeze.
python my_lambda_function.py --sageworks-bucket <your bucket>\n
"},{"location":"lambda_layer/#additional-resources","title":"Additional Resources","text":"Using SageWorks for ML Pipelines: SageWorks API Classes
Consulting Available: SuperCowPowers LLC
Artifact and Column Naming?
You might have noticed that SageWorks has some unintuitive constraints when naming Artifacts and restrictions on column names. All of these restrictions come from AWS. SageWorks uses Glue, Athena, Feature Store, Models and Endpoints; each of these services has its own constraints, and SageWorks simply 'reflects' those constraints.
"},{"location":"misc/faq/#naming-underscores-dashes-and-lower-case","title":"Naming: Underscores, Dashes, and Lower Case","text":"Data Sources and Feature Sets must adhere to AWS restrictions on table names and column names (here is a snippet from the AWS documentation)
Database, table, and column names
When you create schema in AWS Glue to query in Athena, consider the following:
A database name cannot be longer than 255 characters. A table name cannot be longer than 255 characters. A column name cannot be longer than 255 characters.
The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more info see: Glue Best Practices
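The rules quoted above (lowercase letters, numbers, underscores, at most 255 characters) can be checked before you ever call AWS. Here is a minimal sketch; the function name is_valid_glue_name is hypothetical and is not part of the SageWorks API.

```python
import re

# Lowercase letters, digits, and underscores, 1 to 255 characters long
VALID_NAME = re.compile(r'[a-z0-9_]{1,255}')


def is_valid_glue_name(name):
    '''Return True if name fits the quoted character and length rules.'''
    return VALID_NAME.fullmatch(name) is not None


print(is_valid_glue_name('abalone_features'))   # True
print(is_valid_glue_name('Abalone-Features'))   # False
```

A check like this is handy for column names coming out of a CSV or DataFrame, where mixed case and dashes are common.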
"},{"location":"misc/faq/#datasourcefeatureset-use-_-and-modelendpoint-use-","title":"DataSource/FeatureSet use '_' and Model/Endpoint use '-'","text":"You may notice that DataSource and FeatureSet uuid/name examples have underscores but the model and endpoints have dashes. Yes, it\u2019s super annoying to have one convention for DataSources and FeatureSets and another for Models and Endpoints but this is an AWS restriction and not something that SageWorks can control.
DataSources and FeatureSets: Underscores. You cannot use a dash because both classes use Athena for storage and Athena table names cannot have a dash.
Models and Endpoints: Dashes. You cannot use an underscore because AWS imposes a restriction on the naming.
"},{"location":"misc/faq/#additional-information-on-the-lower-case-issue","title":"Additional information on the lower case issue","text":"We\u2019ve tried to create a glue table with Mixed Case column names and haven\u2019t had any luck. We\u2019ve bypassed wrangler and used the boto3 low level calls directly. In all cases when it shows up in the Glue Table the columns have always been converted to lower case. We've also tried using the Athena DDL directly; that also doesn't work. Here's the relevant AWS documentation and the two scripts that reproduce the issue.
AWS Docs
Scripts to Reproduce
SageWorks is a medium granularity framework that manages and aggregates AWS\u00ae Services into classes and concepts. When you use SageWorks you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and managing a complex set of AWS Services. All the power and none of the pain so that your team can Do Science Faster!
"},{"location":"misc/general_info/#sageworks-documentation","title":"SageWorks Documentation","text":"See our Python API and AWS documentation here: SageWorks Documentation
"},{"location":"misc/general_info/#full-sageworks-overview","title":"Full SageWorks OverView","text":"SageWorks Architected FrameWork
"},{"location":"misc/general_info/#why-sageworks","title":"Why SageWorks?","text":"Visibility into the AWS Services that underpin the SageWorks Classes. We can see that SageWorks automatically tags and tracks the inputs of all artifacts providing 'data provenance' for all steps in the AWS modeling pipeline.
Image TBD
Clearly illustrated: SageWorks provides intuitive and transparent visibility into the full pipeline of your AWS SageMaker Deployments.
"},{"location":"misc/general_info/#getting-started","title":"Getting Started","text":"The SageWorks Classes are organized to work in concert with AWS Services. For more details on the current classes and class hierarchies see SageWorks Classes and Concepts.
"},{"location":"misc/general_info/#contributions","title":"Contributions","text":"If you'd like to contribute to the SageWorks project, you're more than welcome. All contributions will fall under the existing project license. If you are interested in contributing or have questions please feel free to contact us at sageworks@supercowpowers.com.
"},{"location":"misc/general_info/#sageworks-alpha-testers-wanted","title":"SageWorks Alpha Testers Wanted","text":"Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.
"},{"location":"misc/sageworks_classes_concepts/","title":"SageWorks Classes and Concepts","text":"A flexible, rapid, and customizable AWS\u00ae ML Sandbox. Here's some of the classes and concepts we use in the SageWorks system:
Endpoint
Transforms
Our experienced team can provide development and consulting services to help you effectively use Amazon\u2019s Machine Learning services within your organization.
The popularity of cloud based Machine Learning services is booming. The problem many companies face is how that capability gets effectively used and harnessed to drive real business decisions and provide concrete value for their organization.
Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at sageworks@supercowpowers.com.
"},{"location":"misc/scp_consulting/#typical-engagements","title":"Typical Engagements","text":"SageWorks clients typically want a tailored web_interface that helps to drive business decisions and provides value for their organization.
Rapid Prototyping is typically done via these steps.
Quick Construction of Web Interface (tailored)
Goto Step 1
When the client is happy/excited about the prototype, we then bolt down the system, test the heavy paths, review AWS access and security, and ensure 'least privileged' roles and policies.
Contact us for a free initial consultation on how we can accelerate the use of AWS ML at your company: sageworks@supercowpowers.com.
"},{"location":"plugins/","title":"OverView","text":"SageWorks Plugins
The SageWorks toolkit provides a flexible plugin architecture to expand, enhance, or even replace the Dashboard. Make custom UI components, views, and entire pages with the plugin classes described here.
The SageWorks Plugin system allows clients to customize how their AWS Machine Learning Pipeline is displayed, analyzed, and visualized. Our easy to use Python API enables developers to make new Dash/Plotly components, data views, and entirely new web pages focused on business use cases.
"},{"location":"plugins/#concept-docs","title":"Concept Docs","text":"Many classes in SageWorks need additional high-level material that covers class design and illustrates class usage. Here's the Concept Docs for Plugins:
Each plugin class inherits from the SageWorks PluginInterface class and needs to set two attributes and implement two methods. These requirements are set so that each Plugin will conform to the SageWorks infrastructure; if the required attributes and methods aren\u2019t included in the class definition, errors will be raised during tests and at runtime.
Note: For full code see Model Plugin Example
class ModelPlugin(PluginInterface):\n    \"\"\"MyModelPlugin Component\"\"\"\n\n    \"\"\"Initialize this Plugin Component \"\"\"\n    auto_load_page = PluginPage.MODEL\n    plugin_input_type = PluginInputType.MODEL\n\n    def create_component(self, component_id: str) -> dcc.Graph:\n        \"\"\"Create the container for this component\n        Args:\n            component_id (str): The ID of the web component\n        Returns:\n            dcc.Graph: The EndpointTurbo Component\n        \"\"\"\n        self.component_id = component_id\n        self.container = dcc.Graph(id=component_id, ...)\n\n        # Fill in plugin properties\n        self.properties = [(self.component_id, \"figure\")]\n\n        # Return the container\n        return self.container\n\n    def update_properties(self, model: Model, **kwargs) -> list:\n        \"\"\"Update the properties for the plugin.\n\n        Args:\n            model (Model): An instantiated Model object\n            **kwargs: Additional keyword arguments\n\n        Returns:\n            list: A list of the updated property values\n        \"\"\"\n\n        # Create a pie chart with the endpoint name as the title\n        pie_figure = go.Figure(data=..., ...)\n\n        # Return the updated property values for the plugin\n        return [pie_figure]\n
"},{"location":"plugins/#required-attributes","title":"Required Attributes","text":"The class variable plugin_page determines what type of plugin the MyPlugin class is. This variable is inspected during plugin loading at runtime in order to load the plugin to the correct artifact page in the SageWorks dashboard. The PluginPage class can be DATA_SOURCE, FEATURE_SET, MODEL, or ENDPOINT.
"},{"location":"plugins/#s3-bucket-plugins-work-in-progress","title":"S3 Bucket Plugins (Work in Progress)","text":"Note: This functionality is coming soon
Offers the most flexibility and fast prototyping. Simply set your config/env for blah to an S3 Path and SageWorks will load the plugins from S3 directly.
Helpful Tip
You can copy files from your local system up to S3 with this handy AWS CLI call
aws s3 cp . s3://my-sageworks/sageworks_plugins \\\n --recursive --exclude \"*\" --include \"*.py\"\n
"},{"location":"plugins/#additional-resources","title":"Additional Resources","text":"Need help with plugins? Want to develop a customized application tailored to your business needs?
The SageWorks framework makes AWS\u00ae both easier to use and more powerful. SageWorks handles all the details around updating and managing a complex set of AWS Services. With a simple-to-use Python API and a beautiful set of web interfaces, SageWorks makes creating AWS ML pipelines a snap.
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"presentations/#sageworks-presentations_1","title":"SageWorks Presentations","text":"The SageWorks API documentation SageWorks API covers our in-depth Python API and contains code examples. The code examples are provided in the Github repo examples/
directory. For a full code listing of any example please visit our SageWorks Examples
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
\u00ae Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates
"},{"location":"release_notes/0_7_8/","title":"Release 0.7.8","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on [Discord](https://discord.gg/WHAJuz8sw8)
Since we've recently introduced a View() class for DataSources and FeatureSets we needed to rename a few classes/modules.
"},{"location":"release_notes/0_7_8/#featuresets","title":"FeatureSets","text":"For setting holdout ids we've changed/combined to just one method set_training_holdouts()
, so if you're using create_training_view()
or set_holdout_ids()
you can now just use the unified method set_training_holdouts()
.
There's also a change to getting the training view table method.
old: fs.get_training_view_table(create=False)\nnew: fs.get_training_view_table(), does not need the create=False\n
"},{"location":"release_notes/0_7_8/#models","title":"Models","text":"inference_predictions() --> get_inference_predictions()\n
"},{"location":"release_notes/0_7_8/#webplugins","title":"Web/Plugins","text":"We've changed the Web/UI View class to 'WebView'. So anywhere where you used to have view just replace with web_view
from sageworks.views.artifacts_view import ArtifactsView\n
is now from sageworks.web_views.artifacts_web_view import ArtifactsWebView\n
"},{"location":"release_notes/0_7_8/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_11/","title":"Release 0.8.11","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover all the changes from 0.8.8
to 0.8.11
The AWSAccountClamp had too many responsibilities so that class has been split up into two classes and a set of utilities:
Most of these API changes apply to both DataSources and FeatureSets. We're using a FeatureSet (fs) in the examples below, but the same changes apply to DataSources.
Column Names/Table Names
fs.column_names() -> fs.columns\nfs.get_table_name() -> fs.table_name\n
Display/Training/Computation Views
In general methods for FS/DS are now part of the View API, here's a change list:
fs.get_display_view() -> fs.view(\"display\")\nfs.get_training_view() -> fs.view(\"training\")\nfs.get_display_columns() -> fs.view(\"display\").columns\nfs.get_computation_columns() -> fs.view(\"computation\").columns\nfs.get_training_view_table() -> fs.view(\"training\").table_name\nfs.get_training_data() -> fs.view(\"training\").pull_dataframe()\n
Some FS/DS methods have also been removed
num_display_columns() -> gone num_computation_columns() -> gone
Views: Methods that we're Keeping
We're keeping the methods below since they handle some underlying mechanics and serve as nice convenience methods.
ds/fs.set_display_columns()\nds/fs.set_computation_columns()\n
AWSAccountClamp
AWSAccountClamp().boto_session() --> AWSAccountClamp().boto3_session\n
All Classes
If the class previously had a boto_session
attribute that has been renamed to boto3_session
For sageworks==0.8.8
you needed to be careful about when/where you set your config/ENV vars. With >=0.8.9
you can now use the typical setup like this:
from sageworks.utils.config_manager import ConfigManager\n\n# Set the SageWorks Config\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", args_dict[\"sageworks-bucket\"])\ncm.set_config(\"REDIS_HOST\", args_dict[\"redis-host\"])\n
"},{"location":"release_notes/0_8_11/#robust-modelnotreadyexception-handling","title":"Robust ModelNotReadyException Handling","text":"AWS will 'deep freeze' Serverless Endpoints, and if an endpoint hasn't been used for a while it can sometimes take a long time to come up and be ready for inference. SageWorks now properly manages this AWS error condition: it reports the issue, waits 60 seconds, and retries up to 5 times before raising the exception.
(endpoint_core.py:502) ERROR Endpoint model not ready\n(endpoint_core.py:503) ERROR Waiting and Retrying...\n...\nAfter a while, inference will run successfully :)\n
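The wait-and-retry behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: ModelNotReadyError, call_with_retries, and flaky_endpoint are hypothetical names, and the real SageWorks code in endpoint_core.py may be structured differently.

```python
import time


class ModelNotReadyError(Exception):
    '''Placeholder for the not-ready error raised while an endpoint warms up.'''


def call_with_retries(invoke, retries=5, wait_seconds=60, sleep=time.sleep):
    '''Call invoke(); on ModelNotReadyError wait and retry, then give up.'''
    for attempt in range(retries + 1):
        try:
            return invoke()
        except ModelNotReadyError:
            if attempt == retries:
                raise  # out of retries, surface the error to the caller
            print('Endpoint model not ready, waiting and retrying...')
            sleep(wait_seconds)


# Simulated endpoint that only answers on the third call
calls = {'count': 0}


def flaky_endpoint():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ModelNotReadyError('still warming up')
    return 'predictions'


# Record the requested waits instead of really sleeping
waits = []
result = call_with_retries(flaky_endpoint, sleep=waits.append)
print(result, waits)
```

Passing the sleep function in as a parameter keeps the example fast to run and makes the retry schedule easy to inspect.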
"},{"location":"release_notes/0_8_11/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord
"},{"location":"release_notes/0_8_20/","title":"Release 0.8.20","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.11
to 0.8.20
The cloud_watch
AWS log aggregator is now officially awesome. It provides a fairly sophisticated way of doing both broad scanning and deep dives on individual streams. Please see our Cloud Watch documentation.
The View classes have finished their refactoring. The 'read' class View()
can be constructed either directly or with the ds/fs.view(\"display\")
methods. See Views for more details. There is also a set of classes for constructing views; please see View Overview
Table Name attribute
The table_name
attribute/property has been replaced with just table
ds.table_name -> ds.table\nfs.table_name -> fs.table\nview.table_name -> view.table\n
Endpoint Confusion Matrix
The endpoint
class had a method called confusion_matrix()
which has been renamed to the more descriptive generate_confusion_matrix()
. Note: The model method, of the same name, has NOT changed.
end.confusion_matrix() -> end.generate_confusion_matrix()\nmodel.confusion_matrix() == no change\n
Fixed: There was a corner case where if you had the following sequence:
set_training_holdouts()
The corner case was a race-condition where the FeatureSet would not 'know' that a training view was already there and would create a default training view.
"},{"location":"release_notes/0_8_20/#improvements","title":"Improvements","text":"The log messages that you receive on a plugin validation failure should now be more distinguishable and more informative. They will look like this and in some cases even tell you the line to look at.
ERROR Plugin 'MyPlugin' failed validation:\nERROR File: ../sageworks_plugins/web_components/my_plugin.py\nERROR Class: MyPlugin\nERROR Details: my_plugin.py (line 35): Incorrect return type for update_properties (expected list, got Figure)\n
"},{"location":"release_notes/0_8_20/#internal-api-changes","title":"Internal API Changes","text":"In theory these API changes should not affect end users of the SageWorks API but are documented here for completeness.
The internal method used by Artifact subclasses has changed names from ensure_valid_name
to is_name_valid
. We've also introduced an optional argument to turn lowercase enforcement on or off; this will be used later when we support uppercase for models, endpoints, and graphs.
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_22/","title":"Release 0.8.22","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.20
to 0.8.22
Mostly bug fixes and minor API changes.
"},{"location":"release_notes/0_8_22/#api-changes","title":"API Changes","text":"Removing target_column
arg when creating FeatureSets
When creating a FeatureSet via DataSource or Pandas Dataframe there was an optional argument for the target_column
. After some discussion we decided to remove this argument. In general FeatureSets
are often used to create multiple models with different targets, so it doesn't make sense to specify a target
at the FeatureSet level.
Changed for both DataSource.to_features()
and the PandasToFeatures()
classes.
Fixed: The SHAP computation was occasionally complaining about the additivity check so we flipped that flag to False
shap_vals = explainer.shap_values(X_pred, check_additivity=False)\n
"},{"location":"release_notes/0_8_22/#improvements","title":"Improvements","text":"The optional requirements for [UI]
now include matplotlib since it will probably be useful in the future.
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_23/","title":"Release 0.8.23","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.22
to 0.8.23
Mostly bug fixes and minor API changes.
"},{"location":"release_notes/0_8_23/#api-changes","title":"API Changes","text":"Removing auto_one_hot
arg from PandasToFeatures
and DataSource.to_features()
When creating a PandasToFeatures
object or using DataSource.to_features()
there was an optional argument auto_one_hot
. This would try to automatically convert object/string columns to be one-hot encoded. In general this was only useful for 'toy' datasets but for more complex data we need to specify exactly which columns we want converted.
Adding optional one_hot_columns
arg to PandasToFeatures.set_input()
and DataSource.to_features()
When calling either of these FeatureSet creation methods you can now add an optional arg one_hot_columns
as a list of columns that you would like to be one-hot encoded.
Our pandas dependency was outdated and causing an issue with an include_groups
arg when outlier groups were computed. We've changed the requirements:
pandas>=2.1.2\nto\npandas>=2.2.1\n
We also have a ticket for the logic change so that we avoid the deprecation warning."},{"location":"release_notes/0_8_23/#improvements","title":"Improvements","text":"The time to ingest
new rows into a FeatureSet can take a LONG time. Calling the FeatureGroup AWS API and waiting on the results is what takes all the time.
There will hopefully be a series of optimizations around this process, the first one is simply increasing the number of workers/processes for the ingestion manager class.
feature_group.ingest(..., max_processes=8)\n(has been changed to)\nfeature_group.ingest(..., max_processes=16, num_workers=4)\n
"},{"location":"release_notes/0_8_23/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_27/","title":"Release 0.8.27","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.23
to 0.8.27
KNNSpider() --> FeatureSpaceProximity()
If you were previously using the KNNSpider
that class has been replaced with FeatureSpaceProximity
. The API is also a bit different; please see the documentation on the FeatureSpaceProximity Class.
The model scripts used in deployed AWS Endpoints are now case-insensitive. In general this should make the use of the endpoints a bit more flexible for End-User Applications to hit the endpoints with less pre-processing of their column names.
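As an illustration of what case-insensitive column handling can look like, here is a minimal sketch. The helper match_columns is hypothetical and is not taken from the SageWorks model scripts; it only shows one way to line up incoming column names with the feature names a model expects.

```python
def match_columns(incoming_columns, model_features):
    '''Map each model feature to the incoming column that matches it,
    ignoring case. Hypothetical helper for illustration only.'''
    lookup = {col.lower(): col for col in incoming_columns}
    mapping = {}
    for feature in model_features:
        source = lookup.get(feature.lower())
        if source is None:
            # No incoming column matches this feature under any casing
            raise KeyError(feature)
        mapping[feature] = source
    return mapping


print(match_columns(['Length', 'DIAMETER', 'id'], ['length', 'diameter']))
```

With a mapping like this, a client application can send columns in whatever casing it already uses.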
CloudWatch default buffers have been increased to 60 seconds, as we appear to have been hitting some AWS limits when running 10 concurrent Glue Jobs.
"},{"location":"release_notes/0_8_27/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_29/","title":"Release 0.8.29","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. So please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.27
to 0.8.29
Locking AWS Model Training Image: AWS will randomly update the images associated with training and model registration. In particular the SKLearn Estimator has been updated into a non-working state for our use cases. So for both training and registration we're now explicitly specifying the image that we want to use.
self.estimator = SKLearn(\n ...\n framework_version=\"1.2-1\",\n image_uri=image, # New\n )\n
"},{"location":"release_notes/0_8_29/#api-changes","title":"API Changes","text":"delete() --> class.delete(uuid)
We've changed the API for deleting artifacts in AWS (DataSource, FeatureSet, etc). This is part of our efforts to minimize race-conditions when objects are deleted.
my_model = Model(\"xyz\") # Creating object\nmy_model.delete() # just to delete\n\n<Now just one line>\nModel.delete(\"xyz\") # Delete\n
Bulk Delete: Added a Bulk Delete utility
from sageworks.utils.bulk_utils import bulk_delete\n\ndelete_list = [(\"DataSource\", \"abc\"), (\"FeatureSet\", \"abc_features\")]\nbulk_delete(delete_list)\n
"},{"location":"release_notes/0_8_29/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_33/","title":"Release 0.8.33","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.29
to 0.8.33
Replaced WatchTower Code: We had lots of issues with WatchTower on Glue/Lambda. The use of forks/threads was overkill for our logging needs, so we simply replaced the code with boto3 put_log_events() calls and some simple token handling and buffering.
None
"},{"location":"release_notes/0_8_33/#improvementsfixes","title":"Improvements/Fixes","text":"DataSource from DataFrame: When creating a DataSource from a Pandas Dataframe, the internal transform()
was not deleting the existing DataSource (if it existed).
ROCAUC on subset of classes: When running inference on input data that only had a subset of the classification labels (e.g. rows only had \"low\" and \"medium\" when the model was trained on \"low\", \"medium\", \"high\"), the input to ROCAUC needed to be adjusted so that ROCAUC doesn't crash. When this case happens we now return proper defaults based on the scikit-learn docs.
"},{"location":"release_notes/0_8_33/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_35/","title":"Release 0.8.35","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.33
to 0.8.35
SageWorks REPL: The REPL now has a workaround for the current iPython embedded shell namespace scoping issue. See: iPython Embedded Shell Scoping Issue. So this pretty much means the REPL is 110% more awesome now!
"},{"location":"release_notes/0_8_35/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_35/#improvementsfixes","title":"Improvements/Fixes","text":"AWS Service Broker: The AWS service broker was being dramatic when it pulled metadata for something that had just been deleted (or partially deleted): it was throwing CRITICAL log messages. We've refined the AWS error handling so that it's more granular about the error_codes; Validation and ResourceNotFound exceptions are now reduced to WARNINGS.
ROCAUC modifications: Version 0.8.33 put in quite a few changes. For 0.8.35 we've also added logic to both validate and ensure the proper order of the probability columns with the class labels.
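To make the idea of validating and ordering probability columns concrete, here is a minimal sketch. The function name align_probabilities and the sample values are hypothetical; this is only an illustration of the general technique, not the SageWorks implementation.

```python
# Hypothetical sketch: reorder per-class probabilities so that each column
# lines up with the order of the class labels.
def align_probabilities(rows, prob_labels, class_labels):
    # Validate that both sides describe exactly the same set of classes
    if sorted(prob_labels) != sorted(class_labels):
        raise ValueError('Probability columns do not match class labels')
    # Position of each class label within the incoming probability rows
    index = [prob_labels.index(label) for label in class_labels]
    return [[row[i] for i in index] for row in rows]


rows = [[0.7, 0.1, 0.2], [0.2, 0.3, 0.5]]
aligned = align_probabilities(rows, ['high', 'low', 'medium'], ['low', 'medium', 'high'])
print(aligned)
```

Raising on a mismatch, rather than silently reordering, keeps a mislabeled probability column from producing a plausible but wrong metric.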
Code Diff v0.8.33 --> v0.8.35
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a cow with no legs? .... Ground beef.
"},{"location":"release_notes/0_8_35/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_36/","title":"Release 0.8.36","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.35
to 0.8.36
Fast Inference: The current inference method for endpoints provides error handling, metrics calculations and capture mechanics. There are use cases where the inference needs to happen as fast as possible without all the additional features. So we've added a fast_inference()
method that streamlines the calls to the endpoint.
end = Endpoint(\"my_endpoint\")\nend.inference(df) # Metrics, Capture, Error Handling\nWall time: 5.07 s\n\nend.fast_inference(df) # No frills, but Fast!\nWall time: 308 ms\n
"},{"location":"release_notes/0_8_36/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_36/#improvementsfixes","title":"Improvements/Fixes","text":"Version Update Check: Added functionality that checks the current SageWorks version against the latest released and gives a log message for update available.
ROCAUC modifications: Functionality now includes 'per label' rocauc calculation along with label order and alignment from previous versions.
"},{"location":"release_notes/0_8_36/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.35 --> v0.8.36
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What\u2019s a cow\u2019s best subject in school? .... Cow-culus.
"},{"location":"release_notes/0_8_36/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_39/","title":"Release 0.8.39","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.36
to 0.8.39
Just a small set of error handling and bug fixes.
"},{"location":"release_notes/0_8_39/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_39/#improvementsfixes","title":"Improvements/Fixes","text":"Scatter Plot: Fixed a corner case where the hover columns included AWS generated fields.
Athena Queries: Put in additional error handling and retries when looking for and querying Athena/Glue Catalogs. These changes affect both DataSources and FeatureSets (which have DataSources internally for offline storage).
FeatureSet Creation: Put in additional error handling and retries when pulling AWS meta data for FeatureSets (and internal DataSources).
"},{"location":"release_notes/0_8_39/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.36 --> v0.8.39
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_39/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_42/","title":"Release 0.8.42","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.39
to 0.8.42
Artifact deletion got a substantial overhaul. The 4 main classes received internal code changes for how they get deleted. Specifically, deletion is now handled via a class method that allows an artifact to be deleted without instantiating an object. The API for deletion is actually more flexible now; please see API Changes below.
"},{"location":"release_notes/0_8_42/#api-changes","title":"API Changes","text":"Artifact Deletion
The API for Artifact deletion is more flexible. If you already have an instantiated object, you can simply call delete() on it. If you're deleting an object in bulk/batch mode, you can call the class method managed_delete(); see the code example below.
fs = FeatureSet(\"my_fs\")\nfs.delete() # Used for notebooks, scripts, etc.. \nOR\nFeatureSet.managed_delete(\"my_fs\") # Bulk/batch/internal use\n\n<Same API for DataSources, Models, and Endpoints>\n
Note: Internally these use the same functionality, the dual API is simply for ease-of-use."},{"location":"release_notes/0_8_42/#improvementsfixes","title":"Improvements/Fixes","text":"Race Conditions
In theory, the change to a class-based delete will reduce race conditions where an object would try to create itself (just to be deleted) while the AWS Service Broker was encountering partially created (or partially deleted) objects and would barf error messages.
Slightly Better Throttling Logic
The AWS Throttles have been 'tuned' a bit to back off a bit faster and also not retry the list_tags request when the ARN isn't found.
"},{"location":"release_notes/0_8_42/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.39 --> v0.8.42
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_42/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_46/","title":"Release 0.8.46","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.42
to 0.8.46
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We're starting to put in deprecation warnings as we streamline classes and APIs. If you're using a class or method that's going to be deprecated you'll see a log message like this:
my_class = SomeOldClass()\nWARNING SomeOldClass is deprecated and will be removed in version 0.9.\n
In general these warning messages will be annoying, but they will help us smoothly transition and streamline our Classes and APIs.
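For readers curious how a warning like this can be produced, here is a minimal sketch using a class decorator and the standard logging module. The decorator name deprecated and the logger name are hypothetical; the SageWorks internals may be implemented differently.

```python
import logging

log = logging.getLogger('deprecation_sketch')


# Hypothetical decorator (the real SageWorks helper may differ): logs a
# warning every time a deprecated class is instantiated.
def deprecated(version):
    def wrap(cls):
        original_init = cls.__init__

        def new_init(self, *args, **kwargs):
            log.warning('%s is deprecated and will be removed in version %s.', cls.__name__, version)
            original_init(self, *args, **kwargs)

        cls.__init__ = new_init
        return cls

    return wrap


@deprecated('0.9')
class SomeOldClass:
    def __init__(self):
        self.ready = True


my_class = SomeOldClass()  # logs the deprecation warning
```

Wrapping the constructor keeps the class fully usable while making every remaining call site visible in the logs.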
"},{"location":"release_notes/0_8_46/#deprecations","title":"Deprecations","text":"Meta()
The new Meta() class will provide an API that aligns with the AWS list and describe APIs. We'll have functionality for listing objects (models, feature sets, etc) and then functionality around the details for a named artifact.
meta = Meta()\nmodels_list = meta.models() # List API\nend_list = meta.endpoints() # List API\n\nfs_dict = meta.feature_set(\"my_fs\") # Describe API\nmodel_dict = meta.model(\"my_model\") # Describe API\n
For more details see: Meta Class
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
Artifact Classes
The artifact classes (DataSource, FeatureSet, Model, Endpoint) have had some old arguments removed.
DataSource(force_refresh=True) -> Gone (remove it)\nFeatureSet(force_refresh=True) -> Gone (remove it)\nModel(force_refresh=True) -> Gone (remove it)\nModel(legacy=True) -> Gone (remove it)\n
"},{"location":"release_notes/0_8_46/#improvements","title":"Improvements","text":"Scalability
The changes to caching and the Meta() class should allow better horizontal scaling. We'll flex out the stress tests for upcoming releases before 0.9.0.
Table Names starting with Numbers
Some of the Athena queries didn't properly escape the table names, and if you created a DataSource/FeatureSet with a name that started with a number the query would fail. Fixed now. :)
"},{"location":"release_notes/0_8_46/#internal-changes","title":"Internal Changes","text":"Meta()
Meta()
doesn't do any caching now. If you want to use Caching as part of your meta data retrieval use the CachedMeta()
class.
Artifacts
We've gotten rid of most (soon all) caching for individual Artifacts. If you're constructing an artifact object, you probably want detailed information that's 'up to date', and waiting a bit is probably fine. Note: We'll still make these instantiations as fast as we can; removing the caching logic will at least simplify the implementations.
"},{"location":"release_notes/0_8_46/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.42 --> v0.8.46
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_46/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_50/","title":"Release 0.8.50","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.46
to 0.8.50
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We're going to lock in id_columns when FeatureSets are created. AWS FeatureGroup requires an id column, so this is the best place to do it; see API Changes below.
"},{"location":"release_notes/0_8_50/#featureset-robust-handling-of-training-column","title":"FeatureSet: Robust handling of training column","text":"In the past we haven't supported giving a training column as input data. FeatureSets are read-only, so locking in the training rows is 'suboptimal'. In general you might want to use the FeatureSet for several models with different training/hold_out sets. Now if a FeatureSet detects a training column it will give the following message:
Training column detected: Since FeatureSets are read only, SageWorks \ncreates training views that can be dynamically changed. We'll use \nthis training column to create a training view.\n
"},{"location":"release_notes/0_8_50/#endpoint-auto_inference","title":"Endpoint: auto_inference()","text":"We're changing the internal logic for the auto_inference()
method to include the id_column in its output.
FeatureSet
When creating a FeatureSet the id_column
is now a required argument.
ds = DataSource(\"test_data\")\nfs = ds.to_features(\"test_features\", id_column=\"my_id\") <-- Required\n
to_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"my_id\") <-- Required\nto_features.set_output_tags([\"blah\", \"whatever\"])\nto_features.transform()\n
If your data doesn't have an id column you can specify \"auto\"
to_features = PandasToFeatures(\"my_feature_set\")\nto_features.set_input(df_features, id_column=\"auto\") <-- Auto Id (index)\n
For more details see: FeatureSet Class
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
"},{"location":"release_notes/0_8_50/#improvements","title":"Improvements","text":"DFStore
Robust handling of slashes, so now it will 'just work' with various upserts and gets:
```\n# These all give you the /ml/shap_values dataframe\ndf_store.get(\"/ml/shap_values\")\ndf_store.get(\"ml/shap_values\")\ndf_store.get(\"//ml/shap_values\")\n```\n
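One simple way to get this kind of 'just works' behavior is to normalize every location before using it as a key. The sketch below is only an illustration under that assumption; normalize_location is a hypothetical helper name and not part of the DFStore API.

```python
# Hypothetical helper: collapse repeated, leading, or missing slashes into
# one canonical form with a single leading slash.
def normalize_location(location):
    # Splitting on '/' and dropping empty parts removes duplicate slashes
    parts = [part for part in location.split('/') if part]
    return '/' + '/'.join(parts)


for raw in ['/ml/shap_values', 'ml/shap_values', '//ml/shap_values']:
    print(normalize_location(raw))
```

Because all three spellings reduce to the same canonical string, upserts and gets that differ only in their slashes resolve to the same stored dataframe.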
"},{"location":"release_notes/0_8_50/#internal-changes","title":"Internal Changes","text":"There's a whole new directory structure that helps isolate Cloud Platform specific funcitonality.
- sageworks/src\n - core/cloud_platform\n - aws\n - azure\n - gcp\n
DFStore now uses AWSDFStore as its concrete implementation class.
CachedMeta and AWSAccountClamp have had a revamp of their singleton logic.
So as part of our v0.9.0 Roadmap we're continuing to revamp caching. We're experimenting with the CachedMeta class inside the Artifact classes. Caching continues to be challenging for the framework: it's an absolute must for Web Interface/UI performance, and then it needs to get out of the way for batch jobs and the concurrent building of ML pipelines.
"},{"location":"release_notes/0_8_50/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.46 --> v0.8.50
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_50/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_55/","title":"Release 0.8.55","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.50
to 0.8.55
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We got a good suggestion from one of our beta customers to change the training column to use True/False values instead of 1/0. Having boolean values makes semantic sense and makes filtering easier and more intuitive.
"},{"location":"release_notes/0_8_55/#api-changes","title":"API Changes","text":"FeatureSet Queries
Since the training column now contains True/False, any code where you're doing a query against the training view needs to change.
fs.query(f'SELECT * FROM \"{table}\" where training = 1')\n<changed to>\nfs.query(f'SELECT * FROM \"{table}\" where training = TRUE')\n\nfs.query(f'SELECT * FROM \"{table}\" where training = 0')\n<changed to>\nfs.query(f'SELECT * FROM \"{table}\" where training = FALSE')\n
DataFrame filtering is also easier now, so if you have a call that filters the dataframe on the training column, that needs to change as well.
df_train = all_df[all_df[\"training\"] == 1].copy()\n<changed to>\ndf_train = all_df[all_df[\"training\"]].copy()\n\ndf_val = all_df[all_df[\"training\"] == 0].copy()\n<changed to>\ndf_val = all_df[~all_df[\"training\"]].copy()\n
For more details see: Training View
Model Instantiation
We got a request to reduce the time for Model() object instantiation. So we created a new CachedModel()
class that is much faster to instantiate.
%time Model(\"abalone-regression\")\nCPU times: user 227 ms, sys: 19.5 ms, total: 246 ms\nWall time: 2.97 s\n\n%time CachedModel(\"abalone-regression\")\nCPU times: user 8.83 ms, sys: 2.64 ms, total: 11.5 ms\nWall time: 22.7 ms\n
For more details see: CachedModel"},{"location":"release_notes/0_8_55/#improvements","title":"Improvements","text":"SageWorks REPL Onboarding
At some point the onboarding with SageWorks REPL got broken and wasn't properly responding when the user didn't have a complete AWS/SageWorks setup.
"},{"location":"release_notes/0_8_55/#internal-changes","title":"Internal Changes","text":"The decorator for the CachedMeta class did not work properly in Python 3.9 so had to be slightly refactored.
"},{"location":"release_notes/0_8_55/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.50 --> v0.8.55
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
That feeling like you\u2019ve done this before? .... Deja-moo
"},{"location":"release_notes/0_8_55/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_58/","title":"Release 0.8.58","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.55
to 0.8.58
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We've created a new set of Cached Classes:
As part of this there's now a sageworks/cached directory that houses these classes and the CachedMeta class.
Meta Imports
Yes, this changed AGAIN :)
from sageworks.meta import Meta\n<change to>\nfrom sageworks.api import Meta\n
CachedModel Import
from sageworks.api import CachedModel\n<change to>\nfrom sageworks.cached.cached_model import CachedModel\n
For more details see: CachedModel"},{"location":"release_notes/0_8_58/#improvements","title":"Improvements","text":"Dashboard Responsiveness
The whole point of these Cached Classes is to improve Dashboard/Web Interface responsiveness. The Dashboard uses both the CachedMeta and Cached(Artifact) classes to make both overview and drilldowns faster.
Supporting a Client Use Case
There was a use case where a set of plugin pages needed to iterate over all the models to gather and aggregate information. We've supported that use case with a new decorator that avoids overloading AWS and running into throttling issues.
Internal
The Dashboard now refreshes all data every 90 seconds, so if you don't see your new model on the dashboard... just wait longer. :)
"},{"location":"release_notes/0_8_58/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.55 --> v0.8.58
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a nonsense meeting? .... Moo-larkey
"},{"location":"release_notes/0_8_58/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_6/","title":"Release 0.8.6","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines. We've also fixed various corner cases mostly around 'half constructed' AWS artifacts (models/endpoints).
"},{"location":"release_notes/0_8_6/#additional-functionality","title":"Additional Functionality","text":"Model to Endpoint under AWS Throttle
Fixed a corner case where the to_endpoint() method would fail when not 'knowing' the model input. This happened when AWS was throttling responses and the get_input() of the Endpoint returned unknown, which caused a NoneType error when using the 'unknown' model.
Empty Model Package Groups
There are cases where customers might construct a Model Package Group (MPG) container and not put any Model Packages in that Group. SageWorks has assumed that all MPGs would have at least one model package. The current 'support' for empty MPGs treats it as an error condition, but the API tries to accommodate the condition and will properly display the model group. The group will indicate that it's 'empty' and provide an alert health icon.
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_60/","title":"Release 0.8.60","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.58
to 0.8.60
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We've now exposed additional functionality and API around adding your own custom models. The new custom model support is documented on the Features to Models page.
"},{"location":"release_notes/0_8_60/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_60/#notes","title":"Notes","text":"Custom models introduce models that don't have model metrics or inference runs, so you'll see a lot of log messages complaining about not finding metrics or inference results, please just ignore those, we'll put in additional logic to address those cases.
"},{"location":"release_notes/0_8_60/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.58 --> v0.8.60
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call a nonsense meeting? .... Moo-larkey
"},{"location":"release_notes/0_8_60/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_71/","title":"Release 0.8.71","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.60
to 0.8.71
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
We learned that thread safety is good when using plugin classes. We had a model plugin class that was setting an attribute in one callback and then using that attribute in another callback, this mostly worked until it didn't. Anyway so the Inference Run dropdown box on the Models page now actually works correctly.
"},{"location":"release_notes/0_8_71/#api-changes","title":"API Changes","text":"None
"},{"location":"release_notes/0_8_71/#internal-changes","title":"Internal Changes","text":"When using PandasToFeatures it will overwrite FeatureSets if you give the same name. This behavior is expected. The issue was that it was super eager about doing that and would do it during class initiation, so we've moved that logic to when transform()
is called.
# Create a Feature Set from a DataFrame\ndf_to_features = PandasToFeatures(\"test_features\")\ndf_to_features.set_input(data_df, id_column=\"id\", one_hot_columns=[\"food\"])\ndf_to_features.set_output_tags([\"test\", \"small\"])\ndf_to_features.transform() <--- Overwrite happens here\n
"},{"location":"release_notes/0_8_71/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.60 --> v0.8.71
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call that feeling like you\u2019ve done this before? Deja-moo
"},{"location":"release_notes/0_8_71/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_72/","title":"Release 0.8.72","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
Note: These release notes cover the changes from 0.8.71
to 0.8.72
This release is an incremental release as part of the road map for v.0.9.0
. Please see the full details of the planned changes here: v0.9.0 Roadmap.
For content verification purposes we've added a hash()
method to all of the SageWorks Artifact classes (DataSource, FeatureSet, Model, Endpoint, Graph, etc). Also for DataSources and FeatureSets there is a table_hash()
method that will compute a total hash of all data in the Athena table.
ds = DataSource(\"abalone_data\")\n\nds.modified()\nOut[2]: datetime.datetime(2024, 11, 17, 19, 45, 58, tzinfo=tzlocal())\n\nds.hash()\nOut[3]: '67a9ebb495af573604794aa9c31eded8'\n\nds.table_hash()\nOut[4]: '622f5ddba9d4cad2cf642d1ea5555de9'\n\nfs = FeatureSet(\"test_features\")\n\nfs.hash()\nOut[5]: '1571eee207b72f14bd5065d6c4acdaaf'\n\n# Note: Model/Endpoint hashes will backtrack to model.tar.gz and can be used for validation\nmodel = Model(\"abalone-regression\")\nend = Endpoint(\"abalone-regression-end\")\n\nmodel.get_model_data_url()\nOut[6]: 's3://sagemaker-us-west-2-507740646243/abalone-regression-2024-11-18-03-09/output/model.tar.gz'\n\nmodel.hash()\nOut[7]: '00def9381366cdd062413d0b395ba70c'\n\n# Verify endpoint is using expected model\nend.hash()\nOut[7]: '00def9381366cdd062413d0b395ba70c'\n\n# Realtime endpoint created from the same model\nend = Endpoint(\"abalone-regression-end-rt\")\nend.hash()\nOut[8]: '00def9381366cdd062413d0b395ba70c'\n
Note: You will get a performance warning when running table_hash() on DataSources and FeatureSets as it typically involves a deeper computation on the table contents of that artifact.
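As a generic illustration of content verification by hash, here is a minimal sketch. The 32-character hex values shown above are consistent with an MD5 digest, but the algorithm that hash() actually uses is an assumption here, and content_hash is a hypothetical helper rather than a SageWorks function.

```python
import hashlib
import io


# Assumption: an MD5 hex digest over the raw bytes, read in chunks so that
# large files (e.g. a model.tar.gz) do not need to fit in memory at once.
def content_hash(stream, chunk_size=8192):
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        digest.update(chunk)
    return digest.hexdigest()


model_bytes = b'pretend this is model.tar.gz'
expected = content_hash(io.BytesIO(model_bytes))
deployed = content_hash(io.BytesIO(model_bytes))
print(expected == deployed)
```

Comparing the two digests is the same kind of check as comparing model.hash() with end.hash() in the example above.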
"},{"location":"release_notes/0_8_72/#api-changes","title":"API Changes","text":"get_database()
has a deprecation warning, it's replaced with just the database
property.
ds.get_database()\n<replaced by>\nds.database\n
Added the hash()
method to Artifacts (see above).
Added the table_hash() method to DataSources and FeatureSets (see above).
There was a small refactor of the cache decorator. We fixed a case where, if we blocked on getting a value, we also spun up a background thread to get it. This change will not affect existing code or APIs.
"},{"location":"release_notes/0_8_72/#specific-code-changes","title":"Specific Code Changes","text":"Code Diff v0.8.71 --> v0.8.72
Who doesn't like looking at code! Also +3 points for getting down this far! Here's a cow joke as a reward:
What do you call that feeling like you\u2019ve done this before? Deja-moo
"},{"location":"release_notes/0_8_72/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"release_notes/0_8_8/","title":"Release 0.8.8","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"release_notes/0_8_8/#additional-functionality","title":"Additional Functionality","text":"Auto Inference name change
When auto_inference is run on an endpoint the name of that inference run is currently training_holdout
. That is too close to model_training
and is confusing. So we're going to change the name to auto_inference
which is way more explanatory and intuitive.
Porting plugins: Plugins should not hard-code training_holdout
; instead, they should call list_inference_runs()
(see below) and use the first run on the list.
list_inference_runs()
The list_inference_runs()
method on Models has been improved. It now handles error states better (no model, no model training data) and returns 'model_training' LAST on the list, which should improve UX for plugin components.
Examples
model = Model(\"abalone-regression\")\nmodel.list_inference_runs()\nOut[1]: ['auto_inference', 'model_training']\n\nmodel = Model(\"wine-classification\")\nmodel.list_inference_runs()\nOut[2]: ['auto_inference', 'training_holdout', 'model_training']\n\nmodel = Model(\"aqsol-mol-regression\")\nmodel.list_inference_runs()\nOut[3]: ['training_holdout', 'model_training']\n\nmodel = Model(\"empty-model-group\")\nmodel.list_inference_runs()\nOut[4]: []\n
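For plugin authors, the 'use the first one on the list' advice can be sketched like this. The helper name pick_inference_run is hypothetical, and the lists are copied from the example outputs above rather than fetched from AWS.

```python
from typing import List, Optional


def pick_inference_run(runs: List[str]) -> Optional[str]:
    # Use the first run on the list; None when the model has no runs
    return runs[0] if runs else None


print(pick_inference_run(['auto_inference', 'model_training']))
print(pick_inference_run(['training_holdout', 'model_training']))
print(pick_inference_run([]))
```

Because 'model_training' is returned last, the first entry is the most specific inference run whenever one exists, and the empty list case needs its own handling.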
"},{"location":"release_notes/0_8_8/#glue-job-changes","title":"Glue Job Changes","text":"We're spinning up the CloudWatch Handler much earlier now, so if you're setting config like this:
from sageworks.utils.config_manager import ConfigManager\n\n# Set the SageWorks Config\ncm = ConfigManager()\ncm.set_config(\"SAGEWORKS_BUCKET\", args_dict[\"sageworks-bucket\"])\ncm.set_config(\"REDIS_HOST\", args_dict[\"redis-host\"])\n
Just switch out that code for the code below. Note: these variables need to be set before importing sageworks.
import os\n\n# Set these ENV vars for SageWorks (before importing sageworks)\nos.environ[\"SAGEWORKS_BUCKET\"] = args_dict[\"sageworks-bucket\"]\nos.environ[\"REDIS_HOST\"] = args_dict[\"redis-host\"]\n
"},{"location":"release_notes/0_8_8/#misc","title":"Misc","text":"Confusion Matrix support for 'ordinal' labels
Pandas has an \u2018ordinal\u2019 (ordered categorical) type, so the confusion matrix method endpoint.confusion_matrix()
now checks the label column to see if it\u2019s ordinal and, if so, uses that order; if not, it sorts the labels alphabetically.
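Here's a minimal sketch of that check using pandas directly. It assumes the label column is a pandas Series; the function name label_order is hypothetical and this is not the code inside endpoint.confusion_matrix().

```python
import pandas as pd


def label_order(labels: pd.Series) -> list:
    # Ordered categorical: keep the order defined on the dtype
    if isinstance(labels.dtype, pd.CategoricalDtype) and labels.cat.ordered:
        return list(labels.cat.categories)
    # Otherwise fall back to an alphabetical sort of the unique labels
    return sorted(labels.dropna().unique())


ordered = pd.Series(
    pd.Categorical(
        ['low', 'high', 'medium', 'low'],
        categories=['low', 'medium', 'high'],
        ordered=True,
    )
)
plain = pd.Series(['low', 'high', 'medium', 'low'])

print(label_order(ordered))
print(label_order(plain))
```

The ordered column keeps its declared order, while the plain string column falls back to alphabetical order.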
Note: This change may not affect your UI experience. Confusion matrices are saved in the SageWorks/S3 metadata storage, so a bunch of stuff upstream will also need to happen: a FeatureSet object/API for setting the label order, recreation of the model/endpoint and confusion matrix, etc. In general this is a forward-looking change that will be useful later. :)
"},{"location":"release_notes/0_8_8/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"repl/","title":"SageWorks REPL","text":"Visibility and Control
The SageWorks REPL provides AWS ML Pipeline visibility just like the SageWorks Dashboard but also provides control over the creation, modification, and deletion of artifacts through the Python API.
The SageWorks REPL is a customized IPython shell. It provides tailored functionality for easy interaction with SageWorks objects, and since it's based on IPython, developers will feel right at home using autocomplete, history, help, etc. Both easy and powerful, the SageWorks REPL puts control of AWS ML Pipelines at your fingertips.
"},{"location":"repl/#installation","title":"Installation","text":"pip install sageworks
Just type sageworks
at the command line and the SageWorks shell will spin up and provide a command view of your AWS Machine Learning Pipelines.
At startup, the SageWorks shell will connect to your AWS Account and create a summary of the Machine Learning artifacts currently residing on the account.
Available Commands:
All of the API Classes are auto-loaded, so drilling down on an individual artifact is easy. The same Python API is provided so if you want additional info on a model, for instance, simply create a model object and use any of the documented API methods.
m = Model(\"abalone-regression\")\nm.details()\n<shows info about the model>\n
"},{"location":"repl/#additional-resources","title":"Additional Resources","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"road_maps/0_9_0/#general","title":"General","text":"Streamlining
We've learned a lot from our beta testers!
One of the important lessons is not to 'over manage' AWS. We want to provide useful, granular Classes and APIs. Getting out of the way is just as important as providing functionality. So streamlining will be a big part of our 0.9.0
roadmap.
Horizontal Scaling
The framework is currently struggling with 10 ML pipelines being run concurrently. When running simultaneous pipelines we're seeing AWS service/query contention, throttling, and the occasional 'partial state' that we get back from AWS.
Our plan for 0.9.0
is to have formalized horizontal stress testing that tests everything from 4 to 32 concurrent ML pipelines. Even though 32 may not seem like much, AWS has various quotas and limits that we'll be hitting, so 32 is a good goal for 0.9.0
. Once we get to 32, we'll look toward an architecture that supports hundreds of concurrent pipelines.
Full Artifact Load or None
For the SageWorks DataSource, FeatureSet, Model, and Endpoint classes, the new functionality will ensure that objects are only instantiated when all required data is fully available, returning None if the artifact ID is invalid or if the object is only partially constructed in AWS.
By preventing partially constructed objects, this approach reduces runtime errors when accessing incomplete attributes and simplifies error handling for clients, enhancing robustness and reliability. We are looking at Pydantic for formally capturing schema and types (see Pydantic).
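To make the 'full artifact or None' idea concrete, here's a small sketch under stated assumptions: FAKE_AWS, REQUIRED_KEYS, ModelInfo, and load_model are all hypothetical stand-ins, not SageWorks classes, and a plain dictionary plays the role of AWS.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for what AWS reports about each artifact
FAKE_AWS = {
    'good_model': {'status': 'ready', 'model_data_url': 's3://bucket/model.tar.gz'},
    'half_built': {'status': 'creating'},
}

REQUIRED_KEYS = ('status', 'model_data_url')


@dataclass
class ModelInfo:
    name: str
    status: str
    model_data_url: str


def load_model(name: str) -> Optional[ModelInfo]:
    # Return a fully populated object, or None if anything is missing
    details = FAKE_AWS.get(name)
    if details is None:
        return None  # invalid artifact ID
    if any(key not in details for key in REQUIRED_KEYS):
        return None  # partially constructed in AWS
    return ModelInfo(name=name, **details)


print(load_model('good_model'))
print(load_model('half_built'))
print(load_model('does_not_exist'))
```

Clients then only need a single 'is None' check instead of guarding every attribute access.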
Onboarding
We'll have to balance the 'full artifact or None' with the need to onboard()
artifacts created outside of SageWorks. We'll probably have a class method for onboarding, something like:
my_model = Model.onboard(\"some_random_model\")\nif my_model is None:\n    pass  # handle failure to onboard here\n
Caching
Caching needs a substantial overhaul. Right now SageWorks overuses caching. We baked it into our AWSServiceBroker and that gets used by everything.
Caching only really makes sense when we can't wait for AWS to respond to requests. The Dashboard and Web Interfaces are the only use cases where responsiveness is important. Other use cases, like nightly batch processing, scripts, or notebooks, will work totally fine waiting for AWS responses.
Class/API Reductions
The organic growth of SageWorks was based on user feedback and testing, and that growth has led to an abundance of Classes and API calls. We'll be identifying classes and methods that are 'cruft' from some development push and will be deprecating those.
"},{"location":"road_maps/0_9_0/#deprecation-warnings","title":"Deprecation Warnings","text":"We're starting to put in deprecation warnings as we streamline classes and APIs. If you're using a class or method that's going to be deprecated you'll see a log message like this:
broker = AWSServiceBroker()\nWARNING AWSServiceBroker is deprecated and will be removed in version 0.9.\n
If you're using a class that's NOT going to be deprecated but currently uses/relies on one that is, you'll still get a warning that you can ignore (developers will take care of it).
# This class is NOT deprecated but an internal class is\nmeta = Meta() \nWARNING AWSServiceBroker is deprecated and will be removed in version 0.9.\n
In general these warning messages will be annoying but they will help us smoothly transition and streamline our Classes and APIs.
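For the curious, a log message like the ones above can be produced with a small class decorator. This is only a sketch of one way to do it, not how SageWorks does it: the decorator name deprecated, the logger name, and the OldBroker class are all hypothetical.

```python
import functools
import logging

log = logging.getLogger('deprecation_sketch')


def deprecated(version: str):
    # Class decorator: log a warning each time the class is instantiated
    def decorator(cls):
        original_init = cls.__init__

        @functools.wraps(original_init)
        def new_init(self, *args, **kwargs):
            log.warning(
                '%s is deprecated and will be removed in version %s.',
                cls.__name__,
                version,
            )
            original_init(self, *args, **kwargs)

        cls.__init__ = new_init
        return cls

    return decorator


@deprecated(version='0.9')
class OldBroker:
    def __init__(self):
        self.ready = True


broker = OldBroker()
```

Since the warning fires on instantiation, a non-deprecated class that builds a deprecated one internally will surface the same message, which matches the behavior described above.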
"},{"location":"road_maps/0_9_0/#deprecations","title":"Deprecations","text":"Meta()
The new Meta()
class will provide an API that aligns with the AWS list
and describe
APIs. We'll have functionality for listing objects (models, feature sets, etc.) and then functionality for getting the details of a named artifact.
meta = Meta()\nmodels_list = meta.models() # List API\nend_list = meta.endpoints() # List API\n\nfs_dict = meta.feature_set(\"my_fs\") # Describe API\nmodel_dict = meta.model(\"my_model\") # Describe API\n
The new Meta() API will be used inside of the Artifact classes (see Internal Changes...Artifacts... below)
"},{"location":"road_maps/0_9_0/#improvementsfixes","title":"Improvements/Fixes","text":"FeatureSet
When running concurrent ML pipelines we occasionally get a partially constructed FeatureSet. FeatureSets will now 'wait and fail' if they detect partially constructed data (like offline storage not being ready).
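A 'wait and fail' check can be sketched as a simple poll with a timeout. The function wait_until_ready, its default timeout and interval, and the simulated offline_store_ready check are all assumptions for illustration, not SageWorks internals.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    # Poll until is_ready() returns True or the timeout expires
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False  # 'fail' instead of handing back a partial object


# Simulated readiness check that succeeds on the third poll (hypothetical)
polls = {'count': 0}


def offline_store_ready() -> bool:
    polls['count'] += 1
    return polls['count'] >= 3


print(wait_until_ready(offline_store_ready, timeout=2.0, interval=0.01))
```

The caller gets a clear yes/no answer after a bounded wait, rather than an object that is only half there.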
"},{"location":"road_maps/0_9_0/#internal-changes","title":"Internal Changes","text":"Meta()
We're going to make a bunch of changes to Meta()
specifically around more granular (LESS) caching. Also there will be an AWSMeta()
subclass that manages the AWS specific API calls. We'll also put stubs in for AzureMeta()
and GCPMeta()
, cause hey we might have a client who really wants that flexibility.
The new Meta class will also include an API that's more aligned with the AWS list
and describe
interfaces, allowing both broad and deep queries of the Machine Learning Artifacts within AWS.
Artifacts
We're getting rid of caching for individual Artifacts. If you're constructing an artifact object, you probably want detailed information that's 'up to date' and waiting a bit is probably fine. Note: We'll still make these instantiations as fast as we can; removing the caching logic will at least simplify the implementations.
"},{"location":"road_maps/0_9_0/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"},{"location":"road_maps/0_9_5/","title":"Road Map v0.9.5","text":"Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
The SageWorks framework continues to flex to support different real world use cases when operating a set of production machine learning pipelines.
"},{"location":"road_maps/0_9_5/#general","title":"General","text":"ML Pipelines
We've learned a lot from our beta testers!
One of the important lessons is that when you make it easier to build ML Pipelines the users are going to build lots of pipelines.
For the creation, monitoring, and deployment of 50-100 pipelines, we need to focus on the consolidation of artifacts into Pipelines
.
Pipelines are DAGs
The use of Directed Acyclic Graphs for the storage and management of ML Pipelines will provide a good abstraction. Real-world ML Pipelines will often branch multiple times: 1 DataSource may become 2 FeatureSets, which may become 3 Models/Endpoints.
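The branching example above (1 DataSource, 2 FeatureSets, 3 Models) can be sketched as a DAG with Python's standard library graphlib module. The artifact names are hypothetical and this is not the planned SageWorks Pipeline storage format.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of artifacts it depends on (hypothetical names)
pipeline = {
    'features_a': {'datasource'},
    'features_b': {'datasource'},
    'model_1': {'features_a'},
    'model_2': {'features_a'},
    'model_3': {'features_b'},
}

# static_order() yields every node after all of its dependencies
build_order = list(TopologicalSorter(pipeline).static_order())
print(build_order)
```

A topological order like this gives a safe build sequence: every artifact appears after the artifacts it was created from.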
New Pipeline Dashboard Top Page
The current main page shows all the individual artifacts. As we scale up to 100s of models, we need 2 additional levels of aggregation:
New Pipeline Details Page
When a pipeline is clicked on the top page, a Pipeline details page comes up for that specific pipeline. This page will give all relevant information about the pipeline, including model performance, monitoring, and endpoint status.
Awesome image TBD
"},{"location":"road_maps/0_9_5/#versioned-artifacts","title":"Versioned Artifacts","text":"Our beta customers have requested versioning for artifacts, so we support versioning for both Models and FeatureSets. Endpoints and DataSources typically do not need versioning, so we may wait on the versioning support for those artifacts until a later version.
"},{"location":"road_maps/0_9_5/#questions","title":"Questions?","text":"The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or chat us up on Discord
"}]} \ No newline at end of file