The modern business world demands expedited decision-making, improved customer experience, and increased productivity. Gone are those days when business intelligence relied only on past data through batch processing. The order of the day is Operational analytics, which relies on measuring the existing or real-time operations of the business along with its past data trends.
First and foremost there is an exponential increase in data volumes and their varieties. Traditional DW needs to evolve constantly to meet this demand of changing needs. Most recent data are no more tabular in nature. Databases evolved into JSONs, Social logs, Images, Videos, and Time Series formats.
Of late the legacy relational database models are becoming showstoppers for programming and advanced analytics. With the legacy ways of datastore, the performance becomes a big bottleneck as the data grows into terabytes and petabytes.
So the need of the hour is to have a cohesive data model, which takes care of both the day-to-day operational data and its past.
Thus the coexistence of Atlas and Redshift evolves as the perfect solution for the business need.
The data from MongoDB Atlas is synced with Amazon Redshift in a two-step approach.
MongoDB Atlas has direct connectors with Apache Spark. Using the spark connectors the data is migrated from MongoDB Atlas to Redshift as a one-time load.
With the help of the MongoDB Atlas triggers or Amazon MSK, any delta changes to the database can be continuously written to the S3 bucket. From the S3 bucket, data can be loaded into the Redshift either through scheduled AWS Glue jobs or can be accessed as an external table.
In this demonstration, we provided step by step approach for each of these scenarios.
a) Good understanding of AWS Redshift, Glue, and S3 services.
b) Good understanding of MongoDB Atlas and Application services.
c) VPC and Network settings are already set up as per the security standards.
d) Redshift Database.
e) S3 bucket to store the JSON files.
f) MongoDB Atlas cluster for free cluster creation
g) Tools: VSCode, MongoDB Compass, Docker
a. Create a MongoDB Atlas cluster.
b. Configure the MongoDB Atlas cluster network security and access. Note down the connection string, user name and password to be referencenced in the future step.
c. Load the sample customer_activity data to a collection using MongoDB Compass. Note down the database name and collection name to be referenced in future step.
d. Create an Amazon Redshift Cluster and ensure the roles are created and associated with cluster as recommended in the link for creation.
e. Configure the Amazon Redshift Cluster network security and access.
Ensure the Redshift cluster is accessible from outside VPC , by enabling the "Modify publicly accessible setting". All other options are default.
f. Note down the database instance, database name ,username and password to be referenced in the future step.
g. Create an AWS role with AmazonDMSRedshiftS3Role and AWSGlueServiceRole policies to be used for the Glue Job. Add an in-line policy to grant access to the secrets and note down the role name.
in-line policy template
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": "arn:aws:secretsmanager:<region>:<accountid>:secret:<secretname>"
}
}
h. Store the database credentials as secrets using the AWS Secret Manager
Note down the secret name
j. Create an AWS Glue connection with the Amazon Redshift Database, by following the below steps.
i) Select "Connector" from the left side menu of AWS Glue Studio. Click "Create Connection" to create a new connection.
ii) Provide a name for the connection and select "Amazon Redshift" for the connection type, use the Redshift credentials noted down in the previous step.
iii) select the default VPC and a subnet - preferably Public Subnet for testing. (For prod environment, choose with the least access security model).
iv) Note down the connection name to be referenced in the future step.
v) Create a S3 endpoint.
vi) You can test the Connection using AWS Glue service.(Not AWS Glue studio) . Select the "Connections" from Legay Services and choose the connection you created and select "Test Connection".
k. Create a glue job in AWS Glue studio.
l. Select the Spark script editor and click "Create".
m. Copy the code from the link to the "Script" tab. Overwrite if there is a template code already available.
n. Update the MongoDB connection details for mongo_uri, and MongoDB Database credentials noted down in the previous step. Update the Database name and collection name as note above. Update the connection name you noted down in the previous step and provide a name for the Redshift database.
o. Configure the job name and AWS role in the "Job details" tab. You can keep all the other parameters as default.
p. Save the job and click "Run"
q. Ensure the job ran successfully
r. Validate the table and data in Redshift as described below.
i. Search for Redshift in AWS search bar.
ii. Open the "Amazon Redshift" service.
iii. select "Query" from the left side panel.
iv. click "Connect to Database" on the right.
v. Choose the "Create a new connection" , provide the database credentials and click "Connect".
v. Run the select query for the Redshift table created earlier, to validate the results.
example: select * from customer_activity
The Change Data Capture(CDC) feature of MongoDB Atlas is utilized to capture real-time data. Migrate the real-time data to the S3 bucket and then to Redshift by **ANY ONE ** of the following methods.
a. Create an MSK Cluster
b. Create custom plugins for MongoDB Atlas(source) using the zip file
c. Create custom plugins for S3(sink) using the zip file
d. Create a source connector to MongoDB Atlas using the custom plugin and code
e. Create a sink connector to the S3 bucket using the custom plugin and code.
f. Insert the data to MongoDB Atlas collection and ensure the data are written to the S3 bucket.
- The data from MongoDB Atlas can be continuously written to S3 bucket using the Data Federation and MongoDB Atlas triggers. Please refer to the link for the step-by-step instructions to capture the data to S3.
For any further reference, please follow the MongoDB documentation link
- Create a AWS Glue job to move the data from S3 bucket to AWS Redshift
a. Create the Glue Connections Redshift Database.
Navigate to the AWS Glue console and to the "Data Catalog" menu on the left panel. Select "Connections" and Click on "Add Connection". Enter the parameters taking guidance of the screenshots attached.
b.Create the Glue Connection for S3 bucket.
Test these connections are working fine.
C.Create the Crawler to populate Database and Tables in AWS Glue Catalog from S3. Navigate to "Crawlers" menu on the left side panel and click "Add Crawler". Add all the required information for the crawler taking guidance from the attached screenshots.
Once the crawlers are created successfully , run the crawler and ensure its successful completion.
Click the "Tables" menu from the left side and ensure the required tables are created.
c.Now we are all set to create the Job to map and populate the redshift tables. Click " Jobs" from the left side menu and select "Visual with a source and target"
select S3 bucket as source and Redshift as target from the dropdown.
update the source configuration
Validate the mapping and alter as required.
update the destination configuration
The scripts for the conversions are created automatically.
update the job details tab with job name and role. all other parameters are kept as default.
Run the job and ensure it's successful completion. Use the logs and Error logs generated for debugging. (if required)
Verify the data is populated successfully in the Redshift table.
Redshift Spectrum host the S3 bucket data as an external table. Provided the reference and steps to create the extenal table in the following link
Redshift Specturm - external table
Reference link for querying JSON structure data in Redshift.
The data thus populated from MongoDB Atlas either through AWS Glue or as a external tables in Redshift can be utilized to train the models . Redshift ML services enables to directly use the Sagemaker Models to train and infer results.
Please refer the link for training a pre-build model(customer churn model) and infer results.
Note: Steps to be followed in the above link. a. Complete the Administrative setup b. The data is already loaded to the Redshift table - "customer activity" and hence you can skips the steps for data loading in the Tutorial to build customer chrun model
Enterprises like Retail, Banking & Finance and Manufacturing are in great demand for the operational analytics for it's various real-time analytics.
A few are captured below
With the synergy it creates by having Atlas for its operational efficiency and Redshift for its DWH excellence, all the “Operational Analytics” use cases can be delivered in no time. The solution can be extended to integrate the AI/ML needs using the AWS SageMaker.
Hope you are able to setup the integration successfully. For any further reference pls reach out to [email protected]