In this article, we will go through the lab GSP323 Perform Foundational Data, ML, and AI Tasks in Google Cloud: Challenge Lab, which is labeled as an expert-level exercise. You will practice your skills with Dataflow, Dataproc, and Dataprep, as well as Google Cloud AI APIs such as the Speech-to-Text, Natural Language, and Video Intelligence APIs.
The challenge contains four required tasks:
- Run a simple Dataflow job
- Run a simple Dataproc job
- Run a simple Dataprep job
- Perform tasks with Cloud AI APIs (Speech-to-Text, Natural Language, and Video Intelligence)
In this task, you have to transfer the data in a CSV file to BigQuery using Dataflow. First of all, you need to create a BigQuery dataset and a Cloud Storage bucket using the names given in your lab instructions (typically a dataset called lab and a bucket named after your project ID).
- In the Cloud Console, click on Navigation Menu > BigQuery.
- Select your project in the left pane.
- Click CREATE DATASET.
- Enter the dataset name given in your lab instructions in the Dataset ID field, then click Create dataset.
- Run `gsutil cp gs://cloud-training/gsp323/lab.schema .` in Cloud Shell to download the schema file.
- View the schema by running `cat lab.schema` and copy the schema output.
- Go back to the Cloud Console, select the new dataset, and click Create Table.
- In the Create table dialog, select Google Cloud Storage from the dropdown in the Source section.
- Enter `cloud-training/gsp323/lab.csv` in “Select file from GCS bucket”.
- Enter `customers` as the “Table name” in the Destination section.
- Enable Edit as text and paste the JSON from the `lab.schema` file into the textarea in the Schema section.
- Click Create table.
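If you prefer the command line, the same dataset and table can be created from Cloud Shell. Below is a minimal sketch, assuming the dataset is called lab and the table customers; substitute the names from your own lab instructions (and add --skip_leading_rows=1 if the CSV turns out to have a header row):

```bash
# Download the schema file used by the table
gsutil cp gs://cloud-training/gsp323/lab.schema .

# Create the dataset (replace "lab" with the dataset name from your lab)
bq mk lab

# Load the CSV from Cloud Storage into a new "customers" table using the schema file
bq load --source_format=CSV lab.customers gs://cloud-training/gsp323/lab.csv lab.schema
```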
- In the Cloud Console, click on Navigation Menu > Storage.
- Click CREATE BUCKET.
- Copy your bucket name from the lab panel and paste it into the Name field.
- Click CREATE.
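The bucket can also be created with one command from Cloud Shell — a small sketch, assuming the bucket name is your project ID as the lab specifies:

```bash
# Create a Cloud Storage bucket named after the project ID (adjust if your lab says otherwise)
gsutil mb gs://$DEVSHELL_PROJECT_ID
```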
- In the Cloud Console, click on Navigation Menu > Dataflow.
- Click CREATE JOB FROM TEMPLATE.
- In Create job from template, give an arbitrary job name.
- From the dropdown under Dataflow template, select Text Files on Cloud Storage to BigQuery under “Process Data in Bulk (batch)”. (DO NOT select the item under “Process Data Continuously (stream)”).
- Under the Required parameters, enter the following values:
| Field | Value |
|---|---|
| JavaScript UDF path in Cloud Storage | gs://cloud-training/gsp323/lab.js |
| JSON path | gs://cloud-training/gsp323/lab.schema |
| JavaScript UDF name | transform |
| BigQuery output table | As given in your lab instructions |
| Cloud Storage input path | gs://cloud-training/gsp323/lab.csv |
| Temporary BigQuery directory | As given in your lab instructions |
| Temporary location | As given in your lab instructions |
- Click RUN JOB.
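The same batch job can also be launched from Cloud Shell with gcloud. This is a rough sketch using the public GCS_Text_to_BigQuery template; the region, output table, and temporary directory placeholders must be replaced with the values from your lab:

```bash
# Run the "Text Files on Cloud Storage to BigQuery" batch template.
# Replace YOUR_LAB_REGION, YOUR_PROJECT, YOUR_DATASET and YOUR_BUCKET with your lab's values.
gcloud dataflow jobs run lab-dataflow-job \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --region YOUR_LAB_REGION \
  --parameters \
javascriptTextTransformGcsPath=gs://cloud-training/gsp323/lab.js,\
JSONPath=gs://cloud-training/gsp323/lab.schema,\
javascriptTextTransformFunctionName=transform,\
outputTable=YOUR_PROJECT:YOUR_DATASET.customers,\
inputFilePattern=gs://cloud-training/gsp323/lab.csv,\
bigQueryLoadingTemporaryDirectory=gs://YOUR_BUCKET/bigquery_temp
```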
- In the Cloud Console, click on Navigation Menu > Dataproc > Clusters.
- Click CREATE CLUSTER.
- Make sure the cluster will be created in the region specified in your lab instructions.
- Click Create.
- After the cluster has been created, click the SSH button in the row of the master instance.
- In the SSH console, run the following command:
hdfs dfs -cp gs://cloud-training/gsp323/data.txt /data.txt
- Close the SSH window and go back to the Cloud Console.
- Click SUBMIT JOB on the cluster details page.
- Select Spark from the dropdown of “Job type”.
- Copy `org.apache.spark.examples.SparkPageRank` to “Main class or jar”.
- Copy `file:///usr/lib/spark/examples/jars/spark-examples.jar` to “Jar files”.
- Enter `/data.txt` in “Arguments”.
- Click Submit.
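The console steps above map to two gcloud commands. A sketch, assuming a cluster named lab-cluster and the region from your lab instructions (you still need the SSH step above to copy data.txt into HDFS before submitting):

```bash
# Create a Dataproc cluster (replace YOUR_LAB_REGION with the region from your lab)
gcloud dataproc clusters create lab-cluster --region YOUR_LAB_REGION

# Submit the SparkPageRank example job against /data.txt in HDFS
gcloud dataproc jobs submit spark \
  --cluster lab-cluster \
  --region YOUR_LAB_REGION \
  --class org.apache.spark.examples.SparkPageRank \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- /data.txt
```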
- In the Cloud Console, click on Navigation menu > Dataprep.
- After entering the home page of Cloud Dataprep, click the Import Data button.
- In the Import Data page, select GCS in the left pane.
- Click on the pencil icon under Choose a file or folder.
- Copy the path below, paste it into the textbox, and click the Go button next to it.
gs://cloud-training/gsp323/runs.csv
- After showing the preview of runs.csv in the right pane, click on the Import & Wrangle button.
- After the Dataprep Transformer page loads, scroll right to the end and select column10.
- In the Details pane, click FAILURE under Unique Values to show the context menu.
- Select Delete rows with selected values to remove all rows with the state “FAILURE”.
- Click the downward arrow next to column9, choose Filter rows > On column value > Contains.
- In the Filter rows pane, enter the regex pattern `/(^0$|^0\.0$)/` in “Pattern to match”.
- Select Delete matching rows under the Action section, then click the Add button.
- Rename the columns to:
  - runid
  - userid
  - labid
  - lab_title
  - start
  - end
  - time
  - score
  - state
- Confirm the recipe and add it.
- Click Run Job.
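Dataprep has no simple CLI equivalent, but you can peek at the raw source file from Cloud Shell to double-check the columns you just renamed:

```bash
# Preview the first few rows of the CSV that feeds the Dataprep flow
gsutil cat gs://cloud-training/gsp323/runs.csv | head -n 5
```

With the Dataprep job running, the last task moves on to the Cloud AI APIs: the commands below create a service account and key as the lab asks, then call the Speech-to-Text, Natural Language, and Video Intelligence APIs.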
gcloud iam service-accounts create my-natlang-sa \
--display-name "my natural language service account"
gcloud iam service-accounts keys create ~/key.json \
--iam-account my-natlang-sa@$DEVSHELL_PROJECT_ID.iam.gserviceaccount.com
wget https://raw.githubusercontent.com/guys-in-the-cloud/cloud-skill-boosts/main/Challenge-labs/Perform%20Foundational%20Data%2C%20ML%2C%20and%20AI%20Tasks%20in%20Google%20Cloud%3A%20Challenge%20Lab/speech-request.json
curl -s -X POST -H "Content-Type: application/json" --data-binary @speech-request.json \
"https://speech.googleapis.com/v1/speech:recognize?key=${API_KEY}" > speech.json
- Replace `<enter first bucket url>` with the first bucket URL given in your lab:
gsutil cp speech.json <enter first bucket url>
gcloud ml language analyze-entities --content="Old Norse texts portray Odin as one-eyed and long-bearded, frequently wielding a spear named Gungnir and wearing a cloak and a broad hat." > language.json
- Replace `<enter second bucket url>` with the second bucket URL given in your lab:
gsutil cp language.json <enter second bucket url>
wget https://raw.githubusercontent.com/guys-in-the-cloud/cloud-skill-boosts/main/Challenge-labs/Perform%20Foundational%20Data%2C%20ML%2C%20and%20AI%20Tasks%20in%20Google%20Cloud%3A%20Challenge%20Lab/video-intelligence-request.json
curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://videointelligence.googleapis.com/v1/videos:annotate" \
  -d @video-intelligence-request.json > video.json
- Replace `<enter third bucket url>` with the third bucket URL given in your lab:
gsutil cp video.json <enter third bucket url>
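For reference, the video-intelligence-request.json fetched above follows the videos:annotate request shape sketched below; the inputUri is a placeholder, so use the video URI from your lab instructions:

```bash
# Sketch only: recreate video-intelligence-request.json by hand if needed.
# The gs:// URI below is a placeholder -- replace it with the video from your lab.
cat > video-intelligence-request.json <<'EOF'
{
  "inputUri": "gs://YOUR_LAB_VIDEO_FILE.mp4",
  "features": ["LABEL_DETECTION"]
}
EOF
```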
Stay tuned for the next blog!
- LinkedIn: https://www.linkedin.com/in/akshat-jjain
- Twitter: https://twitter.com/akshatjain_13
- YouTube Channel: https://youtube.com/channel/UCQUEgfYbcz7pv36NoAv7S-Q/