Initial commit of gcs2bq.
rosmo committed May 6, 2020
1 parent 46a3c4d commit 6f38f1c
Showing 11 changed files with 780 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ The tools folder contains ready-made utilities which can simplify Google Cloud P
* [GCP Organization Hierarchy Viewer](tools/gcp-org-hierarchy-viewer) - A CLI utility for visualizing your organization hierarchy in the terminal.
* [GCS Bucket Mover](tools/gcs-bucket-mover) - A tool to move a user's bucket, including objects, metadata, and ACLs, from one project to another.
* [GCS Usage Recommender](tools/gcs-usage-recommender) - A tool that generates bucket-level intelligence and access patterns across all projects to recommend object lifecycle management.
* [GCS to BigQuery](tools/gcs2bq) - A tool that fetches object metadata from all Google Cloud Storage buckets and exports it in a format that can be imported into BigQuery for further analysis.
* [GKE Billing Export](tools/gke-billing-export) - Google Kubernetes Engine fine-grained billing export.
* [GSuite Exporter](tools/gsuite-exporter/) - A Python package that automates syncing Admin SDK API activity reports to a GCP destination. The module takes entries from the chosen Admin SDK API, converts them into the appropriate format for the destination, and exports them to a destination (e.g. Stackdriver Logging).
* [Hive to BigQuery](tools/hive-bigquery/) - A Python framework to migrate Hive tables to BigQuery using Cloud SQL to keep track of the migration progress.
15 changes: 15 additions & 0 deletions tools/gcs2bq/.gitignore
@@ -0,0 +1,15 @@
# Binaries for programs and plugins
*.exe
*.exe~
*.dll
*.so
*.dylib

# Test binary, built with `go test -c`
*.test

# Output of the go coverage tool, specifically when used with LiteIDE
*.out

# Dependency directories (remove the comment below to include it)
# vendor/
18 changes: 18 additions & 0 deletions tools/gcs2bq/Dockerfile
@@ -0,0 +1,18 @@
# Build stage: compile a statically linked gcs2bq binary
FROM golang:1.14

WORKDIR /go/src/github.com/rosmo/gcs2bq
COPY main.go .

RUN go get -v ./...
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o /gcs2bq .

# Runtime stage: copy the binary and supporting files into a slim Cloud SDK image
FROM google/cloud-sdk:slim
WORKDIR /
RUN chown -R 1000 /home
COPY --from=0 /gcs2bq .
COPY gcs2bq.avsc .
COPY bigquery.schema .
COPY run.sh .
RUN chmod +x run.sh
CMD ["/run.sh"]

202 changes: 202 additions & 0 deletions tools/gcs2bq/README.md
@@ -0,0 +1,202 @@
# Google Cloud Storage to BigQuery

## Ever wanted to know what your organization's average file creation time or size is?

![Datastudio sample dashboard](datastudio.png)

This small application discovers all buckets in a Google Cloud Platform organization,
fetches all the objects in those buckets, and creates an Avro file containing every object
and its attributes. The resulting file can then be imported into BigQuery.

### Building

You can build it either manually or with the supplied `Dockerfile`:

```bash
export GOOGLE_PROJECT=your-project
docker build -t eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest .
docker push eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest
```
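Once the image is pushed, a quick local smoke test can be run with `docker run`. This is a
sketch, not the documented workflow: it assumes a service account key file `sa.json` in the
current directory and relies on `GOOGLE_APPLICATION_CREDENTIALS` for Application Default
Credentials; the env values are placeholders (they are described under Usage below).

```bash
docker run --rm \
  -v "$PWD/sa.json:/sa.json:ro" \
  -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json \
  -e GCS2BQ_PROJECT=your-project \
  -e GCS2BQ_DATASET=gcs2bq \
  -e GCS2BQ_TABLE=objects \
  -e GCS2BQ_BUCKET=your-staging-bucket \
  -e GCS2BQ_LOCATION=EU \
  eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest
```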

### Usage

```bash
$ ./gcs2bq -help
Google Cloud Storage object metadata to BigQuery, version 0.1
Usage of ./gcs2bq:
-alsologtostderr
log to standard error as well as files
-buffer_size int
file buffer (default 1000)
-concurrency int
concurrency (GOMAXPROCS) (default 4)
-file string
output file name (default "gcs.avro")
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory
-logtostderr
log to standard error instead of files
-stderrthreshold value
logs at or above this threshold go to stderr
-v value
log level for V logs
-versions
include GCS object versions
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
```
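For example, the following collects metadata from all reachable buckets, including object
versions, into `objects.avro` using eight workers (the flag values are illustrative):

```bash
./gcs2bq -logtostderr -versions -concurrency 8 -file objects.avro
```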

You can also use the supplied `run.sh` script, which accepts the following
environment variables as input (an example invocation follows the list):

- `GCS2BQ_PROJECT`: project ID where the storage bucket and BigQuery dataset reside
- `GCS2BQ_DATASET`: BigQuery dataset name (e.g. `gcs2bq`)
- `GCS2BQ_TABLE`: BigQuery table name (e.g. `objects`)
- `GCS2BQ_BUCKET`: bucket for storing the temporary Avro file to be loaded into BigQuery (no `gs://` prefix)
- `GCS2BQ_LOCATION`: location for the bucket and dataset, in case they need to be created (e.g. `EU`)
- `GCS2BQ_VERSIONS`: set to a non-empty value to retrieve object versions as well
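
A local invocation of the script might then look like this sketch (all values are placeholders):

```bash
export GCS2BQ_PROJECT=your-project
export GCS2BQ_DATASET=gcs2bq
export GCS2BQ_TABLE=objects
export GCS2BQ_BUCKET=your-staging-bucket
export GCS2BQ_LOCATION=EU
export GCS2BQ_VERSIONS=1
./run.sh
```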

## IAM permissions on GCP

To be able to discover all projects and buckets, the service account that you
run GCS2BQ under should have the following permissions at the organization level:

- List all projects: `resourcemanager.projects.get`
- List buckets: `storage.buckets.list`
- List objects in bucket: `storage.objects.list`
- Read ACLs from objects in bucket: `storage.objects.getIamPolicy`

These permissions can be granted in part with the following predefined role, which
lacks the permission to retrieve ACLs:

- Storage Object Viewer: `roles/storage.objectViewer`

There is also a custom role in [gcs2bq-custom-role.yaml](gcs2bq-custom-role.yaml) that
has only the necessary permissions. See the file for instructions.

To write the data through GCS into BigQuery, you'll need the following roles in the
project that hosts the BigQuery dataset (a `gcloud` sketch follows the list):

- Storage Admin: `roles/storage.admin`
- BigQuery User: `roles/bigquery.user`
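
Assuming a service account named `gcs2bq`, as in the custom-role instructions, the
bindings could be granted like this (project IDs are placeholders):

```bash
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/storage.admin'
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/bigquery.user'
```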

### BigQuery schema

See [bigquery.schema](bigquery.schema) for the BigQuery table schema. The Avro
schema is in [gcs2bq.avsc](gcs2bq.avsc).
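
If you prefer to create the dataset and table yourself instead of letting `run.sh`
create them, here is a sketch using the `bq` CLI (dataset, table, and bucket names
are placeholders):

```bash
# Create the dataset and the table from the checked-in schema file
bq mk --location=EU gcs2bq
bq mk --table gcs2bq.objects bigquery.schema
# Load a previously generated Avro file from GCS; --use_avro_logical_types
# maps Avro timestamps onto the TIMESTAMP columns, assuming the Avro schema
# declares them as logical types
bq load --source_format=AVRO --use_avro_logical_types \
  gcs2bq.objects gs://YOUR_BUCKET/gcs.avro
```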

## Sample BigQuery queries

### Find average file age and total size per storage class

```sql
SELECT
project_id,
bucket,
ROUND(AVG(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), created, DAY)), 1) AS created_average_days,
SUM(
IF
(storage_class='STANDARD',
size,
0)) AS size_standard,
SUM(
IF
(storage_class='NEARLINE',
size,
0)) AS size_nearline,
SUM(
IF
(storage_class='COLDLINE',
size,
0)) AS size_coldline,
SUM(
IF
(storage_class='ARCHIVE',
size,
0)) AS size_archived
FROM
gcs2bq.files
GROUP BY
project_id,
bucket
```

### Find a histogram of how data is allocated across different file sizes

```sql
SELECT
CASE
WHEN histogram_bucket = 1 THEN "< 1 KB"
WHEN histogram_bucket = 2 THEN "< 100 KB"
WHEN histogram_bucket = 3 THEN "< 1 MB"
WHEN histogram_bucket = 4 THEN "< 100 MB"
WHEN histogram_bucket = 5 THEN "< 1 GB"
ELSE
"> 1 GB"
END
AS class,
SUM(size) AS total_size
FROM (
SELECT
size,
CASE
WHEN size <= 1024 THEN 1
WHEN size <= 1024*100 THEN 2
WHEN size <= 1024*1024 THEN 3
WHEN size <= 1024*1024*100 THEN 4
WHEN size <= 1024*1024*1024 THEN 5
ELSE
6
END
AS histogram_bucket
FROM
gcs2bq.files )
GROUP BY
histogram_bucket
ORDER BY
histogram_bucket ASC
```

### Find owners with most data

```sql
SELECT
owner,
SUM(size) AS total_size
FROM
gcs2bq.files
GROUP BY
owner
ORDER BY
total_size DESC
```

### Find duplicate files across all buckets

```sql
SELECT
  md5,
  COUNT(*) AS duplicates,
  SUM(size) AS total_size,
  ARRAY_AGG(CONCAT("gs://", bucket, "/", name)) AS files
FROM
  gcs2bq.files
GROUP BY
  md5
HAVING
  duplicates > 1
```


### Running in GKE as a CronJob

You can deploy the container as a `CronJob` in Google Kubernetes Engine. See the file
[gcs2bq.yaml](gcs2bq.yaml). Replace the environment parameters with values appropriate
for your environment.
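
Deploying and smoke-testing the job might look like the following; the CronJob name
`gcs2bq` is an assumption based on the file name, so check the manifest for the
actual value:

```bash
kubectl apply -f gcs2bq.yaml
kubectl get cronjobs
# Trigger a one-off run to verify permissions before the first scheduled run
kubectl create job --from=cronjob/gcs2bq gcs2bq-manual-test
```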

118 changes: 118 additions & 0 deletions tools/gcs2bq/bigquery.schema
@@ -0,0 +1,118 @@
[{
"name": "project_id",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "bucket",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_type",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_language",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "cache_control",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "event_based_hold",
"type": "BOOLEAN",
"mode": "REQUIRED"
}, {
"name": "temporary_hold",
"type": "BOOLEAN",
"mode": "REQUIRED"
}, {
"name": "retention_expiration_time",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "acl",
"type": "RECORD",
"mode": "REPEATED",
"fields": [{
"name": "key",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "value",
"type": "STRING",
"mode": "REQUIRED"
}]
}, {
"name": "predefined_acl",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "owner",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "size",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "content_encoding",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_disposition",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "md5",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "crc32c",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "media_link",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "generation",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "metageneration",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "storage_class",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "created",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "deleted",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "updated",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "customer_key_sha256",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "kms_key_name",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "etag",
"type": "STRING",
"mode": "REQUIRED"
}]
Binary file added tools/gcs2bq/datastudio.png
16 changes: 16 additions & 0 deletions tools/gcs2bq/gcs2bq-custom-role.yaml
@@ -0,0 +1,16 @@
# To create the custom role:
# gcloud iam roles create gcs2bq --organization=ORGANIZATION_ID --file=gcs2bq-custom-role.yaml
#
# To grant the custom role at organization level:
# gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
# --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
# --role='organizations/ORGANIZATION_ID/roles/gcs2bq'
#
title: "GCS2BQ service account"
description: "GCS2BQ service account"
stage: GA
includedPermissions:
- resourcemanager.projects.get
- storage.buckets.list
- storage.objects.list
- storage.objects.getIamPolicy