Initial commit of gcs2bq.
rosmo committed May 6, 2020
1 parent 46a3c4d commit 6f38f1c
Showing 11 changed files with 780 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ The tools folder contains ready-made utilities which can simplify Google Cloud P
* [GCP Organization Hierarchy Viewer](tools/gcp-org-hierarchy-viewer) - A CLI utility for visualizing your organization hierarchy in the terminal.
* [GCS Bucket Mover](tools/gcs-bucket-mover) - A tool to move a user's bucket, including objects, metadata, and ACLs, from one project to another.
* [GCS Usage Recommender](tools/gcs-usage-recommender) - A tool that generates bucket-level intelligence and access patterns across all projects to recommend object lifecycle management.
* [GCS to BigQuery](tools/gcs2bq) - A tool that fetches object metadata from all Google Cloud Storage buckets and exports it in a format that can be imported into BigQuery for further analysis.
* [GKE Billing Export](tools/gke-billing-export) - Google Kubernetes Engine fine-grained billing export.
* [GSuite Exporter](tools/gsuite-exporter/) - A Python package that automates syncing Admin SDK API activity reports to a GCP destination. The module takes entries from the chosen Admin SDK API, converts them into the appropriate format for the destination, and exports them to a destination (e.g. Stackdriver Logging).
* [Hive to BigQuery](tools/hive-bigquery/) - A Python framework to migrate Hive tables to BigQuery using Cloud SQL to keep track of the migration progress.
15 changes: 15 additions & 0 deletions tools/gcs2bq/.gitignore
@@ -0,0 +1,15 @@
# Binaries for programs and plugins
*.exe
*.exe~
*.dll
*.so
*.dylib

# Test binary, built with `go test -c`
*.test

# Output of the go coverage tool, specifically when used with LiteIDE
*.out

# Dependency directories (remove the comment below to include it)
# vendor/
18 changes: 18 additions & 0 deletions tools/gcs2bq/Dockerfile
@@ -0,0 +1,18 @@
# Build stage: compile a statically linked gcs2bq binary
FROM golang:1.14

WORKDIR /go/src/github.com/rosmo/gcs2bq
COPY main.go .

RUN go get -v ./...
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o /gcs2bq .

# Runtime stage: copy the binary and supporting files into a slim Cloud SDK image
FROM google/cloud-sdk:slim
WORKDIR /
RUN chown -R 1000 /home
COPY --from=0 /gcs2bq .
COPY gcs2bq.avsc .
COPY bigquery.schema .
COPY run.sh .
RUN chmod +x run.sh
CMD ["/run.sh"]

202 changes: 202 additions & 0 deletions tools/gcs2bq/README.md
@@ -0,0 +1,202 @@
# Google Cloud Storage to BigQuery

## Ever wanted to know what your organization's average file creation time or size is?

![Datastudio sample dashboard](datastudio.png)

This small application discovers all buckets in a Google Cloud Platform organization,
fetches all the objects in those buckets, and creates an Avro file containing every object
and its attributes. The resulting file can then be imported into BigQuery.

### Building

You can build it either manually or with the supplied `Dockerfile`:

```bash
export GOOGLE_PROJECT=your-project
docker build -t eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest .
docker push eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest
```
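Once the image is pushed, a quick local smoke test can be run with `docker run`. This is a
sketch, not the documented workflow: it assumes a service account key file `sa.json` in the
current directory and relies on `GOOGLE_APPLICATION_CREDENTIALS` for Application Default
Credentials; the env values are placeholders (they are described under Usage below).

```bash
docker run --rm \
  -v "$PWD/sa.json:/sa.json:ro" \
  -e GOOGLE_APPLICATION_CREDENTIALS=/sa.json \
  -e GCS2BQ_PROJECT=your-project \
  -e GCS2BQ_DATASET=gcs2bq \
  -e GCS2BQ_TABLE=objects \
  -e GCS2BQ_BUCKET=your-staging-bucket \
  -e GCS2BQ_LOCATION=EU \
  eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest
```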

### Usage

```bash
$ ./gcs2bq -help
Google Cloud Storage object metadata to BigQuery, version 0.1
Usage of ./gcs2bq:
-alsologtostderr
log to standard error as well as files
-buffer_size int
file buffer (default 1000)
-concurrency int
concurrency (GOMAXPROCS) (default 4)
-file string
output file name (default "gcs.avro")
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory
-logtostderr
log to standard error instead of files
-stderrthreshold value
logs at or above this threshold go to stderr
-v value
log level for V logs
-versions
include GCS object versions
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
```
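For example, the following collects metadata from all reachable buckets, including object
versions, into `objects.avro` using eight workers (the flag values are illustrative):

```bash
./gcs2bq -logtostderr -versions -concurrency 8 -file objects.avro
```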

You can also use the supplied `run.sh` script, which accepts the following
environment variables as input (an example invocation follows the list):

- `GCS2BQ_PROJECT`: project ID where the storage bucket and BigQuery dataset reside
- `GCS2BQ_DATASET`: BigQuery dataset name (e.g. `gcs2bq`)
- `GCS2BQ_TABLE`: BigQuery table name (e.g. `objects`)
- `GCS2BQ_BUCKET`: bucket for storing the temporary Avro file to be loaded into BigQuery (no `gs://` prefix)
- `GCS2BQ_LOCATION`: location for the bucket and dataset, in case they need to be created (e.g. `EU`)
- `GCS2BQ_VERSIONS`: set to a non-empty value to retrieve object versions as well
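
A local invocation of the script might then look like this sketch (all values are placeholders):

```bash
export GCS2BQ_PROJECT=your-project
export GCS2BQ_DATASET=gcs2bq
export GCS2BQ_TABLE=objects
export GCS2BQ_BUCKET=your-staging-bucket
export GCS2BQ_LOCATION=EU
export GCS2BQ_VERSIONS=1
./run.sh
```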

## IAM permissions on GCP

To be able to discover all projects and buckets, the service account that you
run GCS2BQ under should have the following permissions at the organization level:

- List all projects: `resourcemanager.projects.get`
- List buckets: `storage.buckets.list`
- List objects in bucket: `storage.objects.list`
- Read ACLs from objects in bucket: `storage.objects.getIamPolicy`

These permissions can be granted in part with the following predefined role, which
lacks the permission to retrieve ACLs:

- Storage Object Viewer: `roles/storage.objectViewer`

There is also a custom role in [gcs2bq-custom-role.yaml](gcs2bq-custom-role.yaml) that
has only the necessary permissions. See the file for instructions.

To write the data through GCS into BigQuery, you'll need the following roles in the
project that hosts the BigQuery dataset (a `gcloud` sketch follows the list):

- Storage Admin: `roles/storage.admin`
- BigQuery User: `roles/bigquery.user`
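
Assuming a service account named `gcs2bq`, as in the custom-role instructions, the
bindings could be granted like this (project IDs are placeholders):

```bash
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/storage.admin'
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/bigquery.user'
```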

### BigQuery schema

See [bigquery.schema](bigquery.schema) for the BigQuery table schema. The Avro
schema is in [gcs2bq.avsc](gcs2bq.avsc).
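
If you prefer to create the dataset and table yourself instead of letting `run.sh`
create them, here is a sketch using the `bq` CLI (dataset, table, and bucket names
are placeholders):

```bash
# Create the dataset and the table from the checked-in schema file
bq mk --location=EU gcs2bq
bq mk --table gcs2bq.objects bigquery.schema
# Load a previously generated Avro file from GCS; --use_avro_logical_types
# maps Avro timestamps onto the TIMESTAMP columns, assuming the Avro schema
# declares them as logical types
bq load --source_format=AVRO --use_avro_logical_types \
  gcs2bq.objects gs://YOUR_BUCKET/gcs.avro
```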

## Sample BigQuery queries

### Find average file age and total size per storage class

```sql
SELECT
project_id,
bucket,
ROUND(AVG(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), created, DAY)), 1) AS created_average_days,
SUM(
IF
(storage_class='STANDARD',
size,
0)) AS size_standard,
SUM(
IF
(storage_class='NEARLINE',
size,
0)) AS size_nearline,
SUM(
IF
(storage_class='COLDLINE',
size,
0)) AS size_coldline,
SUM(
IF
(storage_class='ARCHIVE',
size,
0)) AS size_archived
FROM
gcs2bq.files
GROUP BY
project_id,
bucket
```

### Find a histogram of how data is allocated across different file sizes

```sql
SELECT
CASE
WHEN histogram_bucket = 1 THEN "< 1 KB"
WHEN histogram_bucket = 2 THEN "< 100 KB"
WHEN histogram_bucket = 3 THEN "< 1 MB"
WHEN histogram_bucket = 4 THEN "< 100 MB"
WHEN histogram_bucket = 5 THEN "< 1 GB"
ELSE
"> 1 GB"
END
AS class,
SUM(size) AS total_size
FROM (
SELECT
size,
CASE
WHEN size <= 1024 THEN 1
WHEN size <= 1024*100 THEN 2
WHEN size <= 1024*1024 THEN 3
WHEN size <= 1024*1024*100 THEN 4
WHEN size <= 1024*1024*1024 THEN 5
ELSE
6
END
AS histogram_bucket
FROM
gcs2bq.files )
GROUP BY
histogram_bucket
ORDER BY
histogram_bucket ASC
```

### Find owners with most data

```sql
SELECT
owner,
SUM(size) AS total_size
FROM
gcs2bq.files
GROUP BY
owner
ORDER BY
total_size DESC
```

### Find duplicate files across all buckets

```sql
SELECT
  md5,
  COUNT(*) AS duplicates,
  SUM(size) AS total_size,
  ARRAY_AGG(CONCAT("gs://", bucket, "/", name)) AS files
FROM
  gcs2bq.files
GROUP BY
  md5
HAVING
  duplicates > 1
```


### Running in GKE as a CronJob

You can deploy the container as a `CronJob` in Google Kubernetes Engine. See the file
[gcs2bq.yaml](gcs2bq.yaml). Replace the environment parameters with values appropriate
for your environment.
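
Deploying and smoke-testing the job might look like the following; the CronJob name
`gcs2bq` is an assumption based on the file name, so check the manifest for the
actual value:

```bash
kubectl apply -f gcs2bq.yaml
kubectl get cronjobs
# Trigger a one-off run to verify permissions before the first scheduled run
kubectl create job --from=cronjob/gcs2bq gcs2bq-manual-test
```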

118 changes: 118 additions & 0 deletions tools/gcs2bq/bigquery.schema
@@ -0,0 +1,118 @@
[{
"name": "project_id",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "bucket",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_type",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_language",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "cache_control",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "event_based_hold",
"type": "BOOLEAN",
"mode": "REQUIRED"
}, {
"name": "temporary_hold",
"type": "BOOLEAN",
"mode": "REQUIRED"
}, {
"name": "retention_expiration_time",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "acl",
"type": "RECORD",
"mode": "REPEATED",
"fields": [{
"name": "key",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "value",
"type": "STRING",
"mode": "REQUIRED"
}]
}, {
"name": "predefined_acl",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "owner",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "size",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "content_encoding",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "content_disposition",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "md5",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "crc32c",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "media_link",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "generation",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "metageneration",
"type": "INTEGER",
"mode": "REQUIRED"
}, {
"name": "storage_class",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "created",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "deleted",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "updated",
"type": "TIMESTAMP",
"mode": "NULLABLE"
}, {
"name": "customer_key_sha256",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "kms_key_name",
"type": "STRING",
"mode": "REQUIRED"
}, {
"name": "etag",
"type": "STRING",
"mode": "REQUIRED"
}]
Binary file added tools/gcs2bq/datastudio.png
16 changes: 16 additions & 0 deletions tools/gcs2bq/gcs2bq-custom-role.yaml
@@ -0,0 +1,16 @@
# To create the custom role:
# gcloud iam roles create gcs2bq --organization=ORGANIZATION_ID --file=gcs2bq-custom-role.yaml
#
# To grant the custom role at organization level:
# gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
# --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \
# --role='organizations/ORGANIZATION_ID/roles/gcs2bq'
#
title: "GCS2BQ service account"
description: "GCS2BQ service account"
stage: GA
includedPermissions:
- resourcemanager.projects.get
- storage.buckets.list
- storage.objects.list
- storage.objects.getIamPolicy