forked from GoogleCloudPlatform/professional-services
Showing 11 changed files with 780 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Binaries for programs and plugins | ||
*.exe | ||
*.exe~ | ||
*.dll | ||
*.so | ||
*.dylib | ||
|
||
# Test binary, built with `go test -c` | ||
*.test | ||
|
||
# Output of the go coverage tool, specifically when used with LiteIDE | ||
*.out | ||
|
||
# Dependency directories (remove the comment below to include it) | ||
# vendor/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
FROM golang:1.14 | ||
|
||
WORKDIR /go/src/github.com/rosmo/gcs2bq | ||
COPY main.go . | ||
|
||
RUN go get -v ./... | ||
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o /gcs2bq . | ||
|
||
FROM google/cloud-sdk:slim | ||
WORKDIR / | ||
RUN chown -R 1000 /home | ||
COPY --from=0 /gcs2bq . | ||
COPY gcs2bq.avsc . | ||
COPY bigquery.schema . | ||
COPY run.sh . | ||
RUN chmod +x run.sh | ||
CMD ["/run.sh"] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,202 @@ | ||
# Google Cloud Storage to BigQuery | ||
|
||
## Ever wanted to know your organization's average file age or size? | ||
|
||
 | ||
|
||
This small application discovers all buckets in a Google Cloud Platform organization, | ||
then lists all the objects in them and creates an Avro file containing the objects | ||
and their attributes. The file can then be imported into BigQuery. | ||
|
||
### Building | ||
|
||
You can build it either manually, or using the supplied `Dockerfile`: | ||
|
||
```bash | ||
export GOOGLE_PROJECT=your-project | ||
docker build -t eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest . | ||
docker push eu.gcr.io/$GOOGLE_PROJECT/gcs2bq:latest | ||
``` | ||
|
||
### Usage | ||
|
||
```bash | ||
$ ./gcs2bq -help | ||
Google Cloud Storage object metadata to BigQuery, version 0.1 | ||
Usage of ./gcs2bq: | ||
-alsologtostderr | ||
log to standard error as well as files | ||
-buffer_size int | ||
file buffer (default 1000) | ||
-concurrency int | ||
concurrency (GOMAXPROCS) (default 4) | ||
-file string | ||
output file name (default "gcs.avro") | ||
-log_backtrace_at value | ||
when logging hits line file:N, emit a stack trace | ||
-log_dir string | ||
If non-empty, write log files in this directory | ||
-logtostderr | ||
log to standard error instead of files | ||
-stderrthreshold value | ||
logs at or above this threshold go to stderr | ||
-v value | ||
log level for V logs | ||
-versions | ||
include GCS object versions | ||
-vmodule value | ||
comma-separated list of pattern=N settings for file-filtered logging | ||
``` | ||
|
||
You can also use the supplied `run.sh` script, which accepts the following | ||
environment variables as input: | ||
|
||
- `GCS2BQ_PROJECT`: ID of the project that contains the storage bucket and BigQuery dataset | ||
- `GCS2BQ_DATASET`: BigQuery dataset name (e.g. `gcs2bq`) | ||
- `GCS2BQ_TABLE`: BigQuery table name (e.g. `objects`) | ||
- `GCS2BQ_BUCKET`: Bucket for storing the temporary Avro file to be loaded into BigQuery (no `gs://` prefix) | ||
- `GCS2BQ_LOCATION`: Location for the bucket and dataset, used if they need to be created (e.g. `EU`) | ||
- `GCS2BQ_VERSIONS`: Set to a non-empty value if you want to retrieve object versions as well | ||
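
As a minimal sketch of how these variables might be set before calling `run.sh`: the values below are placeholders, and the `check_gcs2bq_env` helper is hypothetical (it is not part of this repository). It assumes that the first four variables are mandatory, which the list above does not state explicitly.

```bash
# Hypothetical pre-flight check, not part of gcs2bq: returns non-zero if any
# variable assumed to be mandatory is unset or empty.
check_gcs2bq_env() {
  missing=0
  for name in GCS2BQ_PROJECT GCS2BQ_DATASET GCS2BQ_TABLE GCS2BQ_BUCKET; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder values; replace them with your own.
export GCS2BQ_PROJECT=your-project
export GCS2BQ_DATASET=gcs2bq
export GCS2BQ_TABLE=objects
export GCS2BQ_BUCKET=your-temp-bucket
export GCS2BQ_LOCATION=EU
# export GCS2BQ_VERSIONS=1   # uncomment to include object versions

check_gcs2bq_env && echo "environment looks complete"
```

If the check passes, `./run.sh` can be started from the same shell so that it inherits the exported variables.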
|
||
## IAM permissions on GCP | ||
|
||
To be able to discover all projects and buckets, the service account that you | ||
run GCS2BQ under should have the following permissions at the organization level: | ||
|
||
- List all projects: `resourcemanager.projects.get` | ||
- List buckets: `storage.buckets.list` | ||
- List objects in bucket: `storage.objects.list` | ||
- Read ACLs from objects in bucket: `storage.objects.getIamPolicy` | ||
|
||
These permissions can be partly granted with the following predefined role (lacks | ||
permission to retrieve ACLs): | ||
|
||
- Storage Object Viewer: `roles/storage.objectViewer` | ||
|
||
There is also a custom role in [gcs2bq-custom-role.yaml](gcs2bq-custom-role.yaml) that | ||
only has the necessary permissions. See the file for instructions. | ||
|
||
To write the data through GCS to BigQuery, you'll need the following roles in the | ||
project that hosts the BigQuery dataset: | ||
|
||
- Storage Admin: `roles/storage.admin` | ||
- BigQuery User: `roles/bigquery.user` | ||
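
One way to grant these two roles is with `gcloud projects add-iam-policy-binding`. The sketch below only prints the commands so they can be reviewed before anything is changed. The `print_grant_commands` function name is hypothetical, and the project ID and service account address are placeholders.

```bash
# Print (rather than run) the gcloud commands that grant both roles.
# Arguments: project ID, service account email.
print_grant_commands() {
  project="$1"
  service_account="$2"
  for role in roles/storage.admin roles/bigquery.user; do
    echo "gcloud projects add-iam-policy-binding $project" \
      "--member=serviceAccount:$service_account" \
      "--role=$role"
  done
}

print_grant_commands your-project gcs2bq@your-project.iam.gserviceaccount.com
```

Once the printed commands look right, they can be piped to `sh` to apply them.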
|
||
### BigQuery schema | ||
|
||
See [bigquery.schema](bigquery.schema) for the BigQuery table schema. The Avro | ||
schema is in [gcs2bq.avsc](gcs2bq.avsc). | ||
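
For a manual import, the generated Avro file could be loaded with the `bq` command-line tool. The sketch below only prints the command. It assumes the default output file name `gcs.avro`, that the file has already been copied to the bucket, and that the `GCS2BQ_*` variables hold your own values. How `run.sh` performs this step is not shown here, so treat this as an illustration rather than a copy of it; `print_load_command` is a hypothetical name.

```bash
# Print the bq command that would load the Avro file from the bucket.
# Avro files embed their own schema, so no schema argument is passed here.
print_load_command() {
  avro_file="${1:-gcs.avro}"
  echo "bq --project_id=$GCS2BQ_PROJECT load --source_format=AVRO" \
    "$GCS2BQ_DATASET.$GCS2BQ_TABLE" \
    "gs://$GCS2BQ_BUCKET/$avro_file"
}

# Placeholder values; replace them with your own.
GCS2BQ_PROJECT=your-project
GCS2BQ_DATASET=gcs2bq
GCS2BQ_TABLE=objects
GCS2BQ_BUCKET=your-temp-bucket

print_load_command
```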
|
||
## Sample BigQuery queries | ||
|
||
### Find average age of files and size of each storage tier | ||
|
||
```sql | ||
SELECT | ||
project_id, | ||
bucket, | ||
ROUND(AVG(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), created, DAY)), 1) AS created_average_days, | ||
SUM( | ||
IF | ||
(storage_class='STANDARD', | ||
size, | ||
0)) AS size_standard, | ||
SUM( | ||
IF | ||
(storage_class='NEARLINE', | ||
size, | ||
0)) AS size_nearline, | ||
SUM( | ||
IF | ||
(storage_class='COLDLINE', | ||
size, | ||
0)) AS size_coldline, | ||
SUM( | ||
IF | ||
(storage_class='ARCHIVE', | ||
size, | ||
0)) AS size_archived | ||
FROM | ||
gcs2bq.files | ||
GROUP BY | ||
project_id, | ||
bucket | ||
``` | ||
|
||
### Build a histogram of how data is distributed across file sizes | ||
|
||
```sql | ||
SELECT | ||
CASE | ||
WHEN histogram_bucket = 1 THEN "<= 1 KB" | ||
WHEN histogram_bucket = 2 THEN "<= 100 KB" | ||
WHEN histogram_bucket = 3 THEN "<= 1 MB" | ||
WHEN histogram_bucket = 4 THEN "<= 100 MB" | ||
WHEN histogram_bucket = 5 THEN "<= 1 GB" | ||
ELSE | ||
"> 1 GB" | ||
END | ||
AS class, | ||
SUM(size) AS total_size | ||
FROM ( | ||
SELECT | ||
size, | ||
CASE | ||
WHEN size <= 1024 THEN 1 | ||
WHEN size <= 1024*100 THEN 2 | ||
WHEN size <= 1024*1024 THEN 3 | ||
WHEN size <= 1024*1024*100 THEN 4 | ||
WHEN size <= 1024*1024*1024 THEN 5 | ||
ELSE | ||
6 | ||
END | ||
AS histogram_bucket | ||
FROM | ||
gcs2bq.files ) | ||
GROUP BY | ||
histogram_bucket | ||
ORDER BY | ||
histogram_bucket ASC | ||
``` | ||
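
The boundaries in the inner `CASE` are inclusive upper bounds, so an object of exactly 1024 bytes lands in bucket 1. The following shell sketch mirrors the same classification and can be used to check where a given size would fall; the `size_bucket` function name is hypothetical.

```bash
# Return the histogram bucket (1-6) for a size in bytes, using the same
# inclusive upper bounds as the SQL CASE expression above.
size_bucket() {
  size="$1"
  if   [ "$size" -le 1024 ]; then echo 1
  elif [ "$size" -le $((1024 * 100)) ]; then echo 2
  elif [ "$size" -le $((1024 * 1024)) ]; then echo 3
  elif [ "$size" -le $((1024 * 1024 * 100)) ]; then echo 4
  elif [ "$size" -le $((1024 * 1024 * 1024)) ]; then echo 5
  else echo 6
  fi
}

size_bucket 1024        # prints 1
size_bucket 1025        # prints 2
size_bucket 2000000000  # prints 6
```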
|
||
### Find owners with most data | ||
|
||
```sql | ||
SELECT | ||
owner, | ||
SUM(size) AS total_size | ||
FROM | ||
gcs2bq.files | ||
GROUP BY | ||
owner | ||
ORDER BY | ||
total_size DESC | ||
``` | ||
|
||
### Find duplicate files across all buckets | ||
|
||
```sql | ||
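-- Objects with identical content share an MD5 hash, so group by md5 | ||
-- (objects without an MD5 hash are skipped). | ||
SELECT | ||
md5, | ||
COUNT(*) AS duplicates, | ||
ARRAY_AGG(CONCAT("gs://", bucket, "/", name)) AS files | ||
FROM | ||
gcs2bq.files | ||
WHERE | ||
md5 != "" | ||
GROUP BY | ||
md5 | ||
HAVING | ||
duplicates > 1 | ||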
``` | ||
|
||
|
||
## Running in GKE as a CronJob | ||
|
||
You can deploy the container as a `CronJob` in Google Kubernetes Engine. See the file | ||
[gcs2bq.yaml](gcs2bq.yaml). Replace the environment variables with values appropriate | ||
for your environment. | ||
|
||
|
||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
[{ | ||
"name": "project_id", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "bucket", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "name", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "content_type", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "content_language", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "cache_control", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "event_based_hold", | ||
"type": "BOOLEAN", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "temporary_hold", | ||
"type": "BOOLEAN", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "retention_expiration_time", | ||
"type": "INTEGER", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "acl", | ||
"type": "RECORD", | ||
"mode": "REPEATED", | ||
"fields": [{ | ||
"name": "key", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "value", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}] | ||
}, { | ||
"name": "predefined_acl", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "owner", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "size", | ||
"type": "INTEGER", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "content_encoding", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "content_disposition", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "md5", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "crc32c", | ||
"type": "INTEGER", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "media_link", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "generation", | ||
"type": "INTEGER", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "metageneration", | ||
"type": "INTEGER", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "storage_class", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "created", | ||
"type": "TIMESTAMP", | ||
"mode": "NULLABLE" | ||
}, { | ||
"name": "deleted", | ||
"type": "TIMESTAMP", | ||
"mode": "NULLABLE" | ||
}, { | ||
"name": "updated", | ||
"type": "TIMESTAMP", | ||
"mode": "NULLABLE" | ||
}, { | ||
"name": "customer_key_sha256", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "kms_key_name", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}, { | ||
"name": "etag", | ||
"type": "STRING", | ||
"mode": "REQUIRED" | ||
}] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# To create the custom role: | ||
# gcloud iam roles create gcs2bq --organization=ORGANIZATION_ID --file=gcs2bq-custom-role.yaml | ||
# | ||
# To grant the custom role at organization level: | ||
# gcloud organizations add-iam-policy-binding ORGANIZATION_ID \ | ||
# --member='serviceAccount:gcs2bq@PROJECT_ID.iam.gserviceaccount.com' \ | ||
# --role='organizations/ORGANIZATION_ID/roles/gcs2bq' | ||
# | ||
title: "GCS2BQ service account" | ||
description: "GCS2BQ service account" | ||
stage: GA | ||
includedPermissions: | ||
- resourcemanager.projects.get | ||
- storage.buckets.list | ||
- storage.objects.list | ||
- storage.objects.getIamPolicy |