Question - Data retention and Bucket Lifecycle policies #2

Open
dlethin opened this issue Nov 4, 2020 · 1 comment

@dlethin

dlethin commented Nov 4, 2020

Thanks for sharing this project and the extensive blog post behind it. I've got a few questions:

While I've manually set up Athena to make ad hoc queries against CloudTrail buckets on an as-needed basis, we're considering our options for automating this so that our CloudTrail logs are searchable in Athena on a daily basis. Your approach sounds interesting, and even more appealing since we use Terraform. I'm trying to catch up and learn about AWS Glue, as I have no direct experience with it or with the Parquet format.

I'm curious how this solution is affected by the number of accounts/regions in the org writing CloudTrail logs to the raw source bucket, and by the number of days this data is kept in the bucket before being purged by a lifecycle_rule. For example, what if we keep 6 months of log files for 30+ accounts with activity in 5 different regions? (I'm not exactly sure what our final retention policy will be; we're still working that out.)

Does the crawler only crawl new objects uploaded since the last run, or does it need to crawl the entire bucket every day? What happens when objects expire via the lifecycle rule in the source bucket? Would the objects in the bucket holding the transformed parquet files get purged automatically, or would a lifecycle rule need to be written on the target bucket as well? Are the table partitions actively pruned to keep up with dates that have expired as a result of the lifecycle rule?

I will try to schedule some time in the coming week or two to try this out and that might shed light on my questions.

@BigDataDaddy

With respect to data retention and an S3 bucket lifecycle rule removing transformed data that has already been crawled: that's going to depend on the configuration choices you make for the crawler's behavior. See this doc page for the configuration options: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html
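
To make those configuration choices a bit more concrete, here is a rough sketch in Python, not taken from this project. It only builds the request payloads: one for the Glue UpdateCrawler call (the `RecrawlPolicy` and `SchemaChangePolicy` settings that govern incremental crawls and what happens to catalog entries for deleted objects) and one for an S3 lifecycle configuration. The crawler name and the 180-day figure are placeholders chosen for illustration. As far as I understand the Glue docs, incremental crawling forces both schema-change behaviors to `LOG`, so treat the trade-off in the comments as an assumption to verify against your own setup. Also note that an S3 lifecycle rule only applies to the bucket it is attached to, so the bucket holding the Parquet files would need its own rule.

```python
from typing import Any, Dict


def crawler_update_kwargs(crawler_name: str, incremental: bool) -> Dict[str, Any]:
    """Build keyword arguments for the Glue UpdateCrawler API call."""
    if incremental:
        # Incremental crawl: only S3 folders added since the last run are crawled.
        # Assumption (from the Glue docs): both schema-change behaviors must be
        # LOG in this mode, so the crawler will not drop catalog entries for
        # data that a lifecycle rule has expired.
        recrawl = "CRAWL_NEW_FOLDERS_ONLY"
        update_behavior, delete_behavior = "LOG", "LOG"
    else:
        # Full crawl: every run covers the whole target path, and catalog
        # entries for objects that have disappeared can be removed.
        recrawl = "CRAWL_EVERYTHING"
        update_behavior, delete_behavior = "UPDATE_IN_DATABASE", "DELETE_FROM_DATABASE"
    return {
        "Name": crawler_name,
        "RecrawlPolicy": {"RecrawlBehavior": recrawl},
        "SchemaChangePolicy": {
            "UpdateBehavior": update_behavior,
            "DeleteBehavior": delete_behavior,
        },
    }


def expiration_rule(days: int, prefix: str = "") -> Dict[str, Any]:
    """Build an S3 lifecycle configuration that expires objects after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder crawler name and retention period, for illustration only.
    print(crawler_update_kwargs("cloudtrail-parquet-crawler", incremental=True))
    print(expiration_rule(180))
```

With boto3 installed and credentials configured, the first dict could be passed as `boto3.client("glue").update_crawler(**kwargs)` and the second as the `LifecycleConfiguration` argument of `boto3.client("s3").put_bucket_lifecycle_configuration(...)`. The same settings can also be expressed in Terraform, which is probably the better fit for this project.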
