AWS Security Lake Parquet File Schema Format Issues upon AWS Opensearch Ingestion & AWS Athena Querying #728

Open
m00lav opened this issue Dec 19, 2023 · 6 comments
Labels: kind/bug (Something isn't working), lifecycle/rotten

m00lav commented Dec 19, 2023

Background:

We are using AWS Security Lake to ingest various log sources into OCSF, make this data queryable via AWS Athena, and ingest it into AWS OpenSearch. We are attempting to ingest Falco data by following this article: falcosidekick integration documentation.

Describe the bug:

After following the instructions in the article linked above, we receive Falco data in our Security Lake S3 bucket, and this data is queryable via S3 Select. However, querying the Lake Formation table generated by Security Lake via Athena returns a generic "Unable to Read Parquet File" error. Additionally, we are using an AWS OpenSearch Ingestion pipeline with the Security Lake S3 Parquet OCSF pipeline template. Native Security Lake sources are ingested without error, but ingesting Falco data fails. The error from the OpenSearch Ingestion pipeline (via CloudWatch) is as follows:

java.lang.UnsupportedOperationException: REPEATED not supported outside LIST or MAP. Type: repeated binary types (STRING) = 0

AWS support was contacted regarding this error. The following was their response:

"REPEATED" is a keyword in protobuf. It seems the files are being written from protobufs and the generated schema is not supported by the Avro parquet library used by OS ingestion. The source of this error is https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L303
  
There's some useful information in this stackoverflow post:  
https://stackoverflow.com/questions/72634350/parquetprotowriters-creates-an-unreadable-parquet-file

How to reproduce it:

Expected behaviour:

  • Falco data in security lake will be ingestible without error by an AWS OpenSearch Ingestion Pipeline

Environment:

Falco version

0.36.1 (x86_64) - from docker.io/falcosecurity/falco-no-driver:0.36.1

System info

{
  "machine": "x86_64",
  "nodename": "falco-6sck4",
  "release": "5.10.197-186.748.amzn2.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP Tue Oct 10 00:30:07 UTC 2023"
}

Cloud provider or hardware configuration

AWS EKS - managed nodegroups

OS

FALCO CONTAINER:
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Kernel:

Linux falco-6sck4 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 GNU/Linux

Installation method:

Kubernetes

Additional context:

N/A

@m00lav m00lav added the kind/bug Something isn't working label Dec 19, 2023
@Issif Issif added this to the 2.29.0 milestone Dec 19, 2023
Member

Issif commented Dec 19, 2023

Thanks for this report, I'll work on it asap.

asuresh8 commented Dec 20, 2023

Note that I was the one who suggested this was an issue converting from proto to Parquet. After going through the Parquet library this repo uses to generate the files, it looks like REPEATED is a valid keyword in Parquet. The issue is that REPEATED is not used correctly. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md for a detailed description of how REPEATED should be used. I see an issue in these places:

  • If this field is repeated then OCSFSecurityFinding needs to be in a list or a map. I'm not sure if the top level of a Parquet file counts as a list.

  • If types is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

  • If tags is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

See this tip in the parquet-go library.
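
To make the distinction concrete, here is a minimal sketch in Python of the rule described above. It is not code from falcosidekick, parquet-go, or parquet-mr: the `Field` model and the helpers `render`, `bare_repeated`, and `string_list` are hypothetical, and the enclosing group name `finding` is made up for illustration. Only the field names `types` and `tags` come from this thread. The three-level list layout follows my reading of LogicalTypes.md.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Simplified Parquet schema node (hypothetical model, not a real library type)."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    physical: Optional[str] = None     # e.g. "binary"; None means this node is a group
    annotation: Optional[str] = None   # e.g. "STRING", "LIST", "MAP"
    children: List["Field"] = field(default_factory=list)


def render(node: Field, indent: int = 0) -> str:
    """Render a node in Parquet's textual schema notation."""
    pad = "  " * indent
    note = f" ({node.annotation})" if node.annotation else ""
    if node.physical is not None:
        return f"{pad}{node.repetition} {node.physical} {node.name}{note};"
    body = "\n".join(render(child, indent + 1) for child in node.children)
    return f"{pad}{node.repetition} group {node.name}{note} {{\n{body}\n{pad}}}"


def bare_repeated(node: Field, parent: Optional[Field] = None, path: str = "") -> List[str]:
    """Return paths of repeated fields whose parent is not a LIST or MAP group."""
    here = f"{path}.{node.name}" if path else node.name
    found = []
    wrapped = parent is not None and parent.annotation in ("LIST", "MAP")
    if node.repetition == "repeated" and not wrapped:
        found.append(here)
    for child in node.children:
        found.extend(bare_repeated(child, node, here))
    return found


def string_list(name: str) -> Field:
    """Three-level LIST encoding of a list of strings."""
    element = Field("element", "optional", "binary", "STRING")
    inner = Field("list", "repeated", children=[element])
    return Field(name, "optional", annotation="LIST", children=[inner])


# Shape that triggers the error: repeated primitives directly inside a plain group.
legacy = Field("finding", "optional", children=[
    Field("types", "repeated", "binary", "STRING"),
    Field("tags", "repeated", "binary", "STRING"),
])

# Same data, with each repeated field wrapped in a LIST-annotated group.
fixed = Field("finding", "optional", children=[string_list("types"), string_list("tags")])

print(bare_repeated(legacy))
print(bare_repeated(fixed))
print(render(fixed.children[0]))
```

The first schema reports `finding.types` and `finding.tags` as bare repeated fields, which matches the `repeated binary types (STRING)` in the error message. The second reports none. In falcosidekick the equivalent change would presumably be made in the struct tags used by the Parquet writer, which this sketch does not attempt to reproduce.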

poiana commented Mar 19, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Author

m00lav commented Mar 19, 2024

/remove-lifecycle stale

@Issif Issif self-assigned this Apr 30, 2024
@Issif Issif modified the milestones: 2.29.0, 2.30 Jun 24, 2024

poiana commented Sep 22, 2024

/lifecycle stale

poiana commented Oct 22, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten
