This is an example of an end-to-end Snowplow pipeline to track events using a Kafka broker.
The pipeline works in the following way:
- A request is sent to the Scala Collector.
- The raw event (a Thrift event) is put into the `snowplow_raw_good` (or bad) topic.
- The enricher grabs the raw events, parses them and puts them into the `snowplow_parsed_good` (or bad) topic.
- A custom events processor grabs the parsed event, which is in a tab-delimited/JSON hybrid format, and turns it into a proper JSON event using the Python analytics SDK from Snowplow. This event is then put into a final topic called `snowplow_json_event`.
- (WIP) A custom script grabs the final JSON events and loads them into some storage solution (such as BigQuery or Redshift).
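To make the "tab-delimited/JSON hybrid" step more concrete, here is a minimal sketch of the kind of conversion the events processor performs. It is a simplified stand-in, not the Python analytics SDK itself: the field list is a hypothetical, heavily shortened subset, and the function name `enriched_tsv_to_json` is invented for illustration. The real enriched event has many more columns and the SDK handles their names and types.

```python
import json

# Hypothetical, shortened field list (assumption for illustration only).
FIELDS = ["app_id", "platform", "event", "event_id", "contexts"]
# Columns assumed to carry embedded JSON rather than plain strings.
JSON_FIELDS = {"contexts"}


def enriched_tsv_to_json(line, fields=FIELDS, json_fields=JSON_FIELDS):
    """Turn one tab-delimited line into a JSON string.

    Plain columns are copied as strings, columns listed in json_fields
    are parsed as JSON, and empty columns are dropped.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")

    event = {}
    for name, value in zip(fields, values):
        if value == "":
            continue  # skip empty columns
        event[name] = json.loads(value) if name in json_fields else value
    return json.dumps(event, sort_keys=True)
```

In the actual pipeline, a function like this would sit between a consumer reading from `snowplow_parsed_good` and a producer writing to `snowplow_json_event`; lines that fail to parse would typically be logged or routed elsewhere rather than crash the processor.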
Execute:
docker-compose up -d
After that, make sure all the containers are up with `docker-compose ps`. If not, try running the command again.
This command will create all the components and also a simple web application that sends pageviews and other events to the collector.
Please check out the `docker-compose.yml` file for more details.
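Besides `docker-compose ps`, it can be handy to confirm that a service actually accepts connections before sending events. The helper below is a minimal sketch of such a check; the name `wait_for_port` is hypothetical, and which host and port to probe depends on what your `docker-compose.yml` publishes.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made before `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9092)` would probe Kafka's default port, assuming the compose file exposes it on the host under that number.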
The configuration files use the Kafka endpoints provided by kubernetes-kafka. You can configure your own brokers in the collector and enrich configuration files (ConfigMaps).
Assuming you have configured Kafka for all the components:
- Deploy the collector: `kubectl apply -f ./k8s/collector`. This will create the configuration, deployment and a service that uses a Load Balancer to access the collector's endpoint.
- Deploy the enricher: `kubectl apply -f ./k8s/stream-enrich`. This will create the configurations and deployment.
- Build a Docker image for the events processor, upload it to some registry and add it to `k8s/events-processor/deploy.yml`. Check out the files in `k8s/events-processor` before building the image and change the broker configuration in `app.py`.
- Deploy the events processor application using the previous `k8s/events-processor/deploy.yml` file.
- Run the webapp example locally and change the collector's address to the load balancer IP address created for the collector. The collector uses port 80, so remove the port and leave just the IP address.
- Once everything is running, open the webapp and refresh the page a couple of times. Check the logs of the events processor pod to see if the JSON events are created correctly.
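If you want to exercise the deployed collector without the webapp, the sketch below builds a page view request URL by hand. It assumes the collector accepts GET requests on the `/i` path with Snowplow tracker protocol parameters (`e=pv` for a page view); the function name `pageview_url`, the app id and the tracker version string are placeholders invented for this example. Because the collector listens on port 80, the URL carries no explicit port.

```python
from urllib.parse import urlencode


def pageview_url(collector_host, page_url, app_id="example-app"):
    """Build a GET URL that reports a single page view to the collector."""
    params = {
        "e": "pv",             # event type: page view
        "p": "web",            # platform
        "tv": "manual-0.0.1",  # tracker version (placeholder value)
        "aid": app_id,         # application id (placeholder value)
        "url": page_url,       # URL of the page being viewed
    }
    # Port 80 is the HTTP default, so only the host or IP is needed.
    return f"http://{collector_host}/i?{urlencode(params)}"
```

Opening the resulting URL with a browser or any HTTP client, using the load balancer IP as `collector_host`, should produce one raw event that flows through the topics described above and eventually shows up in the events processor logs.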