There is a list of visit events produced as a time series. For each visit there is a create event followed by a series of update events. Only the most recent update is relevant, and a visit is tracked for up to 1 hour.
Example events:

```json
{
  "messageType": "VisitCreate",
  "visit": {
    "id": "82abce83-3892-48ee-9f1b-d34c4746ace7",
    "userId": "dc0ad841-0b89-4411-a033-d3f174e8d0ad",
    "documentId": "7b2bc74e-f529-4f5d-885b-4377c424211d",
    "createdAt": "2015-04-22T11:42:07.602Z"
  }
}
{
  "messageType": "VisitUpdate",
  "visit": {
    "id": "82abce83-3892-48ee-9f1b-d34c4746ace7",
    "engagedTime": 25,
    "completion": 0.4,
    "updatedAt": "2015-04-22T11:42:35.122Z"
  }
}
```
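The two message shapes above can be modelled as plain case classes. This is a sketch in Scala 3 script style; the type and field names mirror the JSON payloads, but the actual types in the code base may differ.

```scala
import java.time.ZonedDateTime

// Hypothetical model of the two message kinds; names mirror the JSON fields.
sealed trait VisitMessage
final case class VisitCreate(
  id: String,
  userId: String,
  documentId: String,
  createdAt: ZonedDateTime
) extends VisitMessage
final case class VisitUpdate(
  id: String,          // refers to the id of an earlier VisitCreate
  engagedTime: Int,    // seconds engaged so far
  completion: Double,  // fraction of the document read, 0.0 to 1.0
  updatedAt: ZonedDateTime
) extends VisitMessage

// The two example events from above, as values:
val create = VisitCreate(
  "82abce83-3892-48ee-9f1b-d34c4746ace7",
  "dc0ad841-0b89-4411-a033-d3f174e8d0ad",
  "7b2bc74e-f529-4f5d-885b-4377c424211d",
  ZonedDateTime.parse("2015-04-22T11:42:07.602Z"))
val update = VisitUpdate(
  create.id, engagedTime = 25, completion = 0.4,
  ZonedDateTime.parse("2015-04-22T11:42:35.122Z"))
```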
The goal of this application is to read a stream of these events and functionally aggregate them into a list of statistics.
Example statistics:
document | start time | end time | visits | uniques | time | completion |
---|---|---|---|---|---|---|
62e09c7d-714d-40a6-9e6e-fdc525a90d59 | 2015-01-01T11:00Z | 2015-01-01T12:00Z | 1 | 1 | 0.75 | 1 |
a8e5010b-aa0a-44c3-b8b2-6865ca0bac90 | 2015-01-01T10:00Z | 2015-01-01T11:00Z | 2 | 1 | 0.0 | 0 |
a8e5010b-aa0a-44c3-b8b2-6865ca0bac90 | 2015-01-01T11:00Z | 2015-01-01T12:00Z | 1 | 1 | 0.0 | 0 |
62e09c7d-714d-40a6-9e6e-fdc525a90d59 | 2015-01-01T12:00Z | 2015-01-01T13:00Z | 1 | 1 | 0.5 | 0 |
Note: time is in hours.
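A simplified sketch of the aggregation that produces rows like the table above, assuming each visit has already been reduced to its latest update (the real code streams events and handles updates and timeouts; all names here are hypothetical):

```scala
import java.time.ZonedDateTime
import java.time.temporal.ChronoUnit

// A visit already collapsed to its latest update (hypothetical shape).
case class Visit(documentId: String, userId: String,
                 start: ZonedDateTime, engagedSeconds: Int, completion: Double)
case class Stats(visits: Int, uniques: Int, timeHours: Double, completion: Double)

// Fold visits into per-(document, hour) statistics.
def aggregate(visits: Seq[Visit]): Map[(String, ZonedDateTime), Stats] =
  visits
    .groupBy(v => (v.documentId, v.start.truncatedTo(ChronoUnit.HOURS)))
    .map { case (key, vs) =>
      key -> Stats(
        visits = vs.size,
        uniques = vs.map(_.userId).distinct.size,
        timeHours = vs.map(_.engagedSeconds).sum / 3600.0, // time is in hours
        completion = vs.map(_.completion).sum / vs.size)   // average completion
    }

// Two visits by the same user to the same document within one hour
// give visits = 2, uniques = 1, as in the second row of the table.
val doc = "a8e5010b-aa0a-44c3-b8b2-6865ca0bac90"
val rows = aggregate(Seq(
  Visit(doc, "user-1", ZonedDateTime.parse("2015-01-01T10:05:00Z"), 0, 0.0),
  Visit(doc, "user-1", ZonedDateTime.parse("2015-01-01T10:40:00Z"), 0, 0.0)))
```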
Run the integration tests:

```
sbt cucumber
```

Run the unit tests:

```
sbt test
```
The current implementation takes ~90 seconds to process and aggregate ~1 million messages on a single core, which extrapolates to roughly 1 billion messages per day, enough for many use cases.
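The arithmetic behind that extrapolation:

```scala
// ~1,000,000 messages in ~90 seconds on a single core:
val perSecond = 1000000.0 / 90    // about 11,111 messages per second
val perDay    = perSecond * 86400 // about 9.6e8, i.e. roughly 1 billion per day
```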
- Aggregate data into analytics: features/data_processing.feature;
- Stream data IO: features/data_transport.feature.
- There are unhandled edge cases and the buffer implementation is not at its cleanest; the emphasis was on making the basic cases work;
- Stream acquisition could be more generic, not necessarily from a file;
- The output goes to a file; it could go to a storage backend instead;
- Analytics persistence and querying are missing;
- If a visit spans two hours (e.g. starts at 11:50, ends at 12:20), it is counted only towards the first hour (11:00 to 12:00), even though the stats include the later hour (12:00 to 13:00);
- A single incorrect message kills the whole stream;
- Analytics calculation is not parallel;
- Dates are not handled cleanly: some places use ZonedDateTime, others convert it to Long and back;
- Timeout handling is scattered across the code base; it could be concentrated in the buffer instead;
- Only the success path is considered in many places.
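As a sketch of how the poison-message and success-path-only points could be addressed, malformed messages can be decoded into a `Try` and dropped instead of failing the whole stream. `parseMessage` below is a hypothetical stand-in for the application's real decoder:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical decoder: accepts anything that looks like a JSON object,
// fails otherwise. The real application would parse full
// VisitCreate/VisitUpdate messages here.
def parseMessage(line: String): Try[String] =
  if (line.trim.startsWith("{")) Success(line)
  else Failure(new IllegalArgumentException(s"malformed message: $line"))

// Drop malformed messages instead of killing the stream; a real
// implementation would log or count the failures rather than discard them.
def robustParse(lines: Iterator[String]): Iterator[String] =
  lines.flatMap(parseMessage(_).toOption)
```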
Call the main class with these parameters:
- relative input file path;
- relative output file path.

```
sbt "runMain com.spikerlabs.streamingapp.App testdata/test-visit-messages.log testdata/test-visit-messages-result.log"
```