Introduction

KafkaStreams4s is a library for writing Kafka Streams programs while leveraging the cats ecosystem.

Quick example

This example uses the kafka-streams4s-circe module, but the code should look similar even if you do not use circe or even JSON.

For this example, let's consider a use case where we track movies and movie sales and want to find trends across movie genres. Modeling this as Scala case classes, we write the following:

import java.util.UUID

case class Movie(title: String, genre: String)
case class Purchase(movieId: UUID, amount: Int, user: String)

In order for us to use these case classes with kafka-streams4s-circe, we'll need to define Encoders and Decoders for them.

import io.circe.{Encoder, Decoder}

implicit val movieDecoder: Decoder[Movie] = Decoder.forProduct2("title", "genre")(Movie.apply _)
implicit val movieEncoder: Encoder[Movie] = Encoder.forProduct2("title", "genre")(m => (m.title, m.genre))

implicit val purchaseDecoder: Decoder[Purchase] = Decoder.forProduct3("movieId", "amount", "user")(Purchase.apply _)
implicit val purchaseEncoder: Encoder[Purchase] = 
  Encoder.forProduct3("movieId", "amount", "user")(p => (p.movieId, p.amount, p.user))
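
To see what these codecs produce, here is a minimal round-trip sketch using circe's `asJson` syntax and `io.circe.parser.decode`. The wrapper object `CodecRoundTrip` is only an illustrative name, and `Movie` with its codecs is repeated inside it so the snippet stands alone.

```scala
import io.circe.{Decoder, Encoder}
import io.circe.parser.decode
import io.circe.syntax._

// Hypothetical wrapper object, used only to keep this sketch self-contained.
object CodecRoundTrip {
  case class Movie(title: String, genre: String)

  implicit val movieDecoder: Decoder[Movie] =
    Decoder.forProduct2("title", "genre")(Movie.apply _)
  implicit val movieEncoder: Encoder[Movie] =
    Encoder.forProduct2("title", "genre")(m => (m.title, m.genre))

  // Encode a value to compact JSON using the Encoder in scope.
  val json: String = Movie("The Godfather", "Crime").asJson.noSpaces
  // {"title":"The Godfather","genre":"Crime"}

  // Decode it back using the Decoder in scope; failures surface as a Left.
  val roundTrip: Either[io.circe.Error, Movie] = decode[Movie](json)
  // Right(Movie(The Godfather,Crime))
}
```

Records on the topics used below are assumed to have this JSON shape.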

We will assume here that there exist two Kafka topics, example.movies and example.purchases, which emit JSON records. Next, we create a CirceTable for each of the two topics our program uses. A CirceTable[K, V] is a wrapper around a KTable[K, V] that includes all the necessary codecs, so Kafka Streams can serialize and deserialize our records without us having to pass any codecs around explicitly. Simply put, we can create a CirceTable[K, V] for any K and V that have a circe Encoder and Decoder in scope.

import compstak.kafkastreams4s.circe.CirceTable
import org.apache.kafka.streams.StreamsBuilder

val sb = new StreamsBuilder

val movies = CirceTable[UUID, Movie](sb, "example.movies")
val purchases = CirceTable[UUID, Purchase](sb, "example.purchases")

Next, we join the two tables, filter out some genres, and accumulate the results into a map from genre to total purchase amount.

val withoutOther: CirceTable[UUID, Movie] = movies.filter(_.genre != "Other")
val pairs: CirceTable[UUID, (Purchase, Movie)] = 
  purchases.join(withoutOther)(purchase => purchase.movieId)((purchase, movie) => (purchase, movie))

val result: CirceTable[String, Int] =
  pairs.scanWith { case (id, (purchase, movie)) => (movie.genre, purchase.amount) }(_ + _)

Here we first filter out all movies tagged with the "Other" genre. Then we join the purchases table with the filtered movies table, using each purchase's movieId to look up the matching movie. Lastly, we use the scanWith operation to select a new key-value pair (genre and amount) and pass a function that combines the values sharing a key. Now all that's left is to direct the result into an output topic, example.output, and run the program:

import scala.concurrent.ExecutionContext
import cats._, cats.implicits._
import cats.effect._, cats.effect.implicits._
import compstak.kafkastreams4s.Platform
import org.apache.kafka.streams.Topology
import java.util.Properties
import java.time.Duration

val props = new Properties // In real code, add the desired configuration to this object.

val topology: IO[Topology] = result.to[IO]("example.output") >> IO(sb.build())

val main: IO[Unit] = 
  topology.flatMap(topo => Platform.run[IO](topo, props, Duration.ofSeconds(2))).void

compstak.kafkastreams4s.Platform provides a function, run, for running Kafka Streams programs. It takes a topology, Java properties, and a timeout after which the stream threads are shut down.
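
An empty Properties object will not be enough against a real cluster: Kafka Streams requires at least application.id and bootstrap.servers. A minimal sketch of filling them in follows. The object name `ExampleStreamsConfig`, the application id, and the broker address are all illustrative assumptions, not values taken from this project.

```scala
import java.util.Properties

// Hypothetical holder object; the values below are placeholders.
object ExampleStreamsConfig {
  val props: Properties = {
    val p = new Properties()
    // "application.id" is the key behind StreamsConfig.APPLICATION_ID_CONFIG.
    // It identifies the application and is used as the consumer group id.
    p.setProperty("application.id", "movie-genre-trends") // assumed name
    // "bootstrap.servers" is the key behind StreamsConfig.BOOTSTRAP_SERVERS_CONFIG.
    p.setProperty("bootstrap.servers", "localhost:9092") // assumed local broker
    p
  }
}
```

A Properties object built this way could be passed to Platform.run in place of the empty one.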

Testing your topologies

KafkaStreams4s comes with a testing module that allows us to test our Kafka Streams programs without spinning up a Kafka cluster. To start using it, include the kafka-streams4s-testing module in your build.sbt. It's built upon kafka-streams-test-utils and should give us a good amount of confidence in our streams logic.

First we will create a test driver using our topology defined earlier:

import cats.effect.Resource
import compstak.kafkastreams4s.testing.KafkaStreamsTestRunner
import org.apache.kafka.streams.TopologyTestDriver

val driver: Resource[IO, TopologyTestDriver] = 
  Resource.eval(topology).flatMap(KafkaStreamsTestRunner.testDriverResource[IO])

Then we'll set up some inputs for our two topics:

import compstak.kafkastreams4s.circe.CirceCodec

val testMovies = List(
  UUID.fromString("150ac164-a4dd-4809-9e2f-fc092edb9d1d") -> Movie("The Godfather", "Crime"),
  UUID.fromString("1312c871-dd07-43a7-ae7b-3a74e0c9ce6d") -> Movie("Schindler's List", "Drama"),
  UUID.fromString("b6168e9a-a871-4712-a15b-0261edc7c9d2") -> Movie("Being John Malkovich", "Other")
)

val testPurchases = List(
  UUID.fromString("faad4c27-39d6-41df-9d97-102dc5b0bc93") -> Purchase(UUID.fromString("b6168e9a-a871-4712-a15b-0261edc7c9d2"), 2, "JohnDoe42"),
  UUID.fromString("310a790e-7df4-4cbb-91f8-42bcf998b9e5") -> Purchase(UUID.fromString("150ac164-a4dd-4809-9e2f-fc092edb9d1d"), 1, "MarkB98"),
  UUID.fromString("5d00d33e-82c4-418e-b776-46b16bd13508") -> Purchase(UUID.fromString("1312c871-dd07-43a7-ae7b-3a74e0c9ce6d"), 1, "JaneDoe"),
  UUID.fromString("0e36dd2f-e3db-4b1e-af45-52f8bc27d65f") -> Purchase(UUID.fromString("150ac164-a4dd-4809-9e2f-fc092edb9d1d"), 3, "NinaD14"),
)

def pipeIn(driver: TopologyTestDriver): IO[Unit] =
  KafkaStreamsTestRunner.inputTestTable[IO, CirceCodec](driver, "example.movies", testMovies: _*) >>
  KafkaStreamsTestRunner.inputTestTable[IO, CirceCodec](driver, "example.purchases", testPurchases: _*)
  

Next, we'll read the values from the output topic using a different function from the KafkaStreamsTestRunner.

import cats.effect.unsafe.implicits.global

def pipeOut(driver: TopologyTestDriver): IO[Map[String, Int]] =
  KafkaStreamsTestRunner.outputTestTable[IO, CirceCodec, String, Int](driver, "example.output")

driver.use(d => pipeIn(d) >> pipeOut(d)).unsafeRunSync()
// res0: Map[String, Int] = Map("Drama" -> 1, "Crime" -> 4)
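
To see why the output is what it is, the same filter, join and aggregation can be modelled with plain Scala collections. This is only a sketch of the batch result: it does not use KafkaStreams4s, and it ignores everything streaming-specific, such as updates, deletions, and record ordering. The object and function names are hypothetical, and random UUIDs stand in for the fixed ones above.

```scala
import java.util.UUID

// Plain-collections model of the topology's end result (not the library's API).
object GenreTotalsSketch {
  final case class Movie(title: String, genre: String)
  final case class Purchase(movieId: UUID, amount: Int, user: String)

  def genreTotals(
      movies: Map[UUID, Movie],
      purchases: Map[UUID, Purchase]
  ): Map[String, Int] = {
    // Step 1: drop movies tagged "Other".
    val withoutOther = movies.filter { case (_, m) => m.genre != "Other" }
    // Step 2: join each purchase to its movie; purchases without a match drop out.
    val pairs = purchases.values.toList.flatMap { p =>
      withoutOther.get(p.movieId).map(m => (p, m))
    }
    // Step 3: re-key by genre and sum the amounts per genre.
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (p, m)) =>
      acc.updated(m.genre, acc.getOrElse(m.genre, 0) + p.amount)
    }
  }

  private val godfather = UUID.randomUUID()
  private val schindler = UUID.randomUUID()
  private val malkovich = UUID.randomUUID()

  val movies: Map[UUID, Movie] = Map(
    godfather -> Movie("The Godfather", "Crime"),
    schindler -> Movie("Schindler's List", "Drama"),
    malkovich -> Movie("Being John Malkovich", "Other")
  )

  val purchases: Map[UUID, Purchase] = Map(
    UUID.randomUUID() -> Purchase(malkovich, 2, "JohnDoe42"),
    UUID.randomUUID() -> Purchase(godfather, 1, "MarkB98"),
    UUID.randomUUID() -> Purchase(schindler, 1, "JaneDoe"),
    UUID.randomUUID() -> Purchase(godfather, 3, "NinaD14")
  )

  val result: Map[String, Int] = genreTotals(movies, purchases)
  // Map(Crime -> 4, Drama -> 1), in no particular order
}
```

The purchase of "Being John Malkovich" disappears because its movie was filtered out before the join, and the two purchases of "The Godfather" add up to 1 + 3 = 4.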