diff --git a/python/docs/_static/css/styles.css b/python/docs/_static/css/styles.css index 8e6962976..b2814e2ac 100644 --- a/python/docs/_static/css/styles.css +++ b/python/docs/_static/css/styles.css @@ -65,4 +65,19 @@ body.quarto-dark .highlight dt { .bold-italic { font-weight: bolder; font-style: italic; -} \ No newline at end of file +} + +figcaption { + text-align: center; +} + +.quarto-grid-item { + border-radius: 10px; +} + +.card-img-top img { + border-top-right-radius: 10px; + border-top-left-radius: 10px; + border-bottom-left-radius: 0px; + border-bottom-right-radius: 0px; +} diff --git a/python/docs/_static/images/blog/6-factors_banner.png b/python/docs/_static/images/blog/6-factors_banner.png new file mode 100644 index 000000000..8d84597ba Binary files /dev/null and b/python/docs/_static/images/blog/6-factors_banner.png differ diff --git a/python/docs/_static/images/blog/6-factors_development-and-velocity.png b/python/docs/_static/images/blog/6-factors_development-and-velocity.png new file mode 100644 index 000000000..88858f0ef Binary files /dev/null and b/python/docs/_static/images/blog/6-factors_development-and-velocity.png differ diff --git a/python/docs/_static/images/blog/announcing-kaskada-oss_banner.png b/python/docs/_static/images/blog/announcing-kaskada-oss_banner.png new file mode 100644 index 000000000..040365d7d Binary files /dev/null and b/python/docs/_static/images/blog/announcing-kaskada-oss_banner.png differ diff --git a/python/docs/_static/images/blog/data-engineers-need-kaskada_banner.png b/python/docs/_static/images/blog/data-engineers-need-kaskada_banner.png new file mode 100644 index 000000000..88bdc849d Binary files /dev/null and b/python/docs/_static/images/blog/data-engineers-need-kaskada_banner.png differ diff --git a/python/docs/_static/images/blog/event-processing-and-time-calculations_banner.png b/python/docs/_static/images/blog/event-processing-and-time-calculations_banner.png new file mode 100644 index 000000000..f2d587225 Binary files /dev/null and b/python/docs/_static/images/blog/event-processing-and-time-calculations_banner.png differ diff --git a/python/docs/_static/images/blog/hyper-personalization_banner.png b/python/docs/_static/images/blog/hyper-personalization_banner.png new file mode 100644 index 000000000..280b73008 Binary files /dev/null and b/python/docs/_static/images/blog/hyper-personalization_banner.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/aggregation.svg b/python/docs/_static/images/blog/introducing-timelines/aggregation.svg new file mode 100644 index 000000000..952511b9a --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/aggregation.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/continuous.png b/python/docs/_static/images/blog/introducing-timelines/continuous.png new file mode 100644 index 000000000..a1d49bec6 Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/continuous.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/data_windows_1.svg b/python/docs/_static/images/blog/introducing-timelines/data_windows_1.svg new file mode 100644 index 000000000..088ad7e3d --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/data_windows_1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/data_windows_2.svg 
b/python/docs/_static/images/blog/introducing-timelines/data_windows_2.svg new file mode 100644 index 000000000..18556e759 --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/data_windows_2.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/data_windows_3.svg b/python/docs/_static/images/blog/introducing-timelines/data_windows_3.svg new file mode 100644 index 000000000..bfb488ed9 --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/data_windows_3.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/discrete.png b/python/docs/_static/images/blog/introducing-timelines/discrete.png new file mode 100644 index 000000000..1471d43a4 Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/discrete.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/lifecycle.png b/python/docs/_static/images/blog/introducing-timelines/lifecycle.png new file mode 100644 index 000000000..9fe262d7c Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/lifecycle.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/part-1_banner.png b/python/docs/_static/images/blog/introducing-timelines/part-1_banner.png new file mode 100644 index 000000000..d122c769f Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/part-1_banner.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/part-2_banner.png b/python/docs/_static/images/blog/introducing-timelines/part-2_banner.png new file mode 100644 index 000000000..f115febc6 Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/part-2_banner.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/stream_abstraction_history.png b/python/docs/_static/images/blog/introducing-timelines/stream_abstraction_history.png new file mode 100644 index 000000000..4119b5fdc Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/stream_abstraction_history.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/temporal_join_1.svg b/python/docs/_static/images/blog/introducing-timelines/temporal_join_1.svg new file mode 100644 index 000000000..aac217a67 --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/temporal_join_1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/temporal_join_2.svg b/python/docs/_static/images/blog/introducing-timelines/temporal_join_2.svg new file mode 100644 index 000000000..5ee11ea92 --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/temporal_join_2.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/introducing-timelines/timeline_history.png b/python/docs/_static/images/blog/introducing-timelines/timeline_history.png new file mode 100644 index 000000000..0d89bf677 Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/timeline_history.png differ diff --git a/python/docs/_static/images/blog/introducing-timelines/timeline_snapshot.png b/python/docs/_static/images/blog/introducing-timelines/timeline_snapshot.png new file mode 100644 index 000000000..6ca05c438 Binary files /dev/null and b/python/docs/_static/images/blog/introducing-timelines/timeline_snapshot.png differ diff --git 
a/python/docs/_static/images/blog/introducing-timelines/windowed.svg b/python/docs/_static/images/blog/introducing-timelines/windowed.svg new file mode 100644 index 000000000..10feb5ed7 --- /dev/null +++ b/python/docs/_static/images/blog/introducing-timelines/windowed.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/python/docs/_static/images/blog/kaskada-understands-time_banner.png b/python/docs/_static/images/blog/kaskada-understands-time_banner.png new file mode 100644 index 000000000..15cb95a9a Binary files /dev/null and b/python/docs/_static/images/blog/kaskada-understands-time_banner.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_banner.png b/python/docs/_static/images/blog/ml-for-data-engineers_banner.png new file mode 100644 index 000000000..ae50c7410 Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_banner.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_coefficents.png b/python/docs/_static/images/blog/ml-for-data-engineers_coefficents.png new file mode 100644 index 000000000..4681cd5dd Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_coefficents.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_outlier.png b/python/docs/_static/images/blog/ml-for-data-engineers_outlier.png new file mode 100644 index 000000000..8a35d998a Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_outlier.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_process-1.png b/python/docs/_static/images/blog/ml-for-data-engineers_process-1.png new file mode 100644 index 000000000..4d39bf485 Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_process-1.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_process-2.png b/python/docs/_static/images/blog/ml-for-data-engineers_process-2.png new file mode 100644 index 000000000..53e237b78 Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_process-2.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_regression-1.png b/python/docs/_static/images/blog/ml-for-data-engineers_regression-1.png new file mode 100644 index 000000000..fb8c491b6 Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_regression-1.png differ diff --git a/python/docs/_static/images/blog/ml-for-data-engineers_regression-2.png b/python/docs/_static/images/blog/ml-for-data-engineers_regression-2.png new file mode 100644 index 000000000..2909ca982 Binary files /dev/null and b/python/docs/_static/images/blog/ml-for-data-engineers_regression-2.png differ diff --git a/python/docs/_static/images/blog/model-contexts-affect-behavior_banner.png b/python/docs/_static/images/blog/model-contexts-affect-behavior_banner.png new file mode 100644 index 000000000..1a75e7d1f Binary files /dev/null and b/python/docs/_static/images/blog/model-contexts-affect-behavior_banner.png differ diff --git a/python/docs/_static/images/blog/new-kaskada_banner.png b/python/docs/_static/images/blog/new-kaskada_banner.png new file mode 100644 index 000000000..d99c6a5d2 Binary files /dev/null and b/python/docs/_static/images/blog/new-kaskada_banner.png differ diff --git a/python/docs/_static/images/blog/solving-gaming-challenges_banner.png b/python/docs/_static/images/blog/solving-gaming-challenges_banner.png new file mode 100644 index 000000000..928d4e0f2 Binary files 
/dev/null and b/python/docs/_static/images/blog/solving-gaming-challenges_banner.png differ diff --git a/python/docs/_static/images/blog/solving-gaming-challenges_revenue.png b/python/docs/_static/images/blog/solving-gaming-challenges_revenue.png new file mode 100644 index 000000000..e9bd376ce Binary files /dev/null and b/python/docs/_static/images/blog/solving-gaming-challenges_revenue.png differ diff --git a/python/docs/_static/images/blog/stores-vs-engines_banner.png b/python/docs/_static/images/blog/stores-vs-engines_banner.png new file mode 100644 index 000000000..bb8332c65 Binary files /dev/null and b/python/docs/_static/images/blog/stores-vs-engines_banner.png differ diff --git a/python/docs/_static/images/blog/time-centric-design_banner.png b/python/docs/_static/images/blog/time-centric-design_banner.png new file mode 100644 index 000000000..bef3462cc Binary files /dev/null and b/python/docs/_static/images/blog/time-centric-design_banner.png differ diff --git a/python/docs/_static/images/blog/time-centric-design_database-table.png b/python/docs/_static/images/blog/time-centric-design_database-table.png new file mode 100644 index 000000000..4ef313ea1 Binary files /dev/null and b/python/docs/_static/images/blog/time-centric-design_database-table.png differ diff --git a/python/docs/_static/images/blog/timeline-abstractions_banner.png b/python/docs/_static/images/blog/timeline-abstractions_banner.png new file mode 100644 index 000000000..c6e02db1b Binary files /dev/null and b/python/docs/_static/images/blog/timeline-abstractions_banner.png differ diff --git a/python/docs/_static/images/blog/unified-feature-engine_banner.png b/python/docs/_static/images/blog/unified-feature-engine_banner.png new file mode 100644 index 000000000..48c71c7d4 Binary files /dev/null and b/python/docs/_static/images/blog/unified-feature-engine_banner.png differ diff --git a/python/docs/_static/images/blog/unified-feature-engine_configurable-1.png b/python/docs/_static/images/blog/unified-feature-engine_configurable-1.png new file mode 100644 index 000000000..6409712de Binary files /dev/null and b/python/docs/_static/images/blog/unified-feature-engine_configurable-1.png differ diff --git a/python/docs/_static/images/blog/unified-feature-engine_configurable-2.png b/python/docs/_static/images/blog/unified-feature-engine_configurable-2.png new file mode 100644 index 000000000..4751f58a8 Binary files /dev/null and b/python/docs/_static/images/blog/unified-feature-engine_configurable-2.png differ diff --git a/python/docs/_static/images/blog/unified-feature-engine_training-examples-1.png b/python/docs/_static/images/blog/unified-feature-engine_training-examples-1.png new file mode 100644 index 000000000..1d5bb36f2 Binary files /dev/null and b/python/docs/_static/images/blog/unified-feature-engine_training-examples-1.png differ diff --git a/python/docs/_static/images/blog/unified-feature-engine_training-examples-2.png b/python/docs/_static/images/blog/unified-feature-engine_training-examples-2.png new file mode 100644 index 000000000..7cff88f04 Binary files /dev/null and b/python/docs/_static/images/blog/unified-feature-engine_training-examples-2.png differ diff --git a/python/docs/_static/images/blog/why-kaskada_banner.png b/python/docs/_static/images/blog/why-kaskada_banner.png new file mode 100644 index 000000000..e72eeb206 Binary files /dev/null and b/python/docs/_static/images/blog/why-kaskada_banner.png differ diff --git a/python/docs/_variables.yml b/python/docs/_variables.yml new file mode 100644 index 
000000000..cbd0cf8fe --- /dev/null +++ b/python/docs/_variables.yml @@ -0,0 +1 @@ +slack-join-url: https://join.slack.com/t/kaskada-hq/shared_invite/zt-1t1lms085-bqs2jtGO2TYr9kuuam~c9w \ No newline at end of file diff --git a/python/docs/blog/index.qmd b/python/docs/blog/index.qmd index e2c74b6fd..86e66df01 100644 --- a/python/docs/blog/index.qmd +++ b/python/docs/blog/index.qmd @@ -3,10 +3,17 @@ title: Blog listing: contents: posts sort: "date desc" - type: default + type: grid categories: true sort-ui: true filter-ui: true - fields: ["title", "date", "author", "subtitle", "description", "reading-time", "categories"] # "image", "image-alt", + # the commented out fields are the remaining potential fields to include in the listing + fields: ["title", "date", "author", "subtitle", "reading-time", "categories", "image", "image-alt"] # "description" + image-height: auto page-layout: full +format: + html: + grid: + margin-width: 150px + body-width: 1000px --- diff --git a/python/docs/blog/posts/2022-04-15-ml-for-data-engineers.qmd b/python/docs/blog/posts/2022-04-15-ml-for-data-engineers.qmd new file mode 100644 index 000000000..e746df7d5 --- /dev/null +++ b/python/docs/blog/posts/2022-04-15-ml-for-data-engineers.qmd @@ -0,0 +1,86 @@ +--- +image: /_static/images/blog/ml-for-data-engineers_banner.png +title: Machine Learning for Data Engineers +author: Ben Chambers +date: "2022-04-15" +draft: true +--- + +As a data engineer, it is helpful to have a high-level understanding of machine learning to know how best to support data scientists in creating great models. This article explains some of the terms and how the high-level process works. My experience is as a software engineer and a data engineer, so I’ll explain things as I understand them, which may resonate with your own experience. This understanding developed over several years of working closely with some amazing data scientists, who have also helped review this for the correct usage of terms and concepts. You probably won’t be able to create a model after reading this, but you will better understand how you can help data scientists create a model. + +## What is a Model? + +Let’s start with a straightforward model you’re probably familiar with: linear regression. Given many data points, we can plot them and attempt to fit a line to them. + +![](/_static/images/blog/ml-for-data-engineers_outlier.png) + +On the right, we see many data points and a line that we may have fit to that data. You may already be familiar with the term “outlier,” meaning a point that doesn’t fit the rest of the data. Real data is often messy, and outliers aren’t unexpected. + +When training a model, each of these points is referred to as a “training example.” Each training example has an _x_ and _y_ value and represents an example of what we’re training a model to predict. Once we have a model, we can ask: “if a new point has x=3.7, what do we predict the value of _y_ will be?” Based on the line we fit (y = x/2 + 1), we would predict y = 3.7/2 + 1 = 2.85. + +![](/_static/images/blog/ml-for-data-engineers_coefficents.png) + +In this model, we may say that _x_ is a _predictor feature_ – a value used to make the prediction – and _y_ is the _target feature_ – the value we are predicting. Most models use more than a single predictor, but before we get there, let’s look at another set of data with just two dimensions. + +Do you think linear regression will produce a good model given the training examples here? + +![](/_static/images/blog/ml-for-data-engineers_regression-1.png) + +Of course not!
The points don’t make a line. Instead, we’d need to fit a curve to the data. For instance: ax² + bx + c = y. + +![](/_static/images/blog/ml-for-data-engineers_regression-2.png)
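+To make curve fitting concrete, here is a minimal sketch using NumPy; the data and coefficients are invented for illustration:
+
+```python
+import numpy as np
+
+# Toy data roughly following y = 0.5*x^2 - x + 2, plus some noise.
+rng = np.random.default_rng(seed=0)
+x = np.linspace(-3, 3, 50)
+y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=x.size)
+
+# Fit a degree-2 polynomial: np.polyfit finds a, b, c minimizing squared error.
+a, b, c = np.polyfit(x, y, deg=2)
+
+# Use the fitted curve to predict y for a new x.
+x_new = 3.7
+print(a * x_new**2 + b * x_new + c)
+```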
+We’re almost done, but so far we’ve only talked about predicting one value from another. How could we use this to predict future user behavior? It all depends on how we compute the values. If the predictor (x) was computed in January and the target (y) was computed in February, then we _have_ trained a model which predicts behavior one month ahead of the values it is applied to. This is why working with time and historical data is important for creating training examples that lead to predictive models. + +Even with these simple models, we have seen a few important aspects of machine learning: + +1. Different problems may be solved better by different kinds of models. + +2. By looking at (visualizing) the data, we can gain useful insights. + +3. The quality of the model depends on the quality of the training data. + +4. We can create models that predict the future by creating training examples with time between predictors and targets. + + +Models often differ from these simple examples by having more than a single predictor feature. Imagine that the function being fit takes more than a single _x_ value. Instead, it takes a _vector_ containing many different predictor features. This is where the term _feature vector_ comes from – it’s simply a vector of feature values. For instance, training a model to predict future activity may use the number of movies watched in several different categories as the features. + +Another way that models often differ is by using a more sophisticated model and more sophisticated training techniques. These can efficiently discover more complex relationships between many predictor features and the target value. + +So while two dimensions may make it seem straightforward, most models use many predictors – tens to hundreds. Humans cannot easily visualize the relationship between more than three dimensions, so it is difficult to understand these more sophisticated models. While there may be many possible features to consider, finding the right set is difficult. And it isn’t as easy as providing all of them to the model – that often significantly increases the time it takes to train the model, since the model needs to make sense of the noise. + +## What is the ML Process? + +What surprised me about the ML process is how much iteration is involved. When adding functionality to a software project, I start with an idea of what I want to add. I may have requirements or mocks. Depending on the scope, I may begin by writing tests right away or designing the different parts I need to build. But the data science process involves more experimentation – it is more like debugging than writing code! + +The data scientist may hypothesize that some feature or information about the user will improve the quality of the model—the accuracy of predictions. For example, maybe they think knowing how many hours someone has been awake will help predict whether they’ll order coffee or water. So, they run an experiment to test that hypothesis. This experiment involves figuring out how to get that data and compute training examples including the new feature, training a new model with the additional feature, and seeing if the results are better than the previous model. + +Of course, there are established techniques for ensuring quality results from the experiment—separating the data used for training and validation, etc. These are important for doing the process correctly and avoiding problems such as overfitting. The data scientist performs these steps as part of creating and evaluating the model, so we won’t explore them deeply now. The main takeaway is that the feature engineering process may be repeated many times to produce a good model. It may involve trying hundreds of different features. + +This is the first loop of the ML process. It should be a fast process, allowing the data scientist to quickly test many different hypotheses to identify the most useful features for creating the model. + +![](/_static/images/blog/ml-for-data-engineers_process-1.png) + +At each step of the process, the data and features may be visualized to get a sense of how things are distributed. But creating the model isn’t the only iterative process. Often, the full process looks more like the diagram below – one loop involves creating the features and the corresponding model, as we just saw. + +At times, Feature Engineering requires wrangling and analyzing new sources of data. This may be data that is already available at your organization but which the data scientist hasn’t used before, or it may involve acquiring or collecting additional data. + +Finally, once the model is created, it needs to be deployed to production and the actual results evaluated. This may involve measuring the error between actual and predicted values or looking at the impact of the model on user behaviors. + +![](/_static/images/blog/ml-for-data-engineers_process-2.png) + +After it has been deployed, a model may need to be further refined. Even if it initially had good results, the data (and user behavior) may drift over time, requiring retraining or even creating new features for the model to adapt to the changes. + +## Feature Engineering + +Creating features involves working with the data to compute the feature values for each training example. Each training example is created from a specific entity for which predictions may be made. In many cases, the entity may be specific users, but in other applications of ML it could be a specific vendor, ATM, etc. Generally, the entity is the noun the predictions are being made for. To ensure the examples are representative, they are often computed for many different entities. + +As noted, Feature Engineering is an iterative, experimental process. It may use familiar techniques such as filtering & aggregation to define the features, but it also has special techniques. For instance, there are a variety of ways of encoding categorical data – what I would consider an enum. It is often undesirable to assign each category a number, since that may confuse the model – it is likely insignificant that Delaware is the 1st state and Hawaii the 50th state.
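+One common technique is one-hot encoding, sketched below with pandas (the values are invented for illustration):
+
+```python
+import pandas as pd
+
+# A categorical column: the numeric order of states carries no meaning,
+# so rather than mapping Delaware -> 1 and Hawaii -> 50, give each
+# category its own 0/1 indicator column.
+df = pd.DataFrame({"state": ["DE", "HI", "DE", "CA"]})
+encoded = pd.get_dummies(df, columns=["state"])
+print(encoded)  # columns: state_CA, state_DE, state_HI
+```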
+Another concern during this process is leakage. Leakage happens when information related to the target (the y values in the simple linear regression example) is used as one of the predictors. In this case, the model may seem good in training and validation, where this data is available, but perform poorly in production when it is not. One form of leakage that is easy to introduce accidentally is temporal leakage. If you’re using information from one point in time to predict the value at some later point in time, then your predictor feature should not include anything that happened after the first point in time. + +## Conclusion + +Much more could be said on these topics, but I wanted to keep this brief. Hopefully, this provides some insight into the ML terms and high-level process used by data scientists. As data engineers, we must recognize that the Feature Engineering process is iterative. Anything we can do to make it easier for data scientists to iterate quickly has significant benefits. Further, it is useful to understand how data scientists need to work with temporal and historical data to create training examples. These are two things to keep in mind as you create and deploy new data processing tools for the data scientists at your organization. + +Stay tuned for more information on how we can help data scientists create better models and deliver business value faster! \ No newline at end of file diff --git a/python/docs/blog/posts/2022-05-01-unified-feature-engine.qmd b/python/docs/blog/posts/2022-05-01-unified-feature-engine.qmd new file mode 100644 index 000000000..8d3383498 --- /dev/null +++ b/python/docs/blog/posts/2022-05-01-unified-feature-engine.qmd @@ -0,0 +1,70 @@ +--- +image: /_static/images/blog/unified-feature-engine_banner.png +title: Why you need a unified feature engine, not a unified compute model +author: "" +date: "2022-05-01" +draft: true +--- + +Many data systems claim to provide a “unified model for batch and streaming” – Apache Spark, Apache Flink, Apache Beam, etc. This is an exciting promise because it suggests a pipeline may be written once and used for both batch processing of historical data and streaming processing of new data. Unfortunately, there is often a significant gap between this promise and reality. + +What these “unified” data systems provide is a toolbox of data manipulation operations which may be applied to a variety of sources and sinks and run as a one-time batch job or as an online streaming job. Depending on the framework, certain sources may not work or may behave in unexpectedly different ways across execution modes. Using the components in this toolbox, it is possible to write a batch pipeline or a streaming pipeline. In much the same way, it is possible to use Java to write a web server or an Android App – so Java is a “unified toolkit for Web and Android”. + +Let’s look at some of the functionality that is necessary to compute features for training and applying a machine learning model. We’d like to provide the following: + +1. Training: Compute features from the historic data at specific points in time to train the model. This should be possible to do during iterative development of the model as well as when rebuilding the model to incorporate updated training data. +2. Serving: Maintain the up-to-date feature values in some fast, highly available storage such as Redis or DynamoDB so they may be served for applying the model in production. + +We’ll look at a few of the problems you’re likely to encounter. First, how to deal with the fact that training uses examples drawn from a variety of times while serving uses the latest values. Second, how to update the streaming pipeline when feature definitions change. Third, how to support rapid iteration while experimenting with new features. Finally, we’ll talk about how a feature engine provides a complete solution to the problems of training and serving features. + +If you haven’t already, it may be helpful to read [Machine Learning for Data Engineers](2022-04-15-ml-for-data-engineers.html). It provides an overview of the Machine Learning terms and high-level process we’ll be discussing.
+## Training Examples and Latest Values + +One of the largest differences between training and serving is the shape of the output. Each training example should include the computed predictor features as well as the target feature. In cases where the model is being trained to predict the future, the predictor features must be computed at the time the prediction would be made, and the target feature at the point in the future where the result is known. Using any information from the time after the prediction is made would result in _temporal leakage_, and produce models that look good in training but perform poorly in practice, since that future information won’t be available. On the other hand, when applying the model, the current values of each of the predictor features should be used. + +This creates several problems to solve when working with the tools in our data processing toolbox. First, we need to figure out how to compute the values at the appropriate times for the training examples. This is surprisingly difficult since most frameworks treat events as an unordered bag and aggregate over all of them. Second, we need to make this behavior configurable, so that the same logic can be used to compute training examples and current values. + +## Computing Training Examples + +Computing training examples poses an interesting challenge for most data processing tools. Specifically, given event-based data for multiple entities – for example, users to make predictions for – we may want to create training examples 3 months before a user churns. Since each user churns at a different point in time, this requires creating training examples for each user at different points in time. This is the first challenge: ensuring that the training example for each user is computed using only the data that is available at the point in time the prediction would be made. The training example for Alice may need to be created using data up to January 5th, while the training example for Bob may need to be created using data up to February 7th. In a Spark pipeline, for instance, this may require first selecting the time for each user, then filtering out data based on the user and timestamp, and then actually computing the aggregations. + +![](/_static/images/blog/unified-feature-engine_training-examples-1.png)
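+In a dataframe engine, that select-then-filter-then-aggregate sequence looks roughly like the sketch below (pandas, with invented columns and cutoff dates):
+
+```python
+import pandas as pd
+
+# Events for many users, plus each user's prediction time
+# (e.g., 3 months before that user churned).
+events = pd.DataFrame({
+    "user": ["alice", "alice", "bob", "bob"],
+    "time": pd.to_datetime(["2022-01-02", "2022-01-10", "2022-02-01", "2022-02-09"]),
+    "amount": [10.0, 20.0, 5.0, 7.0],
+})
+cutoffs = pd.DataFrame({
+    "user": ["alice", "bob"],
+    "cutoff": pd.to_datetime(["2022-01-05", "2022-02-07"]),
+})
+
+# Keep only the events visible at each user's cutoff, then aggregate.
+visible = events.merge(cutoffs, on="user")
+visible = visible[visible["time"] <= visible["cutoff"]]
+features = visible.groupby("user")["amount"].agg(["count", "sum"])
+print(features)
+```
+
+As we’ll see next, this per-user filtering is exactly the simple strategy that breaks down once features must look up values for other entities.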
+Once we’ve solved this challenge, things may get more difficult. Imagine we want to use information from the user’s social network as part of the prediction. For instance, we want to look up values from the 5 most frequent contacts, since we suspect that if they quit then the user is more likely to quit. Now we have a problem – when we compute features for Bob (using data up to February 7th), it may look up features for Alice. In this case, those computed values for Alice should include data up to February 7th. We see that the simple strategy of filtering events per user doesn’t work. If we’re using Spark, we may need to copy all events from Alice to Bob and then filter – but this rapidly leads to a data explosion. An alternative would be to try to process data ordered by time – but this quickly leads to poor performance, since Spark and others aren’t well suited to computing values at each point in time. + +![](/_static/images/blog/unified-feature-engine_training-examples-2.png) + +Providing the ability to compute training examples at specific points in time requires deeper support for ordered processing and time travel than is provided by existing compute models. A Feature Engine must provide support for computing features (including joins) at specific, data-dependent points in time. + +## Making it Configurable + +In addition to computing training examples, we need to be able to use the same logic for computing the current values when applying the model. If we’re using Spark, this may mean packaging the logic for computing the prediction features into library methods, and then having two pipelines – one which runs on historic data (at specific, filtered points in time) to produce training examples, and another which applies the same logic to a stream of data. This does require codifying some of the training example selection in the first pipeline – which is something that Data Scientists are likely to want control over, since the choice of training examples affects the quality of the model. So, achieving configurability by baking it into the pipeline creates some difficulties for changing the training example selection. We’ll also see that having the streaming pipeline for the latest values creates problems with updating the pipeline, discussed in more detail in the next section. A Feature Engine must provide support for computing training examples from historic data and the latest results from a combination of historic and streaming data. Ideally, the choice of training examples is easy for Data Scientists to configure. + +## Updating Stateful Pipelines + +Streaming pipelines are useful for maintaining the latest feature values for the purposes of applying the model. Unfortunately, streaming pipelines which compute aggregations are stateful. This internally managed – in fact, nearly hidden – state often makes it difficult to update the pipeline. Consider a simple aggregation such as “sum of transaction amounts over the last week”. When this feature is added to the system, no state is available. There are two strategies that may be used. First, we could compute the initial state by looking at the last week of data. This is nice because it lets us immediately use the new feature, but it may require either storing at least one week of data in the stream or a more complicated streaming pipeline that first reads events from historic data and then switches to reading from the stream. This might lead to a feature pipeline that looks something like the one shown below. Note that the work to concatenate and deduplicate would depend on how your data is stored, and it may need to happen for each source or kind of data. + +![](/_static/images/blog/unified-feature-engine_configurable-1.png) + +Alternatively, you could run the streaming pipeline for a week before using the feature. This is relatively easy to do, but rapidly breaks down if features use larger aggregation windows (say, a month or a year). It also poses problems for rapid iteration – if each new feature takes a week to roll out, and data scientists are trying out multiple new features a day, we end up with many feature pipelines in various states of deployment. + +![](/_static/images/blog/unified-feature-engine_configurable-2.png) + +A Feature Engine must provide support for computing the latest features in a way that is easily updated. The ideal strategy allows for a new feature to be used immediately, with state backfilled from historic data.
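+As a toy illustration of that ideal, the sketch below backfills a “sum of transaction amounts over the last week” feature directly from stored events (pandas, invented data); the final value is the state a streaming job would be seeded with:
+
+```python
+import pandas as pd
+
+# Historic transactions for one entity, indexed by event time.
+txns = pd.DataFrame({
+    "time": pd.to_datetime(["2022-03-01", "2022-03-03", "2022-03-06", "2022-03-10"]),
+    "amount": [10.0, 20.0, 5.0, 8.0],
+}).set_index("time")
+
+# Rolling 7-day sum at each event: the state the streaming
+# aggregation would have held, recomputed from history.
+weekly_sum = txns["amount"].rolling("7D").sum()
+print(weekly_sum)
+print("initial state:", weekly_sum.iloc[-1])
+```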
+## Rapid Iteration + +Another thing that is easy to forget as a Data Engineer is the iterative and exploratory process of creating and improving models. This requires the ability to define new features, train models to see how they work, and possibly experiment with their use in production. All of this requires providing the ability for Data Scientists to experiment with new features within a notebook. New features should be usable in production with minimal effort. A Feature Engine must allow a Data Scientist to rapidly experiment with different features and training example selections as part of iterative model creation. + +## Feature Engines + +We saw why computing features for machine learning poses a unique set of challenges which are not solved simply by a unified compute model. You may choose to build this yourself on existing data processing tools, in which case the above provides a roadmap for things you should think about as you put the building blocks together. An alternative is to choose a Feature Engine which provides the mentioned capabilities out of the box. This lets you provide Data Scientists with an easy way to develop new features and deploy them, requiring minimal to no intervention from Data Engineers. In turn, this lets you focus on providing more and higher-quality data to Data Scientists. + +## Things to look for when choosing a Feature Engine: + +1. A Feature Engine must provide support for computing features (including joins) at specific, data-dependent points in time. + +2. A Feature Engine must provide support for computing training examples from historic data and the latest results from a combination of historic and streaming data. Ideally, the choice of training examples is easy for Data Scientists to configure. + +3. A Feature Engine must provide support for computing the latest features in a way that is easily updated. The ideal strategy allows for a new feature to be used immediately, with state backfilled from historic data. + +4. A Feature Engine must allow a Data Scientist to rapidly experiment with different features and training example selections as part of iterative model creation. + +The Kaskada Feature Engine provides all of these and more. Precise, point-in-time computations allow accurate, leakage-free computations, including with joins. One query may compute the features at selected points in time for training examples. The same query may be executed to produce only final values for serving. Queries operate over historic and streaming data, with incremental computations used to ensure that only new data needs to be processed to update feature values. Features authored in a notebook may be deployed to production with no modifications. During exploration, the Data Scientist can operate on a subset of entities to reduce the data set size without affecting aggregate values. diff --git a/python/docs/blog/posts/2022-06-17-solving-gaming-challenges.qmd b/python/docs/blog/posts/2022-06-17-solving-gaming-challenges.qmd new file mode 100644 index 000000000..8617f43cc --- /dev/null +++ b/python/docs/blog/posts/2022-06-17-solving-gaming-challenges.qmd @@ -0,0 +1,40 @@ +--- +image: /_static/images/blog/solving-gaming-challenges_banner.png +title: How AI and ML are helping the gaming industry solve its biggest challenges +author: "" +date: "2022-06-17" +draft: true +--- + + +AI and ML are enabling incredible advancements across many sectors, and the gaming industry is no exception. Games are rich, virtual environments in which it is possible to gather huge amounts of event-based data from player interactions and behavior.
At many companies, this information is forgotten or not utilized effectively, and their games struggle to attract and retain players. However, gaming companies that can effectively apply event-based ML to their data can now, more than ever before, extract structure and meaning from player behavior and reap the benefits. These include drastically improved player engagement and retention, heightened player satisfaction, and ultimately much greater player monetization. + +Incorporating AI and ML into gameplay and player retention efforts may be the key to securing an outsized slice of the sizable gaming industry revenue pie. Gaming is more popular than ever before, and in recent years its revenues have eclipsed those of the more entrenched entertainment industries. Total gaming revenues (i.e. mobile, PC, and console gaming, combined) have topped $180 billion as of 2021, beating out those from television, digital music, and even box office films. Much of this growth has been driven by better player acquisition, retention, and personalization efforts powered by AI and ML. + +![](/_static/images/blog/solving-gaming-challenges_revenue.png) + +## Player Attraction and Retention + +Attracting new players to a game has always been a struggle for many mobile gaming companies. In today’s noisy world, where new games are released daily and app stores constantly advertise the latest and greatest thing, it can be difficult to appeal to prospective players over a longer time horizon. Luckily, ML can help companies identify the players who are most likely to download their game and then target those same individuals with advertisements, content, and outreach. In order to identify prospective players, it is vital to collect information from current players such as their device type, OS, network latency, email open and read events, ad impressions, browsing history, and location. These and many other features can be fed into a machine learning model which can be trained to predict which player demographics are most likely to be receptive to a given game. + +Similarly, player retention rates in the mobile gaming industry are abysmally low. In fact, on average, fewer than 5% of players continue playing a given game after the first day. The reasons why a player might stop playing a game are diverse and can range from frustration (too much difficulty) to boredom (not enough difficulty) to distraction with other games or content. At the core of keeping a player in a game is maximizing engagement, and this is achieved by properly calibrating the game’s difficulty, customizing messaging content towards players, and reaching out to them at the right times and via the right medium (e.g. sending a notification vs. an email). It can be difficult to tune all these parameters correctly, as they vary from game to game and player to player. Therefore, collecting event-based data is critical because it allows gaming companies to understand how players’ preferences change over time and adjust game content accordingly. Because only certain ML models are able to properly incorporate event-based data, choosing the right model and data ingestion platform is critical for gaming businesses that want to get the most from their data. + +Event-based data also plays a key role in designing a marketing pipeline to acquire players.
Based on historical data, gaming companies can iteratively adjust their marketing email and mobile notification flows, learning the best times and ways to reach out to players in order to keep them engaged. They can also use machine learning to assess how best to reestablish contact with those players who abandon a game and discover new and creative ways to incentivize them to return, perhaps by offering them inducements such as free in-game perks, powerups, and content. + +## Forecasting Player Behavior and NPCs + +Because game players are human, their behavior is inherently complex. However, that’s not to say that it’s entirely unpredictable. Using paradigms such as reinforcement learning and other advanced types of ML, it is now possible for game developers to anticipate how players might react to a given in-game situation. There are many reasons why being able to make such a prediction might be useful. The first is for designing smarter and more realistic non-player characters (NPCs). Whether they are allies or enemies, having NPCs anticipate player behavior and interact more convincingly has an incredible impact on player engagement, which in turn leads to improved retention. + +Forecasting player behavior can also be useful for generating dynamic game content. Open world and role-playing games are particularly well suited for utilizing such insights. Imagine an open world wild west game in which a player recruits a posse of cowboys in order to rob a bank, and the game realizes that this complex behavior has occurred and sends the sheriff and his deputies to confront the player in advance. Or visualize instead a situation in which entire worlds or maps are spawned in response to player actions. + +Recent advances in NLP and dialog systems are yet another way in which ML can better anticipate player behavior in order to generate realistic interactions. No longer are game creators limited to simplistic, predetermined conversational paths. Now, entirely novel conversations can be generated between players and NPCs, and the game storyline and structure can be modified in response to off-the-cuff interactions. + +## Player Ranking and Fraudulent Activity + +For online multiplayer games in particular, ranking players is a key contribution of ML. It’s important that the ranking and pairing of players is accurate, because improper ranking can lead to frustration, boredom, and low player engagement. A given player does not want to play with others who are too far above or below their current skill level, as this diminishes the fun of the experience. Closely related to this problem is the detection of fraudulent in-game activity. Detecting and removing players who are cheating the system by using bots or disallowed hacks is critical to ensuring that the gaming experience is fair and enjoyable for everyone. Event-based models can monitor player behavior over time and quickly detect aberrations that might be suggestive of cheating. + +Much of the difficulty of detecting cheating and fraud within gaming is that the model must operate entirely in real-time and across networks of many players, all in response to event-based data. Identifying and banning cheaters after many instances of cheating have occurred is not all that useful, as by that point the gaming experiences of other players have already been negatively impacted, causing frustration and perhaps leading them to abandon the game. A better solution is to block and remove cheaters from the game in real-time.
Development of tools to work with real-time, event-based data has enabled this ability, and [research](https://openworks.wooster.edu/cgi/viewcontent.cgi?article=11803&context=independentstudy) has validated the effectiveness of this approach. + +## Conclusion + +AI and ML have truly revolutionized what’s possible within the gaming space, allowing game developers to more effectively target potential customers, generate new and exciting in-game content in response to forecasted player actions, detect and prevent cheating, improve player ranking and pairing, and identify what causes individual players to abandon the gaming experience. A common thread underlying all of these newfound capabilities is the incorporation of event-based data into the modeling process, something which necessitates new tools. Feature engines are at the heart of event-based ML tooling and provide the requisite infrastructure for training models on top of what is inherently an ever-evolving training set. Static feature stores are good for models which train on fixed user data such as demographics, but quickly become limited in the context of dynamic game environments. However, given the right data toolbox, a development team can easily build a model which travels back in time across the entire history of user actions and updates in response to new behaviors. It is this powerful machine learning paradigm which will truly allow gaming companies to learn from all of the information they have at their disposal and secure their deserved piece of the gaming revenue pie. \ No newline at end of file diff --git a/python/docs/blog/posts/2022-07-01-hyper-personalization.qmd b/python/docs/blog/posts/2022-07-01-hyper-personalization.qmd new file mode 100644 index 000000000..9388ee79a --- /dev/null +++ b/python/docs/blog/posts/2022-07-01-hyper-personalization.qmd @@ -0,0 +1,47 @@ +--- +image: /_static/images/blog/hyper-personalization_banner.png +title: "Hyper-personalization in Retail: 5 Ways to Boost the Customer Experience" +author: "" +date: "2022-07-01" +draft: true +--- + +When it comes to shopping for a given product, consumers can choose from many different online and retail outlets, each of which is vying for their attention in an increasingly competitive marketplace. In order to stand out, retailers need to provide each customer with an individualized shopping experience that speaks uniquely to them. This is accomplished via _hyper-personalization_, the application of AI and machine learning technology to learn from customer behavior and customize the shopping experience accordingly in order to match each consumer’s individual preferences. In doing so, retailers can better attract their target demographics, improve customer engagement, and ultimately increase customer lifetime value. + +## Website Content Personalization + +One of the best ways to achieve hyper-personalization is by creating catered web content that is specific to individual customers. Many aspects of a retailer’s website can be personalized, from the content layout, to the landing page, to product recommendations, to the order flow. Based on past and real-time browsing history, retailers can learn about their customers’ specific preferences. For example, a clothing retailer might learn that before landing on their homepage, a customer visited a number of sites pertaining to a particular product category.
Using this information, a machine learning model can learn to push recommendations for similar products that are tailored toward this particular customer. They might also be able to gather personal details such as a user’s name or birthdate and use these to provide them with birthday offers or personalized greetings. + +While website personalization has many obvious benefits, its implementation can sometimes prove challenging. User activity data streams are constantly growing, and the associated feature engineering can prove complex. Temporal leakage can occur when subsampled points from the time series are used to predict intervening points, leading to models which appear to generalize but which in reality don’t. Being able to iterate quickly on feature definitions is also critical–a feature should continuously update itself as new data is streamed in, which is not possible with static feature stores. Furthermore, mapping feature definitions to user behavior and building understandable models is not possible without a feature engine that transparently computes features based on continuous streams of data. Finally, part of being able to choose the right feature is being able to rapidly iterate on different feature definitions. A feature engine makes this iterative process seamless, obviating the need to write many convoluted database queries in order to experiment with different feature definitions. + +## Personalization based on Shopping History, Behavior, and Insights + +As briefly mentioned in the last section, a customer’s shopping history and behavioral patterns provide one of the richest sources of mineable data for personalization. Shopping history can be sourced from both the customer’s account on a retailer’s website and third-party data acquirable via cookies. The value added here is clear. By understanding which items a customer has purchased in the past, retailers can predict using machine learning which items the customer is most likely to purchase in the future and then serve them recommendations linking to these items. Insights based on this data can be narrowly focused or far-reaching. A customer who has recently purchased a laptop will likely need accessories. Therefore, an electronics retailer would be well-served to recommend products such as laptop cases, USB A-to-USB C dongles, and docking stations. Less obviously, a model could infer that a customer who has recently purchased a blender, a toaster, and oven mitts might be moving to a new apartment and restocking on household essentials. As a result, it could recommend less closely related products such as cutlery sets, shower caddies, brooms, and mops. + +Integrating customer behavior is one of the more technologically complex avenues of personalization. This most often takes the form of tracking the amount of time the user spends on various site pages, the ways in which the user clicks through the site, and the types of products they add to their cart. Based on this aggregated data, machine learning models can be trained to add specific calls to action or expose real-time offers in order to convince a customer to complete a purchase. For example, if a customer spends a lot of time reading the reviews for a particular book and adds it to their cart but then tries to exit from the web page, the machine learning tool can quickly display a modal that tries to entice the customer into checking out by offering them a limited-time 5% discount on their order. 
Similarly, if the customer does decide to leave the webpage, the site can follow up with them via a personalized email reminding them of their outstanding cart as well as offering recommendations for books by the same author. All of this machinery requires that machine learning models be able to operate in real-time and with low latency. The model needs to be able to make predictions from features that incorporate dynamic, time-based behaviors and serve results immediately. This has to happen before the customer has time to complete a new action such as leaving the site. Everything involved in customer behavior is time-based, from the actions they take on a website to the time they spend on a given page. ML analyzes customers’ behaviors from the past in order to predict what they will do in the future. Given accurate predictions, a retailer can infer based on the customer’s behavior what products they want and what offers they will be receptive to. + +## Dynamic Pricing and Min Discount Calculation + +In order to optimize the sale of certain products, the price may need to be calculated dynamically. This dynamic pricing effect can also be accomplished by offering discounts to certain demographics. For example, a retailer offering a product may want to decrease their prices for lower-income residents of India but simultaneously may need to increase them for residents of the EU in order to offset the burden of import taxes and greater regulatory costs. Information about country of origin can be gleaned from a customer’s IP address, and product prices can be cost-adjusted accordingly. Similarly, as detailed in the last section, a company might want to decrease the price of a product or offer discounts based on a user’s behavior on their site. + +Dynamic pricing need not be predicated on location information. Within the US alone, Amazon and Walmart are two well-known retailers that regularly vary their prices, on a timescale as short as every 10 to 15 minutes, based on consumer behavior and real-time product demand. On the basis of such dynamic pricing algorithms, Amazon was able to increase its sales by 27.2% in 2013. + +Developing a model which handles dynamic pricing requires identifying the right features. Primary dimensions for specifying product demand might be selected by computing the number of impressions relative to specific demographics or the number of impressions relative to a particular product category. As the number of impressions and aggregate demand are time-varying variables, computing them requires combining historical data with real-time information. In database lingo, this requires a _point-in-time join_ that joins different tables at a single point in time in order to recreate the data state at that particular time. This can be a very difficult task to accomplish without the help of a dedicated feature engine.
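+As a small illustration of the idea (not a full solution), pandas offers an as-of join that picks, for each left-hand row, the most recent right-hand row at or before its timestamp; the tables below are invented:
+
+```python
+import pandas as pd
+
+# Pricing decisions, to be enriched with demand "as of" each decision time.
+decisions = pd.DataFrame({
+    "time": pd.to_datetime(["2022-05-01 10:00", "2022-05-01 10:15"]),
+    "product": ["widget", "widget"],
+})
+# Periodically updated demand snapshots (e.g., impression counts).
+demand = pd.DataFrame({
+    "time": pd.to_datetime(["2022-05-01 09:50", "2022-05-01 10:10"]),
+    "product": ["widget", "widget"],
+    "impressions": [120, 180],
+})
+
+# Point-in-time join: each decision sees only the latest demand
+# row at or before its own timestamp.
+joined = pd.merge_asof(decisions, demand, on="time", by="product")
+print(joined)
+```
+
+Generalizing this across many tables, entities, and continuously arriving data is where a dedicated feature engine earns its keep.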
+## Loyalty Programs + +Loyalty programs offer a way to engage customers by promising future rewards and gamifying the retail shopping experience. They are most commonly offered through a retailer’s website or mobile app and are typically implemented via virtual “points” that are accrued through digital or in-store purchases. Starbucks is famous for its app-based loyalty program which rewards customers based on their past purchases. Customers can trade in their points for free brewed coffee, bakery items, drink customizations such as flavor shots, or specialty drinks. Similarly, Sephora offers a popular loyalty program in which customers can earn points by purchasing Sephora merchandise and achieve different “status” levels based on those points, such as Very Important Beauty (VIB) status and Rouge status. Savvy retailers will use machine learning models to assess different ways of structuring their loyalty programs, from how to determine different status tiers to what sorts of rewards they will offer to customers. + +In a related vein, this is how social media companies optimize the process of sending users notifications in order to keep them engaged. Retailers can send mobile app or email notifications that reach out to customers and encourage them to keep shopping, be that by providing useful product recommendations, helpful how-to guides that lend themselves towards product purchases, or incentivization via exclusive loyalty program offers and discounts. Knowing when and how to send notifications, as well as which mediums work best, is something that is likely different for each customer and can be optimized via machine learning models which have access to real-time streaming data based on that customer’s behavior and past purchases. Even details as fine-grained as deciding what types of offers to provide and when to schedule them can be fine-tuned and traded off by a machine learning model trained on a rich enough dataset. + +For loyalty programs, there are endless types of data that can be collected in order to form more complete pictures of customer segments. For example, Starbucks is known to aggregate all data on user purchases made through the Starbucks app, and from there they can predict [down to the day of the week, and even the time of day](https://demand-planning.com/2018/05/29/how-starbucks-uses-predictive-analytics-and-your-loyalty-card-data/), which menu items you are most likely to purchase, and then provide enticements to get you to indulge just a little bit more. They are even able to include location, weather events (such as an oncoming storm that might send you seeking shelter in your nearest store), and local inventory data in order to determine which location you are most likely to visit and how to nudge you towards placing a mobile order. + +## Track Everything + +This leads us to our final point — track everything! There is no such thing as having too much data when it comes to knowing your customers. More data leads to better predictions and better customer experiences, which in turn lead to more purchases and increased customer lifetime value. While collecting data is critical, it’s also important to collect _the right kind of data_. The best data is often real-time or from the very recent past, as it provides the clearest snapshot of a customer at a given point in time. As a customer’s preferences evolve, so too does their behavior, and tracking that behavior is key to capturing the insights that it provides. + +However, data is only so useful in isolation. Machine learning models require that features be generated from the data, and creating these features can be tricky in a dynamic environment where the data is constantly changing. When working with time-based and sequential data, it’s important to use a feature engine that allows you to rapidly iterate on time- and event-based features.
Being able to predict across the entire history of a customer's behavior in conjunction with incoming signals is essential for building models that can personalize to individual customers and promote retention and brand loyalty. It’s also critical to ensure that your models are heavily optimized for speed and scalability in order to function in the dynamic retail environment, both online and offline. + +## Wrapping Up + +Hyper-personalization brings the power of machine learning to the retail environment, allowing retailers to better target customers and provide them with a superior shopping experience that is catered to their own unique needs. By receiving product recommendations that are relevant to them and offers that pique their interest, customers are incentivized to return to a retailer’s website or brick-and-mortar location and make repeat purchases. In order to enable this level of personalization, retailers are encouraged to track everything they can about their customers and utilize the insights gleaned from shopping history and behavior in order to personalize website content, compute dynamic pricing, and offer compelling loyalty programs. By tracking everything and harvesting the rich reservoirs of real-time data that are available, companies stand to better reach potential customers, increase the engagement of the customers they already have, improve customer loyalty, and maximize customer lifetime value. Having loyal customers is essential to building a thriving business, both from a fiscal and a marketing standpoint. Loyal customers continually return to a company to make repeat purchases that generate dependable long-term revenue. Furthermore, they often act as brand ambassadors by advocating for a company and its products to friends and family members. \ No newline at end of file diff --git a/python/docs/blog/posts/2022-07-15-data-engineers-need-kaskada.qmd b/python/docs/blog/posts/2022-07-15-data-engineers-need-kaskada.qmd new file mode 100644 index 000000000..bf9efd067 --- /dev/null +++ b/python/docs/blog/posts/2022-07-15-data-engineers-need-kaskada.qmd @@ -0,0 +1,83 @@ +--- +title: Why Data Engineers Need Kaskada +image: /_static/images/blog/data-engineers-need-kaskada_banner.png +author: "" +date: "2022-07-15" +draft: true +--- + +Building and deploying machine-learning (ML) models is challenging. It’s hard for many reasons, but one of the biggest is that the compute engines we use to build ML models were designed to solve different and incompatible problems. + +Behavioral models work by revisiting the past to learn how the world behaves, then using these lessons to predict the future. The ability to revisit the past is _central_ to training behavioral models, but traditional compute engines don’t know anything about time. These tools tell us the _answer_ to a query, but we need tools that tell us the **_story_** of how the answer has changed over time. + +Kaskada is a time-based feature engine based on manipulating _streams of values_. These streams tell the story of exactly how the result of a computation changes over time. Knowing the whole story allows Kaskada to provide a unique set of powerful time-based operations that make it possible to develop and deliver behavioral ML models rapidly. + +### What other engines are good for + +Traditional compute engines serve many use-cases well.
For example, data analysis over “web-scale” datasets has benefitted from decades of investment and innovation: tools like Spark, Presto, and Snowflake make it possible to analyze data at a massive scale. Whatever your analytic use case, there is likely a mature compute engine you can use. These systems emphasize throughput and model computation as a short-lived query or job. + +Similarly, event handling at scale has benefitted from significant innovation and investment. We can now choose from scalable event logs such as Kafka, Pulsar, or RabbitMQ. You can implement handling logic using frameworks such as Flink or Apex. Streaming compute engines are less mature than their analytic counterparts, but recent years have seen significant improvements. Many use cases are well served: monitoring, event-sourcing, and aggregations over short-lived windows, for example. These streaming systems emphasize latency and model computation as a long-lived process. + +You’ve probably worked with many of these systems before. Many organizations have developed robust technical infrastructure to support data collection, analysis, monitoring, and real-time event processing. + +It’s reasonable to assume that the tools and infrastructure that support these use cases so well will also support the development and deployment of ML models. Unfortunately, this often isn’t the case; behavioral ML has unique and challenging requirements. + +### Why Behavioral ML is hard with other engines + +ML is a broad category that includes everything from simple classifiers to massive neural networks for machine translation. Kaskada focuses on _behavioral_ models - models that learn cause-and-effect relationships to predict how the world will behave in the future, given our knowledge of the past. These models are used in applications such as personalization, demand forecasting, and risk modeling. Building a behavioral model is typically slow, complex, and frustrating - because understanding behavior means understanding the _story_ of cause and effect. + +### Time Travel + +Our knowledge of the world is constantly growing: the passage of time uncovers the _effects_ that behavioral models learn to correlate with earlier _causes_. Training behavioral models requires reconstructing what was known about the world at specific times in the past - limiting computations to inputs older than a given point in time. When making a prediction, the most important information is often the most recent, so fine-grained time travel is necessary. Existing analytic compute engines treat time as just another column and don’t provide abstractions for reconstructing the past. + +### Deferred Decisions + +Feature engineering is an experimental process. To complicate matters further, it’s often impossible to know beforehand the features and observation times needed to train a successful model. Improving model performance generally requires an iterative process of trial-and-error - each iteration exploring different ways of filtering and aggregating information. + +### Incremental Computation + +While training a behavioral model often requires large amounts of historical data, it’s often necessary to make predictions quickly once a model is in production. Usually, models whose inputs depend on aggregating historical data precompute these aggregates and maintain the “current” value in a dedicated feature store. Existing tools force a choice between throughput-optimized analytic engines and latency-optimized streaming engines, but both use-cases are part of the ML lifecycle.
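+ +To make the time-travel requirement above concrete, here is a rough sketch of what reconstructing “the world as of each prediction time” looks like when a traditional engine treats time as just another column. The schema is hypothetical (a **prediction_times** table of entity/timestamp pairs plus an **events** log), and every feature must repeat the cutoff condition by hand: + +``` +-- Hypothetical schema: an event log plus a table of +-- (entity, prediction time) pairs to reconstruct state for. +select + p.entity_id + , p.prediction_at + , count(e.event_id) as event_count_so_far + , sum(e.revenue) as revenue_so_far +from + prediction_times p + left join events e + on e.entity_id = p.entity_id + and e.event_at < p.prediction_at -- the hand-written "time travel" constraint +group by + 1, 2 +``` + +Every additional feature repeats the same join and cutoff logic, which is exactly the boilerplate that time-aware operations can remove.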
+ +### Sharing & Discovery + +High-quality features can often be re-used to make several related models - similar to shared software libraries. Like software, feature definitions must be managed over an extended life cycle during which different individuals may join and leave the team. The ability to reuse feature definitions and quickly understand the logic behind a given feature significantly reduces the cost of developing and maintaining ML models. The architecture required to make existing tools work for ML often results in a system whose complexity makes feature sharing and maintenance painful. + +## How Kaskada is different + +Kaskada is a new type of feature compute engine designed from the ground up to address these challenges. Computations in Kaskada are described using composable abstractions intended to tell stories from event-based data and bridge the divide between historical and incremental computation. + +### Abstractions + +Tools like Spark describe computation using operations drawn from functional programming (map, reduce, filter, etc.) and apply these operations to data represented as RDDs or dataframes. SQL-based query engines describe computation using operations drawn from relational algebra (select, join, aggregation, and element-wise mapping) and apply these operations to data represented as tables. + +Kaskada describes computation using a set of operations similar to SQL’s element-wise and aggregation functions (add, sum, min, etc.) but applies these operations to time-stamped streams of values. + +### Unique Operations + +By keeping the time component of computations, Kaskada can concisely describe many operations that would be difficult or impractical using the abstractions of traditional compute engines. + +Kaskada distinguishes between two types of value streams: discrete and continuous. Discrete values are described at specific instants, for example, events. Continuous values are defined over time intervals, for instance, aggregations like the maximum value seen to date. This distinction makes it easy to zip together time streams computed at different points in time, for example combining event-level values with aggregated values. + +Values in a stream are associated with an “entity.” Aggregations are limited to values with the same entity. These unique operations result in queries with very little boilerplate code specifying group-by and join constraints and maintain the ability to group, regroup and join between differently-grouped values. Every operation in Kaskada preserves time, including joins - the result of a join operation is a stream describing how the result of the join changes over time. + +Kaskada supports operating both on a stream’s values and its times. Streams can be shifted forward in time to combine values computed at different times. For example, when labeling an example’s outcome one week in the future, the predictor features are shifted forward a week and combined with a computed target feature to produce a training example. Values cannot be shifted backward in time, making it challenging to introduce temporal leakage into training datasets accidentally. + +### Unified Queries over Historical & Incremental Data + +A given Kaskada computation can produce training examples from historical events _or_ process new events as they arrive incrementally. This unification is possible because queries are described in terms of value streams, allowing the details of incremental vs. 
non-incremental computation to be abstracted away from the query author. + +Kaskada builds on this to support continuously writing the results of a given query to external data stores, a feature we call “materializations.” You can use materializations to keep a low-latency feature store up to date as data arrives to support online predictions. + +### Composable & Declarative MLOps + +Kaskada supports the extended ML lifecycle by streamlining feature reuse and integrating with your existing infrastructure management process. Features in Kaskada are easy to understand: both “what” to compute and “when” to compute it are described as part of the query expression. Since features are just code, they’re easy to reuse, refine, and extend. Kaskada infrastructure can be managed as code in production using a declarative specification file and the same code review and change management processes you already have in place. + +## Conclusion + +Behavioral ML models are challenging to develop and deploy using traditional compute engines due to these models’ unique requirements. Building training datasets depends on understanding the stories in your data, and the trial-and-error nature of model training means you need the ability to experiment with different aggregations and prediction time selections interactively. After training a model, you often need to maintain the current value of millions of feature vectors as data arrives to support online predictions. Features can often be re-used in multiple models, so improving feature discoverability and reusability can significantly accelerate the “second model.” + +We designed Kaskada specifically to support iterative time-based feature engineering. By starting with a new set of abstractions for describing feature computation, Kaskada provides unique and powerful operations for describing what to compute and when to compute it. Kaskada is fully incremental, allowing features to be efficiently updated as data arrives. Kaskada queries are composable and readable, simplifying feature discovery and reuse. + +In the past, behavioral ML has been hard to develop and maintain, but it doesn’t have to be. By using the right tool for the job, you can build and deploy models in a fraction of the time. After migrating their behavioral ML feature engineering from traditional compute engines to Kaskada, our users typically see at least 25x reductions in time-to-production and are able to explore thousands of times more feature variations - resulting in better-performing models. + +Trying out Kaskada is easy - you can sign up today for free and start discovering the stories hidden in your data.
\ No newline at end of file diff --git a/python/docs/blog/posts/2022-08-01-time-centric-design.qmd b/python/docs/blog/posts/2022-08-01-time-centric-design.qmd new file mode 100644 index 000000000..68f08ab48 --- /dev/null +++ b/python/docs/blog/posts/2022-08-01-time-centric-design.qmd @@ -0,0 +1,158 @@ +--- +title: How data science and machine learning toolsets benefit from Kaskada’s time-centric design +image: /_static/images/blog/time-centric-design_banner.png +author: Brian Godsey +date: "2022-08-01" +draft: true +--- + +## Part 1 — Kaskada Was Built for Event-Based Data + +This is the first part in our “How data science and machine learning toolsets benefit from Kaskada’s time-centric design” series. It illustrates how data science and machine learning (ML) toolsets benefit from Kaskada’s time-centric design, and specifically how Kaskada’s feature engineering language, FENL, makes defining and calculating features on event-based data simpler and more maintainable than the equivalents in the most common languages for building features on data, such as SQL and Python Pandas. + +## Event-based data + +A software ecosystem has been growing for many years around generating and processing events, messages, triggers, and other point-in-time datasets. The “Internet-of-Things” (IoT), security monitoring, and behavioral analytics are areas making heavy use of this ecosystem. Every click on a website can generate an event in Google Analytics. Every item purchased in an online video game may send event data back to the main game servers. Unusual traffic across a secure network may trigger an alert event sent to a fraud department or other responsible party. + +Devices, networks, and humans are constantly taking actions that are converted into event data which is sent to data stores and other systems. Often, event data flows into data lakes, from which it can be further processed, standardized, aggregated, and analyzed. Working with event data—and deriving real business value from it—presents some challenges that other types of data do not. + +## Challenges of Event-based Data + +Some properties of event data are not common in other data types, such as: + +1. Events are point-in-time actions or statuses of specific software components, often without any context or knowledge of the rest of the system. + +2. Event timestamps can be at arbitrary, irregular times, sometimes down to the nanosecond or finer. + +3. Events of different but related types often contain different sets of information, making it hard to combine them into one table with a fixed set of columns. + +4. Event aggregations can standardize information for specific purposes, e.g. hourly reporting, but inherently lose valuable granularity for other purposes, e.g. machine learning. + + +None of these challenges are insurmountable by themselves, but when considered together, and in the context of event data being used for many purposes downstream, it can be particularly hard to make data pipeline choices that are efficient, both from a runtime perspective and from a development and maintenance standpoint. + +## Simple case study + +As a simple example, let’s assume we have e-commerce event data from an online store, where each event is a user action in the store, such as a login or an item sale. This event data has two downstream consumers: a reporting platform and a data science team that is building features and training datasets for predictive machine learning models.
We want our data pipeline to enable both downstream consumers in an efficient, maintainable way. + +Our data is stored in a database table called **ecommerce_event_history** and looks like this: + +![](/_static/images/blog/time-centric-design_database-table.png) + +The reporting platform needs hourly aggregations of activity, such as event counts and revenue totals. The data science team is building and experimenting with features that may be helpful for machine learning models that predict future sales, lifetime value of a customer, and other valuable quantities. While some quantities calculated for the reporting platform may be of use to the data scientists, their features need to go beyond simple aggregations, and so the data scientists need a way to quickly build and iterate on features and training datasets without the help of data engineers and others who build the reporting capabilities on the same event data. + +## FENL vs SQL for Hourly Reporting + +Kaskada’s feature engineering language, FENL, is built specifically for working with time-centric features on event-based data. Aggregations of events over time are also a core competency of FENL—aggregations can be considered a type of feature. The hourly reporting platform at our e-commerce company mainly requires simple aggregations, such as those shown below in SQL and the equivalent in FENL. + + +SQL +``` +SELECT + DATE_TRUNC('hour', event_at) as reporting_at, + entity_id as user_id, + count(*) as event_count, + SUM(revenue) as revenue_total +FROM + ecommerce_event_history +GROUP BY 1, 2 +``` + +FENL +``` +let event_count = ecommerce_event_history | count(window=since(hourly())) +let reporting_date = event_count | time_of() +let user_id = ecommerce_event_history.entity_id | last() + +{ + reporting_date, + user_id, + event_count, + revenue_total: ecommerce_event_history.revenue | sum(window=since(hourly())) +} +| when(hourly()) +``` + +The SQL aggregations are fairly straightforward, and the FENL equivalent is similar but slightly different due to FENL’s special handling of time variables and our use of the _let_ keyword for assigning variable names within the query, for clarity and convenience. Both versions use a _count_() and a _sum_() function, and both group by user_id and date, though FENL’s grouping is implicit because user_id is the entity ID and because the hourly() function is used to specify the times at which we would like to know results. For more information on entities and the treatment of time in FENL, see [the FENL language guide](https://kaskada.io/docs-site/kaskada/main/fenl/language-guide.html). + +Overall, SQL and FENL are comparable for these simple aggregations, with the main trade-off being that SQL requires an explicit GROUP BY where FENL uses the time function hourly() and implicit grouping by entity ID to accomplish the same thing. In addition, if you want to have values for every hour, not just the hours that contain events, SQL requires some non-trivial extra steps to add those empty hours and fill them with zeros, nulls, etc. FENL includes all hours (or other time intervals) by default, and if you don’t want those empty intervals, a simple row filter removes them. + +## FENL vs SQL for ML Features + +Features for machine learning generally need to be more sophisticated than the basic stats used in reporting analytics. Data scientists need to build beyond those simple aggregations and experiment with various feature definitions in order to discover the features that work best for the ML models they are building.
+ +Adding to the case study from above, our [FENL vs SQL vs Python Pandas Colab notebook](https://colab.research.google.com/drive/1Wg02zrxrJI_EEN8sAtoEXsRM7u8oDdBw?usp=sharing) takes an expanded version of the e-commerce case and builds ML feature sets that are significantly more sophisticated than the example code for hourly aggregations that we showed previously. It’s not that any single feature is particularly complicated, but the combination of hourly aggregations, like revenue total, with full-history stats, like the age of a user’s account and the time since the last account activity, makes building the feature set as a whole more complicated. + +See the example features below in both SQL and FENL, and see the linked notebook for more details and working code. + +## SQL + +``` +/* features based on hourly aggregations */ +with hourly_agg as ( + select + entity_id + , event_hour_start + , event_hour_end + , event_hour_end_epoch + + /* features from hourly aggregation */ + , count(*) as event_count_hourly + , sum(revenue) as revenue_hourly + from + events_augmented /* the equivalent of CodeComparisonEvents in this notebook */ + group by + 1, 2, 3, 4 +) + + +/* features based on full event history up to that time */ +select + hourly_agg.entity_id + , hourly_agg.event_hour_end + , hourly_agg.event_hour_end_epoch + , hourly_agg.event_count_hourly + , hourly_agg.revenue_hourly + + /* features from all of history */ + , count(*) as event_count_total + , sum(events_augmented.revenue) as revenue_total + , min(events_augmented.event_at_epoch) as first_event_at_epoch + , max(events_augmented.event_at_epoch) as last_event_at_epoch +from + hourly_agg + left join events_augmented + on hourly_agg.entity_id = events_augmented.entity_id + and hourly_agg.event_hour_end >= events_augmented.event_at +group by + 1, 2, 3, 4, 5 +``` + +## FENL +``` +# set basic variables to be used in later expressions +let epoch_start = 0 as timestamp_ns + +# create continuous-time versions of inputs +let timestamp = CodeComparisonEvents | count() | time_of() +let entity_id = CodeComparisonEvents.entity_id | last() + +in { + entity_id, + timestamp, + event_count_hourly: CodeComparisonEvents | count(window=since(hourly())), + revenue_hourly: CodeComparisonEvents.revenue | sum(window=since(hourly())), + event_count_total: CodeComparisonEvents | count(), + revenue_total: CodeComparisonEvents.revenue | sum(), + first_event_at_epoch: CodeComparisonEvents.event_at | first() | seconds_between(epoch_start, $input) as i64, + last_event_at_epoch: CodeComparisonEvents.event_at | last() | seconds_between(epoch_start, $input) as i64, +} + +| when(hourly()) +``` + +FENL, in this example, allows us to write a set of feature definitions in one place, in one query block (plus some variable assignments for conciseness and readability). It is not a multi-stage, multi-CTE query with multiple JOIN and GROUP BY clauses, like the equivalent in SQL, which adds complexity and lines of code, and increases the chances of introducing bugs into the code. + +Of course, FENL can’t simplify every type of feature definition, but when working with events over time and a set of entities driving those events, building features for predicting entities’ future behavior is a lot easier with FENL. FENL’s default behavior handles multiple simultaneous and disparate time aggregations in a natural way and joins resulting feature values on the time variable by default as well. All of this comes with syntax that feels clean, clear, and functional.
See our [FENL vs SQL vs Python Pandas Colab notebook](https://colab.research.google.com/drive/1Wg02zrxrJI_EEN8sAtoEXsRM7u8oDdBw?usp=sharing) for more detailed code examples. More information and example notebooks are available on [Kaskada’s documentation pages](https://kaskada.io/docs-site). \ No newline at end of file diff --git a/python/docs/blog/posts/2022-09-01-6-factors.qmd b/python/docs/blog/posts/2022-09-01-6-factors.qmd new file mode 100644 index 000000000..d340ea554 --- /dev/null +++ b/python/docs/blog/posts/2022-09-01-6-factors.qmd @@ -0,0 +1,73 @@ +--- +title: 6 Factors to Consider When Evaluating a New ML Tool +image: /_static/images/blog/6-factors_banner.png +author: Ben Chambers +date: "2022-09-01" +draft: true +--- + +The incredible hype surrounding machine learning has precipitated a vast ecosystem of tools designed to support data scientists in the model development lifecycle. However, this has also made selecting the right suite of tools more challenging, as there are now so many different options to choose from. When evaluating a particular product, the key is to assess if, and how, it will enable you to perform your job more reliably, quickly, and impactfully. Towards that end, what follows are 6 factors you can use to better understand how Kaskada will impact your existing workflow. + +## 1. Development Velocity and Throughput + +In an ideal world, data scientists would have as much time as they’d like to test, tweak, and recalibrate their models. Unfortunately, models have a shelf life, businesses need to make an impact, and as more models are deployed, more need to be maintained—we call this the data-dependency debt shown in the figure below. In addition, the huge hype around ML means business leaders have correspondingly high expectations of it. You as a data scientist need to be able to demonstrate value by rapidly iterating on feature definitions and models in order to produce results. + +![](/_static/images/blog/6-factors_development-and-velocity.png) + +**_Data-dependency debt:_** + +**_As more models are deployed, more need to be maintained, and other solutions can’t scale to meet ML teams’ demand for event-based data._** + +One of the biggest time sinks for data scientists is the feature engineering and selection process. Especially problematic in this regard is creating new features from time series, event-based, or streaming data, where changing a feature definition can mean having to rerun a computation across the entire historical dataset, or settling for proxy features with suboptimal model performance. + +Feature engines such as Kaskada offer a rich time-traveling toolset that enables instant iteration on feature definitions and on when training examples are observed, via the ability to interact with _slices_ of datasets. Slicing a vast dataset while preserving its statistical properties makes experimental feature runs significantly faster, enabling you to experiment on a subset of your entities. This frees you up to focus on asking the right questions, building the best features, and testing your hypotheses instead of waiting for backfill jobs or settling for proxy datasets captured at suboptimal points in time. + +In addition, maintaining models via retraining and validation on new data becomes simpler, increasing your throughput.
Generating a labeled training dataset with new data is a matter of running a query, and, if it turns out that your features are no longer predictive, finding new ones is significantly faster. Plus, testing a new model at a different prediction time, using existing feature definitions, is a matter of adjusting a single line of code and rerunning the query. + +## 2. Runtime Efficiency + +While a lot of tools can seem great on smaller datasets or on pre-aggregated data, the way they’re implemented under the hood can be subpar for event-based data, leading to drastic inefficiencies that slow down your workflow or prevent you from operating at the scale you’d like. Feature engines such as Kaskada have a different computational model that is temporal and always incremental. Kaskada is time and entity aware, reorganizes data, and efficiently caches as needed in various stages. Kaskada can be deployed into your cloud with configurable resource allocation, allowing you to scale your feature engine performance to meet your feature computation needs. + +Generating training datasets is only half the job of a feature engine. Runtime efficiency can be even more important in a production environment, where the features for many more entities need to be computed and maintained. Kaskada is a declarative feature engine, and customers can use our Amazon AMI to deploy one or more instances of Kaskada’s feature engine inside their own infrastructure in the region where their data is located. The ability to deploy multiple feature engines lets you trade off costs and efficiency as needed. Now your CI/CD automation processes can deploy new features and feature engines quickly, without costly code rewrites. + +## 3. Compatibility and Maintainability + +A new tool won’t do you much good if it doesn’t integrate well with upstream and downstream components, such as ML modeling libraries. The best tools will drop into your existing data science toolset and work alongside most, if not all, of the tools you’re already using—while adding additional capabilities or increasing your productivity. + +Kaskada is available as a package in Python, the most popular language for data science, and it can query databases via a query language called FENL, which is even easier to use than SQL. You can connect once to all your structured event-based data — enabling access to the source of truth powering the downstream pre-aggregated data sources that you’re used to working with. This data contains additional information necessary to explore behavior and context. All this while allowing you to iterate in your existing workflow using the Python client library, our APIs, or other integrations to explore and create new time-based features. + +Queries return dataframes and parquet files for easy integration with your visualization, experiment tracking, model training, AutoML, and validation libraries. Kaskada also enables reproducibility and auditability via data-token-ids, which can be used to compute incremental changes, audit previous results, and be tracked as metadata alongside experiments, models, and predictions. + +## 4. ML Model Performance + +Of course, a primary consideration in a data scientist’s workflow is the performance of the model. This is also where you as a data scientist demonstrate value to the organization. Without a high-performing model that has a direct, positive impact on business KPIs, the data scientist’s job is irrelevant.
+ +There are several ways to measure model performance, from common accuracy scores to train-test dataset measures in the lab. Precision, recall, F1 scores, uncertainty, and confidence are all factors to consider when assessing a model’s performance. But these scores only tell you whether the model is performing as intended if you also take these measurements across the inclusivity and fairness variables you care about. + +Part of the difficulty in training a model to improve KPIs is that the connection between the two is often fuzzy and hard to discern. It can be difficult to understand and interpret how incremental changes to the model affect the downstream KPIs. However, this problem can often be ameliorated by building explainable models with clear feature definitions. + +When working with event-based and streaming data, these feature definitions are often more complicated to express and compute than those involved in static models, making the discernment process even more challenging. Often these features are subject to drift and require bias handling, using evaluation methods to spot erroneous predictions. They are also subject to frequent retraining and evaluation. + +Feature engines such as Kaskada illuminate the relationship between features, model performance, and KPI improvements by transforming complicated event-based feature definitions into clear, readable expressions using logical operators, allowing data scientists to uncover the hidden relationships between models and business processes. In addition, Kaskada provides rapid development and training speed, shortening the time to bring a data science product to market. With Kaskada you can discover new, relevant features with 1000x more self-service iterations on train, test, and evaluation datasets. Ultimately, this keeps your models performing better at a 26x faster pace. + +## 5. Transparency + +As alluded to before, part of a data scientist’s job lies in generating explanations for complex phenomena. In order to do this reliably, the data scientist has to first be able to understand how each of the tools in their toolchain operates and affects both data and models. Kaskada provides transparency in three complementary ways. + +Transparency is first achieved through readability, and queries for behavioral ML are more concise to write in Kaskada. Compared to an existing survival analysis accelerator, Kaskada recently simplified the code from 63 pages down to 2. Since features in Kaskada can be fully described by a single, composable, and concise query, they are also much easier to understand, debug, and get right. + +In addition, rewriting feature definitions from historical to real-time data systems is a common and frustrating source of bugs. Computing features using a feature engine eliminates common developer errors and simplifies code by sharing the same feature definition computed at the current time. This makes it easier to understand for anyone who needs to fiddle with the model, connect results to business processes, update feature definitions, or otherwise write code within the surrounding software system. + +Second, Kaskada is a transparent tool in its own right. Data scientists can easily inspect the results of their queries on existing data by defining _views_, and they can watch as these views automatically update in response to incoming streaming data.
In doing so, data scientists gain a better understanding of how their feature definitions apply to their datasets and are able to adjust the features and models accordingly. + +Finally, for every query, historical or real-time, a data-token-id is returned for logging. This token provides a snapshot of the data, and combining it with the query allows for reproducing the state of the data and the resulting feature values to audit why a certain prediction was made and justify the decision of the model. + +## 6. Support and Documentation + +Even in the best of hands, tools will inevitably break or fail to operate in an expected way. It is in these times that the true strength (or weakness) of the organization backing a product is revealed. You’ll want a product that has robust, well-maintained documentation that explains all of the product’s features. You’ll also want access to solution accelerators where you can see how the tool is used to accomplish common tasks in existing applications. + +Finally, you’ll want an organization you can turn to for personal support when debugging fails or when something breaks. Having access to technical members of the product’s development team who can get to the root of the problem is crucial. Kaskada provides [documentation](https://kaskada.io/guide) of all product features, demos, and case studies, and enterprise support for issues that arise in development and production. It is also under active development by a team that is responsive to users’ issues and feature requests, both within an enterprise context and beyond. + +## Final Thoughts + +Committing to an ML tool, particularly something as integral as a feature engine, is not a decision to be taken lightly. Because ML models are built on features, having features that are easily understandable, efficiently computed, and scalable, and that can accommodate event-based data, is imperative for maximizing your efficiency as a data scientist and cementing your value to an organization. Being able to demonstrate value by connecting improvements in business results and KPIs to the models you implement is the key to being promoted and obtaining greater influence and reach within your organization. An excellent feature engine such as Kaskada will enable you to get more done, have an outsized impact in your role, and work to your peak potential. \ No newline at end of file diff --git a/python/docs/blog/posts/2022-10-01-why-kaskada.qmd b/python/docs/blog/posts/2022-10-01-why-kaskada.qmd new file mode 100644 index 000000000..396691d91 --- /dev/null +++ b/python/docs/blog/posts/2022-10-01-why-kaskada.qmd @@ -0,0 +1,29 @@ +--- +layout: post +title: Why Kaskada? – The Three Reasons +date: "2022-10-01" +image: /_static/images/blog/why-kaskada_banner.png +draft: true +--- + +Kaskada accelerates your model training workflows by allowing you to easily create and iterate on new features from event-based data. In doing so, data scientists and engineers can now control the context of model predictions. For example, Kaskada allows data scientists to assess whether a model is best suited to predict on a daily/hourly basis or to make predictions in real-time—as users are about to take action or make a purchase decision—flexibility that allows them to understand customer behavior better. Additionally, they can rapidly ship features to production and accelerate the model development lifecycle.
+ +## Fast Feature Engineering and Iteration + +The [typical process](2022-05-01-why-you-need-a-unified-feature-engine-not-a-unified-compute-model.html) for generating features from event-based data is complex, manual, and frustrating. If you’ve ever tried building a training dataset for a model that required taking examples at different data-dependent points in time, or that necessitated a point-in-time join, you probably know what we’re talking about. These tasks are just too hard to accomplish within the iterative exploration process needed in feature engineering. + +What really happens in these cases is a lot of friction—building bespoke ETL pipelines, waiting for data to be collected, or managing complex backfill jobs—often only to find the needed data is missing from those pipelines or the features aren’t observed when the model will be making predictions. And even if you can make it work, the model will often underperform, and temporal leakage can be introduced, kicking off a long process of diagnosing an underperforming model. + +[Kaskada bypasses this difficulty](2022-09-01-6-factors.html) by allowing users to define features and model context in terms of computations and then apply those computations to data across arbitrary slices of time. Once in production, these time-based features undergo an efficient incremental update as new data is streamed in, saving valuable time compared to having to recompute the feature using the entire history of the data. Furthermore, Kaskada can be accessed via a Python client library, integrates with your existing data silos, and can ingest any type of structured data, meaning there’s very little additional work needed to incorporate it into your feature engineering process. + +## Discover How Different Contexts Affect Behavior + +A context describes how a person’s behavior is informed. In particular, understanding the situation surrounding what a customer is doing before and after taking a certain action is critical to understanding their mindset as well as how they’re likely to act again in the future. Machine learning is predicated on this assumption, but it can be challenging to adjust the contexts that are provided to a model in order to see how predicted behavior changes. Kaskada allows data scientists to specify and iterate on the context under which customer behavior is observed before training models, all without having to rewrite a data pipeline and manually recompute feature values. This allows data scientists to [_explore a wider variety of contexts and learn which ones are most predictive_](2022-12-02-model-contexts-affect-behavior.html) with enough time left to achieve the desired end goal, such as higher engagement or retention. + +## Easily Deploy Features to Production + +One of the [primary barriers](2022-07-15-data-engineers-need-kaskada.html) to getting features into production is that data scientists are not working in a scalable, performant language or environment. Costly rewrites of code and feature definitions are required to achieve the desired latency. It can also be difficult to share feature definitions between data scientists and across teams, slowing down the process further with repeated work. Kaskada provides a declarative framework enabling you to share resources as code, from table definitions to views and materialized views that fully define a feature. Resources can easily be version controlled and shared.
Furthermore, once features are ready for production, they can be kept up to date in any data store, such as Amazon S3, DynamoDB, Redis, BigTable, or a whole host of others. Finally, it is easy to replace production features and assess new ones, all without affecting the availability of your ML system. + +## Summary + +Behavioral ML involves working with event-based data, a process that is more difficult, tedious, and time-consuming than traditional batch prediction methods. Models need to be provided with the right features in real time in order to make correct predictions, and digging into these contexts can be incredibly difficult with existing tools. However, this is where Kaskada shines. Kaskada allows you to rapidly iterate on feature definitions and observation times from event-based data, allowing you to move features into production 26x faster and experiment with thousands of hypotheses quickly. Kaskada also makes your modeling more interpretable, giving insight into how context affects customer behavior without requiring you to rewrite your data pipelines or manually recompute feature values. Finally, Kaskada makes deploying to production easy by allowing you to share resources as code and integrate with your existing CI/CD systems. \ No newline at end of file diff --git a/python/docs/blog/posts/2022-12-02-model-contexts-affect-behavior.qmd b/python/docs/blog/posts/2022-12-02-model-contexts-affect-behavior.qmd new file mode 100644 index 000000000..d9e2eeaf6 --- /dev/null +++ b/python/docs/blog/posts/2022-12-02-model-contexts-affect-behavior.qmd @@ -0,0 +1,41 @@ +--- +title: Discover How Different Model Contexts Affect Behavior +image: /_static/images/blog/model-contexts-affect-behavior_banner.png +author: Charna Parkey +date: "2022-12-02" +draft: true +--- + +Predicting customer behavior is not easy, and attempting to alter it predictably is even harder. In order to develop machine learning models that can accurately anticipate and guide customer behavior, it’s critical to understand how your model’s training contexts affect possible interventions and change the final prediction. Can we retain a subscription customer predicted to churn, prevent fraud before it happens, or get non-paying players to spend money so we don’t need to rely only on the less than 5% who usually do? + +## Subscription Businesses + +_Model Context_ refers to the time domain used by a model to make predictions, and therefore to the relative time-slices that can be used to select training data for the model. Aligning training datasets with a model’s future use case is critical for obtaining the best performance. For example, when training a model to predict customer retention based on past customer actions, we can choose whether we want to feed the model data about the customer up to different points in time, such as 3 months or 9 months past the start of their subscription. Other activity-dependent times to consider might be 7 days after a new feature release or at the onset of a changing market condition or significant world event. The points in time at which we compute the training data and provide it to the model affect how the model understands the customer and provide insight into what actions can be taken to change their behavior, with the end goal of driving engagement and, ultimately, renewal.
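+ +As a rough sketch of what selecting one such context looks like in practice, a “3 months past subscription start” training context amounts to a per-customer, data-dependent cutoff on which events the model is allowed to see. The Postgres-style SQL below is purely illustrative, and the **subscriptions** and **events** tables are hypothetical: + +``` +-- Hypothetical schema: a subscriptions table plus an event log. +-- Each training example sees only the events observed before the chosen +-- context cutoff: 3 months past that customer's subscription start. +select + s.entity_id + , count(e.event_id) as event_count_in_context + , sum(e.revenue) as revenue_in_context +from + subscriptions s + left join events e + on e.entity_id = s.entity_id + and e.event_at < s.subscription_start + interval '3 months' +group by + 1 +``` + +Changing the context (to 9 months in, or to 7 days after a feature release) means changing only that cutoff condition.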
+ +Giving a model information about a customer up to 3 months past their subscription start might allow the model to perform very well at predicting customer retention 3 months in, when there is still enough time to make an impact on enterprise renewals, but new data and features are needed at 9 months or years later. Moreover, mismatching a model’s training context and its prediction context, i.e. using a model trained at 3 months of adoption to predict at 9 months, leads to underperforming models and often completely erroneous results. That said, smaller contexts can allow for quickly building larger datasets, finding leading indicators of success, faster model training, faster inference, and a potentially finer granularity of events upon which the model can make predictions. Thus, understanding how to choose your contexts _and_ features iteratively, with context windows of varying sizes, starting points, and ending points, provides the key to training robust models that predict what is important while there is still enough time to change the outcome. + +Consider an enterprise SaaS subscription business looking to allocate the resources of their Customer Success representatives within a quarter. Each rep has a book of business that renews staggered over the course of the year; they are expecting to win some and to lose some, but nothing should be a surprise. In fact, the purpose of this team is to proactively manage attrition and expansion. Some accounts will start this quarter, be handed off from sales, and need a kickoff call and onboarding within the first 30 days. Others will be looking for their first quarterly review to understand if the onboarding took or if additional training or workshops are needed. Some will hit their mid-year review or 9-month review, negotiate a renewal, or come with an additional budget to upsell. Ideally, an account stays for many years and upgrades to more of your products over time. + +At first glance, selecting model contexts for this example seems simple: daily, monthly, and quarterly. On a daily basis, you might be able to use email drip campaigns and in-product interventions to drive behavior. On a monthly basis your rep might want a list of underperforming accounts to call and ask questions. And quarterly you could prioritize bigger interventions such as onsite quarterly reviews or live training. However, a subscription may grant access to products for a certain number of users or amount of metered resources. When should usage actually begin: the start of the month, the kickoff call, after cohorts attend a webinar, when each user logs in and completes an onboarding? And are the boundaries of the subscription the only context we care about? Consider the impact of fiscal quarters versus calendar quarters, new product launches, and budget cycles, and the importance of ensuring there is a line item in that budget to renew and a margin to upsell. What additional contexts may impact budgets: pandemics, stock valuations, hiring, attrition, layoffs, elections, protests, competitor pricing; the list goes on. + +In the above example, we can see how changing the model’s context allows for identifying better leading indicators and predicting renewal more accurately, sooner, in order to enable the right intervention to change user behavior and increase the likelihood of renewal without wasting resources. However, for digital consumer businesses with self-service subscription management, adjusting the model’s context to real-time predictions is no longer an optional optimization.
With every interaction with your product (and with others), your user base is making a decision about their experience with your product. This is why personalized experiences directly impact subscription businesses. Still, in these scenarios, your users want you to succeed. Let’s take a few more examples in different scenarios to explore further how model context impacts behavior in adversarial environments and in zero-commitment situations. + +## Detecting Fraud + +Fraud occurs in a wide range of industries, including banking, insurance, medical, government, the public sector, and law enforcement. In all of these environments, detecting and rooting out fraud is critical to ensuring a business’ continued viability. ML models are frequently used to identify potential instances of fraud and suggest the best courses of subsequent action. + +Consider a scenario in which a bank is looking to implement ML-based models specialized to a variety of tasks, such as the detection of money laundering, fraudulent banking claims such as a falsely labeled chargeback, payment and transaction fraud like stolen credit card purchases, take-over fraud in which a nefarious actor seizes control of another person’s account, and account fraud linked to new accounts created with a fake identity. + +As compared to the previous example, in which we presume that the end user you’re making a prediction about is who they say they are and _wants_ to use your product, here any purchase or merchant flagged as fraudulent is attempting to hide from you and is actively trying to learn what is giving their behavior away. In this adversarial environment, the model’s context impacts your fake customers’ and actual customers’ behavior in different ways. As an example, consider an account takeover scenario, where a hacker obtains access to your banking information using the credentials to your existing account. Once they’re logged in, they have access to your previous purchase behavior and account statements as well as the ability to request new credit cards, add authorized users on your existing cards, transfer their outstanding debt to your card, and even instantly transfer your money to other accounts. + +In this instance, when should our model be making predictions that the user is in fact who they say they are? Furthermore, how does the model context impact the interventions we can design to prevent fraud, and how do those interventions impact the behavior of our real account holder and potential hackers? Certainly there are a few obvious model contexts: at the time of an attempted login, when opening a new line of credit, or at the point of any account-altering activity such as changing the account password or updating the account address. These moments allow us to re-verify identity, potentially in a way that’s different or more thorough than typical. The behavioral impact of these verifications _may_ be to teach the hacker what is needed next time, to reassure the real user that they are safe, and even to predict the fraudulent activity that is later verified. However, remember that we’re in an adversarial environment: we must imagine that the hacker _can get_ access to the information, an approved device, or an IP address, or can spoof a phone number. This leads us to consider a set of model contexts that do not reveal desirable information to fraudulent actors. Such contexts might occur during normal session activity, in real-time, or on clicks, hovers, or highlighting of text.
These real-time, in-session model contexts allow for designing interventions that do not expose what was detected. + +## Mobile Gaming + +Similarly, the free mobile gaming industry provides another compelling example of the importance of context. Player engagement, retention, monetization, and the cross-marketing of games within a subscription or label are the primary focuses of most gaming companies. For instance, a data scientist might want to predict whether a player is likely to win a level during gameplay, and based on that information trigger an offer to purchase a power-up at a suitable price. Key contexts in this setting might be after every move, when the user has lost a certain number of lives, when they have a certain amount of time remaining to complete a level, or when their in-game health is running dangerously low. By treating moments of heightened player need as opportunities for purchase enticement, changes to gameplay difficulty, or hints, gaming companies may very well boost player engagement and monetization while improving the overall gameplay experience. However, learning these optimal offer points is far from simple in such dynamic environments and requires iterating over many different model contexts over time. Such large-scale testing is only possible when it is natively supported, and feature engines such as Kaskada are the only tools that fully support this level of experimentation. + +The gaming example is different from the previous two for a few reasons. First, you’re changing the game as individuals play, altering the logged activity data and the outcomes that are possible to use for training in the future, complicating how new training datasets need to be generated. Second, there are no commitments; it’s possible that half of your players leave within minutes of gameplay, or after months, with no built-in way to know when they might come back. Finally, with the changes in the privacy arena, less and less information is known about a player before they begin playing; therefore, relying on in-game or cross-game user and population-level activity data becomes critical. + +## Conclusion + +For real-time ML, choosing the appropriate model context to incorporate into your model’s training is an involved process that is made lengthy and cumbersome by traditional feature stores and data science tools. Feature engines such as Kaskada transform this previously ponderous process into one that’s quick, easy, and replicable. Via intuitive model context selection, Kaskada allows iteration over a multitude of context choices in a fraction of the time required by other systems. In doing so, it enables data scientists and engineers to make optimal decisions when choosing how to train their models, ultimately leading to improved model performance and unmatched business growth.
\ No newline at end of file diff --git a/python/docs/blog/posts/2022-12-15-kaskada-understands-time.qmd b/python/docs/blog/posts/2022-12-15-kaskada-understands-time.qmd new file mode 100644 index 000000000..a478e31af --- /dev/null +++ b/python/docs/blog/posts/2022-12-15-kaskada-understands-time.qmd @@ -0,0 +1,159 @@ +--- +title: The Kaskada Feature Engine Understands Time +image: /_static/images/blog/kaskada-understands-time_banner.png +author: "" +date: "2022-12-15" +draft: true +--- + +This is the second part in our “How data science and machine learning toolsets benefit from Kaskada’s time-centric design” series, illustrating how data science and machine learning (ML) toolsets benefit from Kaskada’s time-centric design, and specifically how Kaskada’s feature engineering language, FENL, makes defining and calculating features on event-based data simpler and more maintainable than the equivalents in the most common languages for building features on data, such as SQL and Python Pandas. + +Part 1 of this series — **_Kaskada Was Built for Event-Based Data_** — gives an introduction to the challenges of working with event-based data, and how Kaskada makes defining features straight-forward. + +## What it Means to Understand Time + +Events, by definition, include some concept of time: when did the event happen? Therefore, processing of event data almost always treats time as a very important attribute of the data—and often as one of the two most important aspects of the data, together with the entity we need information about. + +When a time variable is important in data processing or feature engineering, we often do three things: + +1. Index our data store along a time dimension + +2. Partition the data by the entity we want to understand or make predictions about + +3. Do lots of aggregations and/or JOINs on a time dimension + + +When storing data, indexing and partitioning are crucial for maintaining computational efficiency. In very large datasets, locating data for a particular point or narrow range in time and entity can be incredibly inefficient if the data is not indexed and partitioned correctly. So, when time matters, we make sure to index and store the data in a way that makes it easy to access the correct data from the correct time, and we partition the data to improve parallelization. + +When processing event data for the purposes of aggregation or for building features for machine learning, item (3) above is very often necessary—we aggregate and JOIN (in the SQL sense) along a time dimension in order to know the status of things at given points in time and to summarize event activity over time intervals. In SQL, as well as other tools that use SQL-like syntax and computation (such as many pandas dataframe operations), point-in-time statuses might require a rolling window function, while summarizing time intervals would happen via group aggregations on a time bucket column, e.g. date or month. + +Alternatively, with more functional or Pythonic feature calculation, as opposed to SQL-like operations, we need to write our own code that explicitly loops through time (as in a FOR or WHILE loop), caching, aggregating, and calculating as specified in a function definition. + +With Kaskada’s feature engine and FENL, users don’t need to explicitly loop through time as in the functional paradigm, and they also don’t need to worry about time-based JOIN logic as in the SQL paradigm.
+With Kaskada, processing data and calculating features over time is the default behavior, and the engine understands the basic units of time—such as hour, day, and month—without needing to calculate them every time, fill in missing rows, or write any JOIN logic at all for many of the most common aggregations on time and entity.
+
+In short, FENL provides the abstractions, and Kaskada provides the calculation engine, that together handle time-based feature calculations automatically, efficiently, and by default in the most common, natural way.
+
+## Without understanding time, it is just another numeric variable
+
+Time is an ordered, numeric quantity, which is particularly obvious when it appears in a Unix/epoch format—seconds since midnight UTC on January 1st, 1970 form a real number that increases into the future and decreases into the past. However, time has some properties that many other numeric quantities do not:
+
+1. Time is generally linear—a second in the future is equivalent to a second in the past, and rarely do we use time itself on an exponential or logarithmic scale, though many variables depend on time in non-linear ways.
+
+2. Periods of time exist whether we have data for them or not—the hours and seconds during holidays, lunch breaks, and system downtime still appear on a clock somewhere, even if no data was produced during that time.
+
+3. For event-based data, time is an implicit attribute—events have at least a relative time of occurrence, by definition, and without it an event can lose most or all of its context and meaning.
+
+
+These properties of time provide opportunities for feature engines to make some starting assumptions and to build both syntactic and computational optimizations for calculating feature values over time variables. The result is a purpose-built machine that is largely superior to general-purpose machines when used for its intended purposes, but which may be inferior for use cases beyond its design. Because transformations and calculations on time variables are very common, there are lots of practical opportunities for the Kaskada feature engine and FENL to improve upon general-purpose tools like SQL- and python-based feature stores.
+
+Below, we illustrate some of the most often seen benefits of the natural understanding and optimization of time that is built into Kaskada and FENL.
+
+## Aggregating data into time intervals
+
+Some common data products that are generated from time-centric data involve designating a set of time intervals or buckets, and aggregating data in some way into these buckets. Hourly, daily, or monthly reporting involves—by definition—aggregating data by hour, day, or month. Data visualization over time also typically requires some aggregation into time buckets.
+
+In order to make use of them, time buckets need to be defined somewhere in the code, whether SQL, python/pandas, or FENL. In SQL and python, we have some concise ways to extract the hour, day, or month from date-time objects, and also to calculate those from Unix/epoch seconds—probably involving a couple of function calls for date-time truncation or object type conversion. In FENL, we can simply pass the standard functions `hourly()`, `daily()`, or `monthly()` as parameters to any relevant feature definition, and the Kaskada feature engine handles these time intervals in the natural way. See the [section on Time Functions in the Kaskada Catalog](https://kaskada.io/docs-site/kaskada/main/fenl/catalog.html#time-functions) for more details.
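+
+For comparison, the truncation step in pandas might look like the following minimal sketch (the dataframe and column names are assumed for illustration):
+
+```python
+import pandas as pd
+
+# assume an events dataframe with a datetime column named "event_at"
+events = pd.DataFrame(
+    {"event_at": pd.to_datetime(["2022-12-15 09:12", "2022-12-15 10:47"])}
+)
+
+# truncate each timestamp to the hour, day, and month it falls in
+events["event_hour_start"] = events["event_at"].dt.floor("h")
+events["event_day"] = events["event_at"].dt.normalize()
+events["event_month"] = events["event_at"].dt.to_period("M").dt.to_timestamp()
+```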
+
+When performing the actual aggregations by time bucket, SQL or pandas would likely use a GROUP BY statement, which might look like this:
+
+```
+select
+  entity_id
+  , event_hour_start
+  , count(*) as event_count_hourly
+from
+  events_table
+group by
+  1, 2
+```
+
+Non-pandas python would probably use a more manual strategy, like looping through the calculated time buckets.
+
+FENL’s syntax for the same would look something like this:
+
+```fenl
+{
+  entity_id,
+  timestamp,
+  event_count_hourly: EventsTable | count(window=since(hourly())),
+}
+```
+
+While I could argue that FENL’s syntax is somewhat more concise here, it’s more important to point out that, if we wanted both the hourly and the daily event counts, SQL would require two separate CTEs and GROUP BYs (roughly double the amount of SQL code above), whereas FENL can do both in the same code block. Because FENL understands time, features are defined independently, like so:
+
+```fenl
+{
+  entity_id,
+  timestamp,
+  event_count_hourly: EventsTable | count(window=since(hourly())),
+  event_count_daily: EventsTable | count(window=since(daily())),
+}
+```
+
+Notably, using these `hourly()` and `daily()` functions in FENL will automatically include all of the time intervals, hours and days, within the time range of the query, regardless of whether there is data in the time intervals or not. SQL, on the other hand, does not automatically produce a row for a time interval if there was no data in the input corresponding to that time interval.
+
+## Calculating features over rolling windows of time
+
+Both SQL and python have date-time objects that are good at representing and manipulating points in time, but these cannot concisely express calculations over the passage of time. Rolling window calculations are a good example of this.
+
+Let’s say we want to calculate a rolling average of daily event counts over the last 30 days. It seems pretty simple in concept: every day, take the 30 event counts from the past 30 days and average them. But we have to take care to avoid some common problems. For example, what if a day has no events? Is there a row/entry in the input dataset containing a zero for that day, or is that day simply missing? To answer these questions, we must look upstream in our data pipeline to see how we are generating the inputs to this calculation.
+
+In SQL, the most common way to calculate a 30-day rolling average is to use a window function over a table where each day has its own row, which might look something like this:
+
+```
+select
+  entity_id
+  , date
+  , avg(event_count) over (
+      partition by entity_id
+      order by date rows between 29 preceding and current row
+    ) as avg_30day_event_count
+from
+  events_table
+```
+
+Sometimes, when sourcing clean, complete, processed data, this type of SQL window function will work as-is. However, window functions require all rows to represent equal units of time, and there can be no missing rows. So, if we can’t guarantee that both of these are true, we need to aggregate to rows of daily values and fill in missing rows with zeros before the query above will work correctly.
+
+We can aggregate to daily rows as in the previous section above. Filling in missing rows requires us to produce a list of all of the dates we are concerned with, independent of our data, to which we then JOIN our aggregated data, filling in zeros where the aggregated data had missing rows.
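+
+For comparison, a rough pandas sketch of the same daily aggregation, zero-filling, and rolling average might look like this (the dataframe and column names are assumed for illustration):
+
+```python
+import pandas as pd
+
+# assume an events dataframe with "entity_id" and a datetime column "event_at"
+daily_counts = (
+    events.set_index("event_at")
+          .groupby("entity_id")
+          .resample("D")   # one row per entity per day between first/last event
+          .size()          # daily event counts, with zeros for inactive days
+)
+
+avg_30day = (
+    daily_counts.groupby(level="entity_id")
+                .rolling(30, min_periods=1)
+                .mean()    # 30-day rolling average of daily counts per entity
+)
+```
+
+Note that `resample` only fills gaps between each entity’s first and last event; in SQL, the equivalent list of dates must be built explicitly.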
+Such a list of dates within a time range is often called a date spine, and could be generated by a recursive CTE as in:
+
+```
+/* a recursive CTE generating a single `event_day_end` column that increments by 1 day */
+with recursive day_list(event_day_end) as (
+  /* start at the first day in the data */
+  values( (select min(event_day_end) from events_table) )
+  union all
+  select
+    datetime(strftime('%Y-%m-%d 00:00:00', event_day_end), '+1 day')
+  from
+    day_list
+  limit 1000 /* just needs to be larger than the expected number of rows/days */
+)
+```
+
+We would then JOIN our table/CTE of daily aggregations to this date spine, and rows that were formerly missing would now be present and filled with zeros.
+
+To reiterate, to use SQL to get 30-day rolling average event counts, we would:
+
+1. Aggregate the event data by date
+
+2. Build a date spine
+
+3. JOIN the daily aggregation to the date spine
+
+4. Apply a rolling window function
+
+
+Each of these steps would be one CTE, more or less, though you could of course combine them into more compact but less readable queries. There are other ways to do this in SQL that avoid window functions, but they still involve most or all of the other steps, plus additional ones.
+
+In FENL, we don’t need a date spine or a JOIN, and we can apply the aggregation and the rolling window using functions in the individual feature definition, as in:
+
+```fenl
+{
+  entity_id,
+  timestamp,
+  avg_30day_event_count: EventsTable
+    | count(window=since(daily()))
+    | mean(window=sliding(30, daily()))
+}
+```
+
+See our [FENL vs SQL vs Python Pandas Colab notebook](https://colab.research.google.com/drive/1Wg02zrxrJI_EEN8sAtoEXsRM7u8oDdBw?usp=sharing) for more detailed code examples. More information and example notebooks are available on [Kaskada’s documentation pages](https://kaskada.io/guide).
\ No newline at end of file
diff --git a/python/docs/blog/posts/2022-12-20-stores-vs-engines.qmd b/python/docs/blog/posts/2022-12-20-stores-vs-engines.qmd
new file mode 100644
index 000000000..b61c6a7a2
--- /dev/null
+++ b/python/docs/blog/posts/2022-12-20-stores-vs-engines.qmd
@@ -0,0 +1,41 @@
+---
+title: Feature Stores vs. Feature Engines
+image: /_static/images/blog/stores-vs-engines_banner.png
+author: ""
+date: "2022-12-20"
+draft: true
+---
+
+## Introduction
+
+In the ML world, feature stores are all the rage. They bring the power of MLOps to feature orchestration by providing a scalable architecture that stores and computes features from raw data and serves those features within a production environment. Feature stores may automate many of the more tedious tasks associated with feature handling, such as versioning and monitoring features in production for drift. This sort of automation of the feature computation and deployment process makes it easy to get from experimentation to production and to reuse feature definitions, allowing data scientists to hand off production-ready features to development teams for their machine learning models. What’s not to like?
+
+Unfortunately, there are a few ways in which feature stores fall flat. For one, they do not enable the iteration and discovery of new features that are essential to the feature engineering process itself. From the moment ML models are trained, they begin to age and deteriorate. It is impractical to expect that a data scientist can simply find all the features that will be needed for all problems for all time—the only constant is change.
+A few example scenarios include: modeling data that changes over time, modeling behavior in the context of changing environments, or predicting outcomes that depend on human decisions. Feature stores also inherit the limitations of the feature engines they interact with, so choosing the correct feature engine is vital.
+
+Behavior, human decision-making, and changing data over time necessitate a new paradigm for dealing with time and event-based data. Enter the feature engine. Working with time fundamentally alters the feature engineering loop, requiring that it include the ability to select what to compute and when. Relative event times, features, prediction times, labels, and label times are all important. The feature authoring and selection processes need the ability to look back in time for predictor features and forward in time for labels. A feature engine provides the right abstractions, computational model, and integrations to enable data scientists to iterate during experimentation and bridge the gap to production.
+
+## Feature Engines
+
+Feature engines can be integrated with specialized [feature stores](https://www.featurestore.org/) in order to provide additional functionality. They enable all of the useful elements we’ve come to expect from MLOps, such as feature versioning, logging, and feature serving, as well as computing features and providing useful metadata around each computation. However, feature engines take things a step further by providing an intuitive way to perform event-based feature calculations forward and backward in time, giving engineers and data scientists a new and efficient way to isolate and iterate upon important features and serve those features in production. Feature engines derive their name from the powerful engine they provide, by which features can be iteratively recomputed in a real-time, dynamic environment. They leverage powerful abstractions that make designing time- and event-based features intuitive and easy, allowing engineers and data scientists to save valuable time by doing away with complicated and error-prone queries, backfill jobs, and feature pipelines that often need to be rewritten to meet production requirements.
+
+### How They Work
+
+![](https://images.ctfassets.net/fkvz3lhe2g1w/1dn4JZzpP6TqbP05M2GQhx/ece9ef681a8a1e65b5c71134de5fcc4f/Screen_Shot_2022-05-16_at_3.18.47_PM.png)
+
+The fundamental unit (data object) in a feature engine is the event stream, grouped by an entity. Feature engines consume and produce these event streams, or timelines. Timelines provide a way to capture a given feature’s value as it changes over time. They can be created, combined, and operated on in almost any conceivable way, allowing data scientists to apply arbitrary functions to the sequences they hold.
+
+This means that different time-based features, or the same feature across different entities (like objects/users), can be readily compared. Timelines are also efficiently implemented under the hood, meaning that these sequential features can be incrementally updated in response to new data without having to be recomputed across the entire data lineage. This saves time and valuable compute power, reduces costly errors, and vastly speeds up iteration. Simultaneously, feature engines produce the current feature values needed for storage in feature stores. In fact, the feature values stored in a feature store are just the timeline’s last values, updated as new data arrives.
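+
+As a toy Python sketch of that last point (purely illustrative, and not Kaskada’s actual implementation), a timeline can be pictured as a per-entity running aggregate whose most recent value is exactly what a feature store serves:
+
+```python
+from collections import defaultdict
+
+# latest cumulative purchase total per entity -- the timeline's "last value"
+running_total = defaultdict(float)
+
+def on_purchase(entity_id, amount):
+    running_total[entity_id] += amount  # the timeline advances with each event
+    return running_total[entity_id]     # the value a feature store would serve
+```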
+
+### A Mobile Gaming Use Case
+
+Mobile gaming companies understand that in their fast-paced industry, traditional, static feature stores are not enough. In today’s world, where new free games are constantly being built with new game mechanics, it can be difficult to attract, engage, and retain players for more than a day. In fact, the mobile gaming industry as a whole suffers from abysmal player retention rates. For many companies, only 24% of players come back the next day, and as many as 94% of players will leave within the first month. The reasons they abandon a game are often unknown.
+
+Thus, building and deploying ML models that can predict what will cause a player to leave, with enough time to intervene, is key for these businesses to [maximize their ROI](https://schedule.gdconf.com/session/boost-engagement-in-your-game-by-supercharging-liveops-with-time-based-machine-learning-presented-by-kaskada/886577?_mc=we_gdcmc_3pvr_x_un_x_LWT_London_2022). However, games are highly dynamic environments in which players may receive and respond to hundreds of treatments every single minute. With this deluge of time-based data, how can gaming companies reduce the noise and discover which features are predictive of player frustration and engagement?
+
+A feature engine makes this once impossible task easy. It can effortlessly connect to massive quantities of event-based data to generate behavioral features such as a player’s number of attempts compared to their cohort, their win-loss rate over time, and the time between purchases of in-game power-ups. Feature engines can then combine behavioral features with helpful context features such as the device’s operating system, game version, current location, and connection status and latency. This potent combination of behavior and context features is what allows feature engines to excel in highly dynamic business environments and sets them apart from basic feature store counterparts.
+
+In one partnership with a popular gaming company, Kaskada’s feature engine enabled thousands of feature iterations in just 2.5 weeks, ultimately leading to eight production machine learning models that successfully predicted player survival at four key points for the business. This was a dramatic improvement over the planned 12-month model development roadmap.
+
+## Conclusion
+
+Feature stores are great for certain production requirements, but in many business contexts, they’re not enough. Instead, they are best used once you have already discovered which features are important. Any business that processes event-based data could hugely benefit from incorporating a feature engine into its machine learning pipeline. This includes businesses needing to accurately predict LTV, learn segments, predict behavior or decision-making, or personalize products. Feature engines provide the powerful abstractions and computational models necessary to handle this data, to compute and identify features, and to rapidly iterate in ways that can dramatically increase engagement and decrease churn.
\ No newline at end of file
diff --git a/python/docs/blog/posts/2023-03-28-announcing-kaskada-oss.qmd b/python/docs/blog/posts/2023-03-28-announcing-kaskada-oss.qmd
index d925cf241..0d9d1dc79 100644
--- a/python/docs/blog/posts/2023-03-28-announcing-kaskada-oss.qmd
+++ b/python/docs/blog/posts/2023-03-28-announcing-kaskada-oss.qmd
@@ -5,6 +5,7 @@ categories:
 - releases
 title: Announcing Kaskada OSS
 subtitle: From Startup to Open Source Project
+image: /_static/images/blog/announcing-kaskada-oss_banner.png
 ---
 
 # Announcing Kaskada OSS
diff --git a/python/docs/blog/posts/2023-04-06-event-processing-and-time-calculations.qmd b/python/docs/blog/posts/2023-04-06-event-processing-and-time-calculations.qmd
new file mode 100644
index 000000000..562836a66
--- /dev/null
+++ b/python/docs/blog/posts/2023-04-06-event-processing-and-time-calculations.qmd
@@ -0,0 +1,256 @@
+---
+title: "Kaskada for Event Processing and Time-centric Calculations: Ecommerce and Beyond"
+image: /_static/images/blog/event-processing-and-time-calculations_banner.png
+author: Brian Godsey
+date: "2023-04-06"
+draft: true
+---
+
+
+Kaskada was built to process and perform temporal calculations on event streams, with real-time analytics and machine learning in mind. It is not exclusively for real-time applications, but Kaskada excels at time-centric computations and aggregations on event-based data.
+
+For example, let's say you're building a user analytics dashboard at an ecommerce retailer. You have event streams showing all actions the user has taken, and you'd like to include in the dashboard:
+
+* the total number of events the user has ever generated
+* the total number of purchases the user has made
+* the total revenue from the user
+* the number of purchases made by the user today
+* the total revenue from the user today
+* the number of events the user has generated in the past hour
+
+Because the calculations needed here are a mix of hourly, daily, and all-of-history aggregations, more than one type of event aggregation needs to happen. Table-centric tools like those based on SQL would require multiple `JOINs` and window functions, which would be spread over multiple queries or code blocks (CTEs).
+
+Kaskada was designed for these types of time-centric calculations, so we can do each of the calculations in the list in one line (with some text wrapping):
+
+```
+event_count_total: DemoEvents | count(),
+purchases_total_count: DemoEvents | when(DemoEvents.event_name == 'purchase')
+  | count(),
+revenue_total: DemoEvents.revenue | sum(),
+purchases_daily: DemoEvents | when(DemoEvents.event_name == 'purchase')
+  | count(window=since(daily())),
+revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
+event_count_hourly: DemoEvents | count(window=since(hourly())),
+```
+
+
+Of course, a few more lines of code are needed to put these calculations to work, but these six lines are all that is needed to specify the calculations themselves. Each line may specify:
+
+* the name of a calculation (e.g. `event_count_total`)
+* the input data to start with (e.g. `DemoEvents`)
+* selecting event fields (e.g. `DemoEvents.revenue`)
+* function calls (e.g. `count()`)
+* event filtering (e.g. `when(DemoEvents.event_name == 'purchase')`)
+* time windows to calculate over (e.g. `window=since(daily())`)
+
+...with consecutive steps separated by a familiar pipe (`|`) notation.
+
+Because Kaskada was built for time-centric calculations on event-based data, a calculation we might describe as "total number of purchase events for the user" can be defined in Kaskada in roughly the same number of terms as the verbal description itself.
+
+See [the Kaskada documentation](https://kaskada.io/docs-site/kaskada/main/installing.html) for lots more information.
+
+
+## Installation
+
+Kaskada is published alongside the most familiar python packages, so installation is as simple as:
+
+```
+pip install kaskada
+```
+
+
+## Example dataset
+
+The demo uses a very small example data set, but you can load your own event data from many common sources, including any pandas dataframe. See [the Loading Data documentation](https://kaskada.io/docs-site/kaskada/main/loading-data.html) for more information.
+
+
+## Define queries and calculations
+
+The Kaskada query language is parsed by the fenl extension, and query calculations are defined in code blocks starting with `%%fenl`.
+
+See [the fenl documentation](https://kaskada.io/docs-site/kaskada/main/fenl/fenl-quick-start.html) for more information.
+
+A simple query for events with a specific entity ID looks like this:
+
+```
+%%fenl
+
+DemoEvents | when(DemoEvents.entity_id == 'user_002')
+```
+
+
+When using the pipe notation, we can use `$input` to represent the thing being piped to a subsequent step, as in:
+
+```
+%%fenl
+
+DemoEvents | when($input.entity_id == 'user_002')
+```
+
+
+Beyond querying for events, Kaskada has a powerful syntax for defining calculations on events, temporally across the event history.
+
+The six calculations discussed at the top of this demo can be written as follows:
+
+```
+%%fenl
+
+{
+  event_count_total: DemoEvents | count(),
+  event_count_hourly: DemoEvents | count(window=since(hourly())),
+  purchases_total_count: DemoEvents
+    | when(DemoEvents.event_name == 'purchase') | count(),
+  purchases_daily: DemoEvents
+    | when(DemoEvents.event_name == 'purchase') | count(window=since(daily())),
+  revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
+  revenue_total: DemoEvents.revenue | sum(),
+}
+| when(hourly()) # each row in the output represents one hour of time
+```
+
+
+### Trailing when() clause
+
+A key feature of Kaskada's time-centric design is the ability to query for calculation values at any point in time. Traditional query languages (e.g. SQL) can only return data that already exists---if we want to return a row of computed/aggregated data, we have to compute the row first, then return it. As a specific example, suppose we have SQL queries that produce daily aggregations over event data, and now we want the same aggregations on an hourly basis. In SQL, we would need to write new queries for hourly aggregations; the queries would look very similar to the daily ones, but they would still be different queries.
+
+With Kaskada, we can define the calculations once, and then separately specify the points in time at which we want to know the calculation values.
+
+Note the final line in the above query:
+
+```
+| when(hourly())
+```
+
+We call this a "trailing `when()`" clause, and its purpose is to specify the time points you would like to see in the query results.
+
+Regardless of the time cadence of the calculations themselves, the query output can contain rows for whatever timepoints you specify. You can define a set of daily calculations and then get hourly updates during the day.
+Or, you can publish a set of calculations in a query view (see below), and different users can query those same calculations for hourly, daily, and monthly values---without editing the calculation definitions themselves.
+
+
+### Adding more calculations to the query
+
+We can add two new calculations, also in one line each, representing:
+
+* the time of the user's first event
+* the time of the user's last event
+
+We can also add the parameter `--var event_calculations` to save the results into a python object called `event_calculations` that can be used in subsequent python code.
+
+
+```
+%%fenl --var event_calculations
+
+{
+  event_count_total: DemoEvents | count(),
+  event_count_hourly: DemoEvents | count(window=since(hourly())),
+  purchases_total_count: DemoEvents
+    | when(DemoEvents.event_name == 'purchase') | count(),
+  purchases_daily: DemoEvents
+    | when(DemoEvents.event_name == 'purchase')
+    | count(window=since(daily())),
+  revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
+  revenue_total: DemoEvents.revenue | sum(),
+
+  first_event_at: DemoEvents.event_at | first(),
+  last_event_at: DemoEvents.event_at | last(),
+}
+| when(hourly())
+```
+
+
+This creates the python object `event_calculations`, which has an attribute called `dataframe` that can be used like any other dataframe, for data exploration, visualization, analytics, or machine learning.
+
+```
+# access results as a pandas dataframe
+event_calculations.dataframe
+```
+
+
+This is only a small sample of possible Kaskada queries and capabilities. See [the fenl catalog](https://kaskada.io/docs-site/kaskada/main/fenl/catalog.html) for a full list of functions and operators.
+
+
+## Publish Query Calculation Definitions as Views
+
+The definitions of your query calculations can be published in Kaskada and used elsewhere, including in other Kaskada queries.
+
+```
+from kaskada import view as kview
+
+kview.create_view(
+    view_name = "DemoFeatures",
+    expression = event_calculations.expression,
+)
+
+# list views with a search term
+kview.list_views(search = "DemoFeatures")
+```
+
+We can query a published view just like we would any other dataset.
+
+```
+%%fenl
+
+DemoFeatures | when($input.revenue_daily > 0)
+```
+
+## Try it out and tell us what you think!
+
+The content of this blog post is based on the public notebook at the link: [Kaskada Demo for Event Processing and Time-centric Calculations](https://github.com/kaskada-ai/kaskada/blob/main/examples/Kaskada%20Demo%20for%20Event%20Processing%20and%20Time-centric%20Calculations.ipynb). Kaskada is a brand new open source project, which makes your early feedback exceptionally important to us.
+
+We think that Kaskada could be tremendously useful for anyone trying to turn event data into analytics, ML features, real-time stats of all kinds, and insight in general. Have a look at the notebook, and [let us know what you like and don't like about it](https://github.com/kaskada-ai/kaskada/discussions)!
diff --git a/python/docs/blog/posts/2023-05-09-introducing-timelines-part-1.qmd b/python/docs/blog/posts/2023-05-09-introducing-timelines-part-1.qmd
new file mode 100644
index 000000000..3cda678e0
--- /dev/null
+++ b/python/docs/blog/posts/2023-05-09-introducing-timelines-part-1.qmd
@@ -0,0 +1,181 @@
+---
+title: "Introducing Timelines"
+subtitle: "Part 1: an Evolution in Stream Processing Abstraction"
+image: /_static/images/blog/introducing-timelines/part-1_banner.png
+author: Ben Chambers and Therapon Skoteiniotis
+date: 2023-May-09
+draft: true
+---
+
+The last decade has seen major advances in data analytics, machine learning, and AI, thanks in part to the development of technologies like Apache Spark, Apache Flink, and KSQL for processing streaming data. Unfortunately, streaming queries remain difficult to create, comprehend, and maintain -- you must either use complex low-level APIs or work around the limitations of query languages like SQL that were designed to solve a very different problem. To tackle these challenges, we introduce a new abstraction for time-based data, called _timelines_.
+
+Timelines organize data by time and entity, offering an ideal structure for event-based data with an intuitive, graphical mental model. Timelines simplify reasoning about time by aligning your mental model with the problem domain, allowing you to focus on _what_ to compute, rather than _how_ to express it.
+
+In this post, we present timelines as a natural development in the progression of stream processing abstractions. We delve into the timeline concept, its interaction with external data stores (inputs and outputs), and the lifecycle of queries utilizing timelines.
+
+This post is the first in a series introducing Kaskada, an open-source event processing engine designed around the timeline abstraction. The full series will include (1) this introduction to the timeline abstraction, (2) [how Kaskada builds an expressive temporal query language on the timeline abstraction][timelines_part2], (3) how the timeline data model enables Kaskada to efficiently execute temporal queries, and (4) how timelines allow Kaskada to execute incrementally over event streams.
+
+## Rising Abstractions in Stream Processing
+
+Since 2011, alongside the rise of distributed streaming, there has been a trend towards higher abstractions for operating on events. In the same year that [Apache Kafka][kafka] introduced its distributed stream, [Apache Storm][storm] emerged with a structured graph representation for stream consumers, enabling the connection of multiple processing steps. Later, in 2013, [Apache Samza][samza] provided fault-tolerant, stateful operations over streams, and the [MillWheel paper][millwheel] introduced the concept of watermarks to handle late data. Between 2016 and 2017, [Apache Flink][flink], [Apache Spark][spark], and [KSQL][ksql] implemented the ability to execute declarative queries over streams by supporting SQL.
+
+![History of Streaming Abstractions, showing Apache Kafka, Apache Storm, Apache Samza, MillWheel, Apache Flink, Apache Spark and KSQL][stream_abstraction_history]
+
+Each of these developments enhanced the abstractions used when working with streams -- improving aspects such as composition, state, and late-data handling. The shift towards higher-level, declarative queries significantly simplified stream operations and the management of compute systems. However, expressing time-based queries still poses a challenge.
+
+As the next step in advancing abstractions for event processing, we introduce the timeline abstraction.
+
+## The Timeline Abstraction
+
+Reasoning about time -- for instance, cause-and-effect between events -- requires more than just an unordered set of event data. For temporal queries, we need to include time as a first-class part of the abstraction. This allows reasoning about when an event happened and the ordering -- and time -- between events.
+
+Kaskada is built on the _timeline_ abstraction: a multiset ordered by time and grouped by entity.
+
+Timelines have a natural visualization, shown below. Time is shown on the x-axis and the corresponding values on the y-axis. Consider purchase events from two people: Ben and Davor. These are shown as discrete points reflecting the time and amount of each purchase. We call these _discrete timelines_ because they represent discrete points.
+
+![A plot showing time on the x-axis and values on the y-axis, with discrete points.][discrete]
+
+The time axis of a timeline reflects the time of a computation’s result. For example, at any point in time we might ask “what is the sum of all purchases?”. Aggregations over timelines are cumulative -- as events are observed, the answer to the question changes. We call these _continuous timelines_ because each value continues until the next change.
+
+![A plot showing time on the x-axis and values on the y-axis, with steps showing continuous values.][continuous]
+
+Compared to SQL, timelines introduce two requirements: **ordering by time and grouping by entity**. While the SQL _relation_ -- an unordered multiset or bag -- is useful for unordered data, the additional requirements of timelines make them ideal for reasoning about cause-and-effect. Timelines are to temporal data what relations are to static data.
+
+Adding these requirements means that timelines are not a fit for every data processing task. Instead, they allow timelines to be a _better_ fit for data processing tasks that work with events and time. In fact, most event streams (e.g., Apache Kafka, Apache Pulsar, AWS Kinesis) already provide ordering and partitioning by key.
+
+When thinking about events and time, you likely already picture something like a timeline. By matching the way you already think about time, timelines simplify reasoning about events and time. By building in the time and ordering requirements, the timeline abstraction allows temporal queries to intuitively express cause and effect.
+
+## Using Timelines for Temporal Queries
+
+Timelines are the abstraction used in Kaskada for building temporal queries, but data starts and ends outside of Kaskada. It is important to understand the flow of data from input, to timelines, and finally to output.
+
+![A diagram showing the lifecycle of temporal queries. Input is processed by a query as discrete and continuous timelines. Output is produced from the resulting timeline][lifecycle]
+
+Every query starts from one or more sources of input data. Each input -- whether it is events arriving in a stream or stored in a table, or facts stored in a table -- can be converted to a timeline without losing important context such as the time of each event.
+
+The query itself is expressed as a sequence of operations. Each operation creates a timeline from timelines. The result of the final operation is used as the result of the query. Thus, the query produces a timeline, which may be either discrete or continuous.
+
+The result of a query is a timeline, which may be output to a sink.
+The rows written to the sink may be a history reflecting the changes within a timeline, or a snapshot reflecting the values at a specific point in time.
+
+### Inputting Timelines
+
+Before a query is performed, each input is mapped to a timeline. Every input -- whether events from a stream or table, or facts in a table -- can be mapped to a timeline without losing important temporal information, such as when events happened. Events become discrete timelines, with the value(s) from each event occurring at the time of the event. Facts become continuous timelines, reflecting the time during which each fact applied. By losslessly representing all kinds of temporal inputs, timelines allow queries to focus on the computation rather than the kind of input.
+
+### Outputting Timelines
+
+After executing a query, the resulting timeline must be output to an external system for consumption. Each sink allows configuring how the data is written, with specifics depending on the sink and destination (see the [execution documentation](/reference/Timestream/Execution)).
+
+There are several options for converting the timeline into rows of data, affecting the number of rows produced:
+
+1. Include the entire history of changes within a range, or just a snapshot of the value at some point in time.
+2. Include only events (changes) occurring after a certain time in the output.
+3. Include only events (changes) up to a specific time in the output.
+
+A full history of changes helps visualize or identify patterns in user values over time. In contrast, a snapshot at a specific time is useful for online dashboards or classifying similar users.
+
+Including only events after a certain time reduces output size when the destination already has data up to that time or when earlier points are irrelevant. This is particularly useful when re-running a query to materialize to a data store.
+
+Including only events up to a specific time also limits output size and enables choosing a point-in-time snapshot. With incremental execution, selecting a time slightly earlier than the current time reduces the amount of late data that must be processed.
+
+Both the "changed since" and "up to" options are especially useful with incremental execution, which we will discuss in part 4.
+
+#### History
+
+The _history_ -- the set of all points in the timeline -- is useful when you care about past points. For instance, this may be necessary to visualize or identify patterns in how the values for each user change over time.
+
+![Diagram showing the conversion of all points in a timeline into a sequence of change events.][timeline_history]
+
+Any timeline may be output as a history. For a discrete timeline, the history is the collection of events in the timeline. For a continuous timeline, the history contains the points at which a value changes -- it is effectively a changelog.
+
+#### Snapshots
+
+A _snapshot_ -- the value for each entity at a specific point in time -- is useful when you just care about the latest values. For instance, when updating a dashboard or populating a feature store.
+
+![Diagram showing the conversion of values at a point in time in a timeline to a snapshot, including interpolation.][timeline_snapshot]
+
+Any timeline may be output as a snapshot. For a discrete timeline, the snapshot includes rows for each event happening at that time. For a continuous timeline, the snapshot includes a row for each entity with that entity's value at that time.
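+
+To make the history/snapshot distinction concrete, here is a toy Python sketch (purely illustrative, not Kaskada's implementation) that derives a point-in-time snapshot from a continuous timeline's history of changes:
+
+```python
+def snapshot_at(history, at_time):
+    """Derive a snapshot from a timeline history.
+
+    `history` is a list of (time, entity, value) change events sorted by time.
+    Returns the latest value per entity as of `at_time`.
+    """
+    latest = {}
+    for t, entity, value in history:
+        if t > at_time:
+            break
+        latest[entity] = value  # continuous: each value holds until the next change
+    return latest
+```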
+
+## Conclusion
+
+This blog post highlighted the challenges associated with creating and maintaining temporal queries on event-based data streams and introduced the timeline abstraction as a solution. Timelines organize data by time and entity, providing a more suitable structure for event-based data than multisets.
+
+The timeline abstraction is a natural progression in stream processing, allowing you to reason about time and cause-and-effect relationships more effectively. We also explored the flow of data in a temporal query, from input to output, and discussed the various options for outputting timelines to external systems.
+
+Rather than applying a tabular -- static -- query to a sequence of snapshots, Kaskada operates on the history (the change stream). This makes it natural to operate on the time between snapshots, rather than only on the data contained in each snapshot. Using timelines as the primary abstraction simplifies working with event-based data and allows for seamless transitions between streams and tables.
+
+You can [get started][getting_started] experimenting with your own temporal queries today. [Join our Slack channel]({{< var slack-join-url >}}) and let us know what you think about the timeline abstraction.
+
+Check out the [next blog post in this series][timelines_part2], where we'll delve into the Kaskada query language and its capabilities for expressing temporal queries. Together, we'll continue to explore the benefits and applications of the timeline abstraction in modern data processing tasks.
+
+[timelines_part2]: 2023-05-25-introducing-timelines-part-2.html
+
+[continuous]: /_static/images/blog/introducing-timelines/continuous.png "Continuous Timeline"
+[discrete]: /_static/images/blog/introducing-timelines/discrete.png "Discrete Timeline"
+[lifecycle]: /_static/images/blog/introducing-timelines/lifecycle.png "Lifecycle of a Temporal Query"
+[stream_abstraction_history]: /_static/images/blog/introducing-timelines/stream_abstraction_history.png "History of Streaming Abstractions"
+[timeline_history]: /_static/images/blog/introducing-timelines/timeline_history.png "Timeline History"
+[timeline_snapshot]: /_static/images/blog/introducing-timelines/timeline_snapshot.png "Timeline Snapshot"
+
+[getting_started]: /guide/quickstart.html
+
+[flink]: https://flink.apache.org/
+[kafka]: https://kafka.apache.org/
+[ksql]: https://ksqldb.io/
+[samza]: https://samza.apache.org/
+[spark]: https://spark.apache.org/
+[storm]: https://storm.apache.org/
+[millwheel]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
diff --git a/python/docs/blog/posts/2023-05-25-introducing-timelines-part-2.qmd b/python/docs/blog/posts/2023-05-25-introducing-timelines-part-2.qmd
new file mode 100644
index 000000000..69598b446
--- /dev/null
+++ b/python/docs/blog/posts/2023-05-25-introducing-timelines-part-2.qmd
@@ -0,0 +1,216 @@
+---
+title: "Introducing Timelines"
+subtitle: "Part 2: Declarative, Temporal Queries"
+image: /_static/images/blog/introducing-timelines/part-2_banner.png
+author: Ben Chambers and Therapon Skoteiniotis
+date: 2023-May-25
+draft: true
+---
+
+Understanding time-based patterns in your data can unlock valuable insights, but expressing temporal queries can often be a challenging task. Imagine being able to effortlessly analyze your users' behavior over time, to perform precise temporal joins, or to examine patterns of activity between different events, all in a way that is intuitive and handles time naturally.
+This is where the concept of timelines, a high-level abstraction for working with temporal data, comes to your aid.
+
+In this blog post, we dive deep into the world of timelines. We'll demonstrate how they make expressing temporal queries on events -- and, importantly, between events -- not just easy but also intuitive. We previously [introduced the timeline][timelines_part1], a high-level abstraction for working with temporal data. Timelines organize event-based data by time and entity, making it possible to reason about the context around events. Expressing temporal queries using timelines has several benefits:
+
+- **Intuitive**: Since timelines are ordered by time, it is natural for queries to operate in order as well. As time progresses, additional events – input – occur and are reflected in the output of the query. This way of thinking about computations – as progressing through time – is intuitive because it matches the way we observe events.
+- **Declarative**: Temporal operations – like windowing and shifting – are clearly declared when working with timelines, since time is part of the abstraction.
+- **Composable**: Every operation takes timelines and produces timelines, meaning that operations can be chained as necessary to produce the intended results.
+
+In the next sections, we dissect four real-life examples demonstrating the benefits of timelines. We'll start with a simple query for an aggregation and progressively tackle more intricate temporal windows, data-dependent windows, and temporally correct joins. By the end, you'll have a deep understanding of how timelines make writing simple temporal queries as straightforward as SQL, and how they empower us to take on more challenging questions.
+
+# Total Spend: Cumulative Aggregation
+
+Timelines support everything you can do in SQL, intuitively extended to operate over time. Before looking at some new capabilities for sophisticated temporal queries, let’s look at something simple – an aggregation. Writing simple queries is easy: in fact, since timelines are ordered by time and grouped by entity, they can be even easier than in SQL!
+
+Consider the question, _how much did each user spend_? When thinking about this over events, it is natural to process the purchases in order, updating the amount each user has spent over time. The result is a _cumulative_ sum producing a _continuous timeline_.
+
+![Timeline visualization of how much each user spent][aggregation]
+
+The corresponding query is shown below, written in two equivalent ways. The first emphasizes the sum being applied to the purchases, while the second emphasizes the chain of operations we have composed – “take the purchases, then apply the sum”. In the remainder of this post we’ll use the latter, since it better matches the way we tend to think about processing timelines.
+
+```fenl
+sum(Purchases.amount)
+
+# OR
+
+Purchases.amount
+| sum()
+```
+
+Writing a simple temporal query with timelines is as easy as SQL, and processing events in order is an intuitive way of operating over time. Of course, aggregating over all events is only one way we may wish to aggregate things. In the next example, we’ll see how to extend this query using temporal windows to focus on recent events.
+
+# Monthly Spend: Temporal Windowing
+
+When thinking about temporal queries, it’s very natural to ask questions about the recent past – year-to-date or the last 30 days.
+The intuition of processing events in order suggests that answering the question “_how much has each user spent this month_” should just require resetting the value at the start of each month. And this intuition is exactly how these kinds of temporal windows work with timelines.
+
+![Timeline visualization of how much each user spent this month][windowed]
+
+The temporal query is shown below. It clearly indicates the intent we expressed above – take the purchases and aggregate them since the beginning of each month.
+
+```fenl
+Purchases.amount
+| sum(window=since(monthly()))
+```
+
+Since time is inherently part of every timeline, every aggregation is able to operate within temporal windows. In the next example we’ll see how easy it is to work with more complicated queries, including aggregations with more sophisticated windows.
+
+# Page Views Between Purchases
+
+Not all windows are defined in terms of time. It is often useful to use events to determine the windows used for an aggregation. Additionally, while all the examples so far have operated on a single type of event – `Purchases` – examining the patterns of activity between different events is critical for identifying cause-and-effect relationships. In this example we’ll take advantage of timelines to declaratively express queries using data-defined windows and multiple types of events. We’ll also filter an intermediate timeline to specific points to control the values used in later steps while composing the query.
+
+The question we’ll answer is “_what is the average number of page views between each purchase for each user_?”. We’ll first compute the page views since the last purchase, observe them at the time of each purchase, and then take the average.
+
+## Data-Defined Windowing
+
+The first thing we’ll do is compute the number of page views since the last purchase. In the previous example, we windowed since the start of the month. But there is nothing special about the timeline defining the start of a month – we can window with any other timeline.
+
+```fenl
+PageViews
+| count(window=since(is_valid(Purchases)))
+```
+
+![Timeline showing data-defined windows][data_windows_1]
+
+In addition to data-defined windowing, we see here how to work with multiple types of events. Since every timeline is ordered by time and grouped by entity, every timeline can be lined up by time and joined on entity – _automatically_.
+
+## Observing at specific times
+
+The previous step gave us the page views since the last purchase. But it was a continuous timeline that increased at each page view until the next purchase. We’re after a discrete timeline with a single value at the time of each purchase, representing the page views since the previous purchase. To do this, we use the [`when`]({% fenl_catalog when %}) operation, which allows observing – and interpolating, if needed – a timeline at specific points in time.
+
+![Observing a timeline at specific points in time using when][data_windows_2]
+
+The `when` operation can be used anywhere in a temporal query and allows filtering the points which are present in the output – or passed to later aggregations.
+
+## Nested Aggregation
+
+With the number of page views between purchases computed, we are now able to compute the average of this value. All we need to do is use the [](`kaskada.Timestream.mean`) aggregation.
+
+![Applying an outer aggregation to compute the average of observed page views][data_windows_3]
+
+## Putting it together
+
+The complete query is shown below.
+We see that the steps match the logical steps we talked through above. Even though the logic was reasonably complex, the query is relatively straightforward and captures our idea of what we want to compute – even hard questions become expressible.
+
+```fenl
+PageViews
+| count(window=since(is_valid(Purchases)))
+| when(is_valid(Purchases))
+| mean()
+```
+
+This kind of query can be generalized to analyze a variety of patterns in the page view activity. Maybe we only want to look at the page views for the most-frequently viewed item rather than all items, believing that the user is becoming more focused on that item. Maybe we want to window since purchases of the same item, rather than any purchase.
+
+This query showed some of the ways timelines enable complex temporal queries to be expressed:
+
+1. Ordering allows windows to be defined by their delimiters – when they start and end – rather than having to compute a “window ID” from each value for grouping.
+1. Ordering also allows multiple timelines to be used within the same expression – in this case `PageViews` and `Purchases`.
+1. Continuity allows values to be interpolated at arbitrary times and filtered using the `when` operation.
+1. Composability allows the result of any operation to be used with later operations to express temporal questions. This allows complex questions to be expressed as a sequence of simple operations.
+
+These capabilities allow identifying cause-and-effect patterns. While it may be that a purchase _now_ causes a user to make purchases _later_, other events often have a stronger relationship – for instance, running out of tape and buying more, or scheduling a camping trip and stocking up. Being able to look at activity (`PageViews`) within a window defined by other events (`Purchases`) is important for understanding the relationship between those events.
+
+# Minimum Review Score
+
+We’ve already seen how timelines allow working with multiple types of events associated with the same entity. But it’s often necessary to work with multiple entities as well – for instance, using information about the entire population to normalize values for each user. Our final example will show how to work with multiple entities and perform a temporal join.
+
+The final question we’ll answer is _“what is the minimum average product review (score) at the time of each purchase?”_. To do this, we’ll first work with reviews associated with each product to compute the average score, and then we’ll join each purchase with the corresponding average review.
+
+## Changing entities
+
+To start, we want to compute the average product review (score) for each item. Since the reviews are currently grouped by user, we will need to re-group them by item, using the [`with_key`]({% fenl_catalog with_key %}) operation. Once we’ve done that, we can use the `mean` aggregation we’ve already seen.
+
+![Timelines computing the per-item average score][temporal_join_1]
+
+## Lookup between entities
+
+For each purchase (grouped by user) we want to look up the average review score of the corresponding item. This uses the [`lookup`]({% fenl_catalog lookup %}) operation.
+
+![Timelines using lookup to temporally join with the item score][temporal_join_2]
+
+## Putting it together
+
+Putting it all together, we use the lookup with a [`min`]({% fenl_catalog min %}) aggregation to determine the minimum average rating of items purchased by each user.
+
+```fenl
+Reviews.score
+| with_key(Reviews.item)
+| mean()
+| lookup(Purchases.item)
+| min()
+```
+
+This pattern of re-grouping to a different entity, performing an aggregation, and looking up the value (or shifting back to the original entities) is common in data-processing tasks. In this case, the resulting value was looked up and used directly. In other cases it is useful for normalization – such as relating each user’s value to the average values in their city.
+
+Ordering and grouping allow timelines to clearly express operations between different entities. The result of a lookup is from the point in time at which the lookup is performed. This provides a temporally correct “as-of” join.
+
+Performing a join _at the correct time_ is critical for computing training examples from the past that are comparable to the feature values used when applying the model. Similarly, it ensures that any dashboard, visualization, or analysis performed on the results of the query is actually correct, reflecting the values as they were in the past rather than using information that wasn’t available at that time.
+
+# Conclusion
+
+In this post, we've demonstrated the power of timelines as a high-level abstraction for handling temporal data. Through intuitive, declarative, and composable operations, we showed how timelines enable efficient expression of temporal queries on events and between events. With examples ranging from simple aggregations to sophisticated queries like data-dependent windows and temporally correct joins, we illustrated how timeline operations can be chained to produce the intended results. The potency of timelines lies in their ability to easily express simple temporal questions and intuitively extend to complex temporal queries.
+
+From total spend to minimum review score, we walked through four illustrative examples that highlight the capabilities of timelines in temporal querying. We explored cumulative aggregation and temporal windowing, and observed how data-defined windowing offers the ability to express complex temporal questions. We also showed how timelines facilitate multi-entity handling and temporal joins. These examples show that with timelines, you have a powerful tool to identify cause-and-effect patterns and to compute training examples that remain valid when the model is later applied.
+
+In the next post in this series, we delve further into the dynamics of temporal queries on timelines. We'll explore how these queries are efficiently executed by taking advantage of the properties of timelines.
+
+We encourage you to [get started][getting_started] writing your own temporal queries today, and [join our Slack channel]({{< var slack-join-url >}}) for more discussions and insights on timelines and other data processing topics. Don't miss out on this opportunity to be a part of our growing data community. Join us now and let's grow together!
+
+[timelines_part1]: 2023-05-09-introducing-timelines-part-1.html
+[getting_started]: /guide/quickstart.html
+
+[aggregation]: /_static/images/blog/introducing-timelines/aggregation.svg "Timelines showing purchases and sum of purchases"
+[windowed]: /_static/images/blog/introducing-timelines/windowed.svg "Timelines showing purchases and sum of purchases since start of the month"
+[data_windows_1]: /_static/images/blog/introducing-timelines/data_windows_1.svg "Timelines showing count of page views since last purchase"
+[data_windows_2]: /_static/images/blog/introducing-timelines/data_windows_2.svg "Timelines showing count of page views since last purchase observed at each purchase"
+[data_windows_3]: /_static/images/blog/introducing-timelines/data_windows_3.svg "Timelines showing average of the page view count between purchases"
+[temporal_join_1]: /_static/images/blog/introducing-timelines/temporal_join_1.svg "Timelines computing the average item review score"
+[temporal_join_2]: /_static/images/blog/introducing-timelines/temporal_join_2.svg "Timelines looking up the average review score for each purchase"
\ No newline at end of file
diff --git a/python/docs/blog/posts/2023-06-05-timeline-abstractions.qmd b/python/docs/blog/posts/2023-06-05-timeline-abstractions.qmd
new file mode 100644
index 000000000..35acfcad6
--- /dev/null
+++ b/python/docs/blog/posts/2023-06-05-timeline-abstractions.qmd
@@ -0,0 +1,292 @@
+---
+date: 2023-Jun-05
+author: Brian Godsey
+title: "It’s Time for a Time-centric Stack: Timeline Abstractions Save Time and Money"
+image: "/_static/images/blog/timeline-abstractions_banner.png"
+draft: true
+---
+
+When processing and building real-time feature values from event-based data, it is easy to underestimate the power of tools that are built to understand how time works. Many software tools can parse and manipulate time-based data, but the vast majority have little more than convenience functionality for parsing and doing arithmetic on date-time values. On the other hand, Kaskada has a built-in, natural understanding of time that just works by default. This understanding is powerful: it can reduce development time and effort, simplify deployment and debugging, and reduce the compute and storage resources needed to support real-time event-based systems.
+
+In this post, I'll discuss some costly pitfalls when building data pipelines with event-based data. In particular, I will share some anecdotes from my past experience building SQL-based pipelines, how my team spent considerable time and money that (in hindsight) could have been avoided, and how Kaskada specifically could have simplified development and reduced costs. I will get into specifics below, but the lessons I've learned from using SQL on event data boil down to this:
+
+> Event-based data and time values have special properties and natural
+> conventions, which lead us to write the same generic logic over and over. This
+> time-centric logic—using window functions, time-JOINs, etc.—can become
+> complicated very quickly, both in code readability and computational
+> complexity. Because of this, writing time-centric logic is often inefficient
+> and buggy.
+
+The cautionary anecdotes that follow may qualify as horror stories in some circles, but I promise that no humans or computing resources were harmed during the construction of these pipelines or in the mayhem that followed. There was, however, plenty of time and money lost on a few fronts.
+
+
+## Case study: calculating daily user statistics and aggregations
+
+In this example, I illustrate how hard it can be to calculate (seemingly)
+simple rolling averages when your software tools don't understand time.
+
+On a project some years ago, our goal was a common one: daily reporting of user
+activity, including aggregated statistics like 7-day and 30-day rolling totals
+and averages of various types of activity, day-of-week averages, and so on.
+We were using dbt, a SQL-based tool for developing and scheduling periodic
+builds of data pipelines. Some pipeline jobs ran every 30 or 60 minutes, and
+others ran as frequently as every 5 minutes—that was as close as we could get
+to real-time AI, because true real-time AI (RTAI) is difficult to achieve with
+scheduled batch builds. Importantly, the efficiency of our data pipeline runs
+depended on the efficiency of the series of SQL queries that defined the
+pipeline.
+
+There are many ways to implement rolling averages in SQL, but the simplest
+implementation, conceptually, is to use window functions. Window functions, by
+design, slide or roll through rows of data and compute sums, averages, and other
+statistics. For example, we can calculate a 30-day rolling average with a data
+window of 30 rows and the window function `AVG` (a sketch in SQL appears at the
+end of this section). One implication of using window functions to calculate
+rolling averages is that there must be a row for every day. If we don’t have a
+row of data for every day, then a window of 30 rows of data will not correspond
+to 30 days. For example, assume a user has 30 active days in the last 45 days,
+and therefore has only 30 rows of data for that time, with the 15 inactive days
+corresponding to “missing” rows with no user activity. Using the window function
+`AVG` on 30 rows of data will calculate the average activity values of the 30
+active days from the last 45 days, and will “ignore” the days with zero activity
+because those rows are missing. Imagine doing this with an infrequent user who
+logs in only once per month—you would be calculating the user’s average over
+their 30 active days from the last 30 months!
+
+Because of this, if we are going to use window functions to calculate rolling
+averages, it is extremely important that a row is present for every day, even if
+a user has no activity on that day, because we want to include zero days in the
+averages. As a quick side note: Kaskada doesn’t have this pitfall, because it
+understands that a day exists even if there is no activity on that day. We can
+specify a “day” or a 30-day window without needing to “manually” build a table
+of days. One way to calculate a 30-day rolling average for `activity_value` is:
+
+```fenl
+Dataset.activity_value                   # specify the data
+| sum(window=sliding(30, daily())) / 30  # sum values and divide
+```
+
+We do not need to worry about rows in the table (`Dataset`) being
+“missing”—Kaskada understands how time works and by default performs the
+calculations that give us the results we expect.
+
+
+### A costly inflection point
+
+Our window-function implementation of rolling averages worked well at first, but
+a month or two later, we noticed that cloud computing costs for certain pipeline
+jobs were increasing, slowly at first and then more quickly. Then costs tripled
+in a matter of weeks—it was as if we had hit an unforeseen inflection point,
+crossing the line from linear cost scaling to exponential. The bills for our
+cloud computing resources were going up much faster than our raw datasets were
+growing, so we knew there was a problem somewhere.
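+
+Before going further, here is roughly what our window-function implementation
+looked like. This is a hedged sketch rather than our production code; the table
+`daily_user_activity` and its columns are hypothetical stand-ins:
+
+```sql
+-- Assumes one row per user per day (including zero-activity days);
+-- otherwise a window of 30 rows does not mean 30 days.
+SELECT
+  user_id,
+  activity_date,
+  AVG(activity_value) OVER (
+    PARTITION BY user_id
+    ORDER BY activity_date
+    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
+  ) AS rolling_30_day_avg
+FROM daily_user_activity;
+```
+
+Note the hidden requirement spelled out in the comment: the window is defined
+in rows, not in days.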
+
+We investigated, and we identified a few queries in the pipeline that were
+responsible for the increasing costs. These guilty queries centered on the
+window functions that calculated rolling averages, along with some upstream and
+downstream queries that either supported or depended on those rolling averages.
+
+We dug in to find out why.
+
+
+### How we had “optimized” the window functions
+
+Soon after building the rolling averages into the pipeline, we added some tricks
+here and there to make things run more efficiently. Some of these tricks were
+more obvious and some more clever, but they all saved us time and money
+according to the ad-hoc tests we ran before putting them into production. In the
+months that followed, however, some of the optimizations we had used to
+streamline the window functions stopped saving us money, and some even ended up
+backfiring as the dataset scaled up.
+
+Because window functions are typically heavy calculations, they are natural
+targets when optimizing overall computation time. In one optimization, we
+introduced an intermediate reference table containing all users and dates in the
+dataset. This approach was convenient: it reduced code redundancy wherever we
+needed such a table, simplified development, and improved build efficiency in
+places where the table would otherwise have been rebuilt inside several queries.
+
+
+### But the table became too big
+
+Looking at query build times within the data pipeline, it was clear that this
+intermediate table of all users and all dates was a significant part of the
+problem. One specific issue was that the table had become so large that most
+queries and scans of it were slow. Even worse, because SQL engines love to do
+massively parallel calculations, our data warehouses (cloud instances) were
+trying to pull the whole table into memory while doing aggregations for daily
+reporting. The table did not fit into memory, so large amounts of temporary data
+were spilling to disk, which is very slow compared with calculations in memory.
+
+We had always been aware that the table could become large—and maintaining a row
+for every date and user throughout our entire history was clearly too much. We
+didn’t need rows filled with zeros for users who had stopped using our app
+months ago. That would lead to a lot of unnecessary data storage and scanning.
+So we implemented some simple rules for dropping inactive users from these
+tables and calculations. It wasn’t obvious what the exact rules should be, but
+there were options. Should we wait for 30 days of inactivity before leaving
+those users out of the table? 60 days? Less?
+
+Finding the “correct” rule for user churn wasn’t that important, but
+implementing any rule at all added another layer of complexity and logic to the
+data pipeline. Even after adding a rule, some of our largest user-and-date
+tables still consisted of over 90% zeros, because the majority of users did not
+engage with the app on a daily basis. Even occasional users could keep their
+accounts “alive” if they had activity every month or so. Cleverer
+implementations were more complex and had other drawbacks, so we stuck with our
+simple rules.
+
+
+### Calculations on a table that’s too big
+
+The problem of the big intermediate table spilling from memory onto disk was
+compounded by the fact that window functions often use a lot of memory to begin
+with, and we were using them on the big table in multiple places downstream.
+We were piling memory-intensive calculations on top of each other, and once they
+exceeded the capacity of the compute instances, efficiency took a nosedive and
+costs skyrocketed. But what could we do about it?
+
+We knew that any solution to the problem would need to reduce the amount of data
+spilling from memory to disk. We needed to move away from memory-intensive
+window functions, and also to stop “optimizing early” in the pipeline with a
+giant intermediate daily table filled with mostly zeros. (Also, moving some
+non-incremental builds to incremental would help—in other words, don’t compute
+over the whole table every time, just the newest data.)
+
+Moving away from window functions and the big daily table required a different
+type of time logic in SQL: instead of window functions across rows, we would
+`JOIN` events directly to a list of dates and then do `GROUP BY` aggregations to
+get the statistics we wanted (a SQL sketch of this appears a bit later). For
+example, instead of calculating a 30-day moving average by literally averaging
+30 numbers, we can do a `SUM` over the event activity within a 30-day window and
+then divide by 30. Among other things, this avoids the rows of zeros that we
+don’t really need, which means less data to read into memory and potentially
+spill to disk during big window or aggregation operations.
+
+This implementation of a 30-day rolling average isn’t as intuitive to write or
+read as using a window function, but it was computationally more efficient for
+our data.
+
+
+### A more efficient implementation was only half the battle
+
+The biggest challenge wasn’t writing the query code for this implementation.
+Once the code was written, and before we could push it to production, we needed
+to verify that it produced correct results—essentially the same results as the
+current production code—and didn’t have any bugs that would mess up the user
+experience for our app.
+
+This quality-assurance and roll-out process was incredibly tedious. Because the
+changes we made were in the core of the pipeline—and because complex SQL can be
+so hard to read, difficult to test, and a burden to debug—we needed to
+painstakingly verify that nothing downstream of the changes would be negatively
+affected. It took weeks of on-and-off code review, some manual verification of
+results, testing in a staging environment, and so on.
+
+What had started as an efficiency optimization for the rolling-average
+calculations had turned into an overhaul of parts of the core of the data
+pipeline, with a painful QA process and a risky deployment to production. Along
+the way, we had many discussions about implementation alternatives, about simply
+provisioning bigger cloud instances, and even about removing some minor features
+from our app because they were too costly to compute. And all of this happened
+because the most intuitive way to calculate a 30-day rolling average in SQL can
+be terribly inefficient unless you are careful and clever in your
+implementation.
+
+
+## Kaskada has one natural way to deal with time
+
+In stark contrast with the SQL case study above, Kaskada handles calculations on
+time windows natively. They are easy to write and are designed to run
+efficiently for all of the most common use cases. Time data is always ordered
+data, and Kaskada’s data model embodies that, making certain calculations orders
+of magnitude more efficient than they are in unordered data models such as
+SQL’s.
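+
+To make the contrast concrete, here is a hedged sketch of the
+JOIN-plus-`GROUP BY` rewrite described in the previous section. The tables
+`report_dates`, `users`, and `activity_events` and their columns are
+hypothetical stand-ins, not a real schema:
+
+```sql
+-- One output row per user per reporting date; no zero-filled
+-- daily activity table is needed.
+SELECT
+  u.user_id,
+  d.report_date,
+  COALESCE(SUM(e.activity_value), 0) / 30.0 AS rolling_30_day_avg
+FROM report_dates d
+CROSS JOIN users u
+LEFT JOIN activity_events e
+  ON  e.user_id = u.user_id                               -- entity IDs must match
+  AND e.event_time >  d.report_date - INTERVAL '30' DAY   -- window start
+  AND e.event_time <= d.report_date                       -- window end
+GROUP BY u.user_id, d.report_date;
+```
+
+Every predicate in that `JOIN` is hand-written time logic that has to be
+repeated, and kept correct, in every query that needs it.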
+
+When writing logic around time windows and window functions, we work with time
+directly instead of—as with SQL—needing to turn time into rows before
+calculating statistics and aggregations. Working with time directly removes many
+opportunities to introduce bugs and inefficiencies while turning time into rows
+and back again.
+
+Let’s revisit the Kaskada code block from earlier in the article:
+
+```fenl
+Dataset.activity_value                   # specify the data
+| sum(window=sliding(30, daily())) / 30  # sum values and divide
+```
+
+This query syntax has all of the important semantic information—data source,
+sliding window, 30 days, `sum` function, division—and not much more.
+Conceptually, that is enough information to do the calculations we want. Kaskada
+knows how to get it done without us having to specify (yet again) what a day is,
+how sliding windows work, or how to manipulate the rows of event data to make it
+happen. With less code and less logic to worry about, there is much less surface
+area in which to introduce bugs or inefficiencies.
+
+
+### We shouldn’t have to write query logic about how time works
+
+To put it half-jokingly, [Kaskada understands
+time](https://kaskada.io/2022/12/15/the-kaskada-feature-engine-understands-time.html),
+while in SQL you have to say repeatedly that the past comes before the future
+and that a calendar day still exists even if you don’t have a row for it. How
+many times in SQL do I have to write the same `JOIN` logic stating that the two
+entity IDs have to be the same and that the event timestamps need to be less
+than the time-bucket/window timestamps? In Kaskada, you don’t have to write it
+at all, and many other bugs and inefficiencies can be avoided by letting Kaskada
+handle time logic in the intuitive way (unless you want to use different logic).
+
+
+### Kaskada’s native timeline abstraction was built for this
+
+[Kaskada's concept of
+timelines](https://kaskada.io/2023/05/09/introducing-timelines-part-1.html) and
+its overall time-centric design mean that natural time-centric logic is built
+in. Kaskada can do a 30-day rolling average with a single line of code, more or
+less, and you don't have to think about rows filled with zeros or tables blowing
+up in size, because the computational model is focused on progressing through
+time, not on producing, manipulating, and storing rows.
+
+In contrast, as illustrated earlier in this article, SQL has at least two
+distinct ways to deal with time:
+
+* __window functions__: build one row per day so that row calculations become day calculations
+* __time JOINs__: JOIN event data into time buckets, then aggregate
+
+The first is conceptually easy to understand and potentially easy to write in
+SQL if you are an expert. The second is not quite as easy to understand, and it
+requires writing SQL logic about what it means for an event to belong to a day
+or date. Both implementations leave it to the coder to decide how time should
+work in each and every query that needs it—a recipe that invites trouble.
+
+
+### Try it out and let us know what you think!
+
+Kaskada is open source and easy to set up on your local machine in just a few
+minutes. For more information about how Kaskada and native timelines can
+dramatically simplify your code and improve computational efficiency for
+time-centric feature engineering and analytics on event-based data, see [our
+documentation](/guide),
+in particular the [Quick Start
+page](/guide/quickstart.html).
+
+You can also [join our Slack channel]({{< var slack-join-url >}}) to discuss and learn
+more about Kaskada.
diff --git a/python/docs/blog/posts/2023-08-25-new-kaskada.qmd b/python/docs/blog/posts/2023-08-25-new-kaskada.qmd
index c89e3754a..a375ac1cd 100644
--- a/python/docs/blog/posts/2023-08-25-new-kaskada.qmd
+++ b/python/docs/blog/posts/2023-08-25-new-kaskada.qmd
@@ -5,6 +5,7 @@ categories:
 - releases
 title: Introducing the New Kaskada
 subtitle: Embedded in Python for accessible Real-Time AI
+image: /_static/images/blog/new-kaskada_banner.png
 ---
 
 We started Kaskada with the goal of simplifying the real-time AI/ML lifecycle, and in the past year AI has exploded in usefulness and accessibility. Generative models and Large Language Models (LLMs) have revolutionized how we approach AI. Their accessibility and incredible capabilities have made AI more valuable than it has ever been and democratized the practice of AI.