Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful.
The main goals of the envisioning process are:
- Establish a clear understanding of the problem domain and the underlying business objective
- Define how a potential solution would be used and how its performance should be measured
- Determine what data is available to solve the problem
- Understand the capabilities and working practices of the data science team
- Ensure all parties have the same understanding of the scope and next steps (e.g., onboarding, data exploration workshop)
The envisioning process usually entails a series of 'envisioning' sessions where the data science team works alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution.
Generally, before defining a project scope for a data science investigation, we must first understand the problem domain:
- What is the problem?
- Why does the problem need to be solved?
- Does this problem require a machine learning solution?
- How would a potential solution be used?
However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps:
- Identify a measurable problem and define this in business terms. The objective should be clear, and we should have a good understanding of the factors that we can control - that can be used as inputs - and how they affect the objective. Be as specific as possible.
- Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of this problem. Make sure it aligns with the business objective and that you have identified the data required to evaluate the solution. Note that the data required to evaluate a solution may differ from the data needed to create a solution.
- Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem.
- One way of approaching this is by thinking about how a subject-matter expert could solve the problem manually, and the data that would be required; if a human subject-matter expert is unable to solve the problem given the available data, this indicates that additional information is required and/or more data needs to be collected.
- Based on the available data, define specific hypothesis statements - which can be proved or disproved - to guide the exploration of the data science team. Where possible, each hypothesis statement should have clearly defined success criteria (e.g., with an accuracy of over 60%). However, this is not always possible, especially for projects where no solution to the problem currently exists. In these cases, the measure of success could be based on a subject-matter expert verifying that the results meet their expectations.
- Document all the above information, to ensure alignment between stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data - and the way that the data was collected - are clearly explained, such that they can be understood by a non-subject matter expert.
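As a minimal sketch of how a hypothesis statement and its success criterion could be recorded alongside the rest of the documentation, the snippet below uses a small Python dataclass. The `Hypothesis` class, its field names, and the `evaluate` method are hypothetical and only illustrate the idea; they are not part of any prescribed template. When no quantitative threshold has been defined, the sketch falls back to sign-off by a subject-matter expert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    """A testable statement with an optional quantitative success criterion.

    All names here are illustrative assumptions, not an established schema.
    """
    statement: str
    metric: Optional[str] = None        # e.g. "accuracy"; None when no metric applies
    threshold: Optional[float] = None   # the metric must exceed this value to count as success

    def evaluate(self, observed: Optional[float] = None,
                 sme_approved: Optional[bool] = None) -> str:
        # A quantitative criterion takes precedence when one was defined.
        if self.threshold is not None:
            if observed is None:
                return "not yet evaluated"
            return "supported" if observed > self.threshold else "not supported"
        # Otherwise rely on a subject-matter expert verifying the results.
        if sme_approved is None:
            return "not yet evaluated"
        return "supported" if sme_approved else "not supported"


# Usage with made-up values: a criterion of "accuracy of over 60%".
quantitative = Hypothesis("The model can classify items by category.",
                          metric="accuracy", threshold=0.60)
print(quantitative.evaluate(observed=0.72))

# Usage when success is judged by a subject-matter expert.
qualitative = Hypothesis("Generated groupings look sensible to an expert.")
print(qualitative.evaluate(sme_approved=True))
```

Keeping the threshold next to the statement makes it harder for the success criterion to drift once exploration has started.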
Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame.
These problems are complex and require understanding from a variety of perspectives. It is not uncommon for the stakeholders not to be the end users of the solution. In these cases, listening to the actual end users is critical to the success of the project.
The following questions can help guide discussion in understanding the stakeholders' perspectives:
- Who is the end user?
- What is the current practice related to the business problem?
- What's the performance of the current solution?
- What are their pain points?
- What is their toughest problem?
- What is the state of the data used to build the solution?
- How does the end user or SME envision the solution?
During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly or adapted from [1] and [2].
- Define the objective in business terms.
- How will the solution be used?
- What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system?
- How should performance be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times)
- Frame this problem (supervised/unsupervised, online/offline, etc.)
- Is human expertise available?
- How would you solve the problem manually?
- Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?)
- List the assumptions you or others have made so far. Verify these assumptions if possible.
- Define some initial hypothesis statements to be explored.
- Highlight and discuss any responsible AI concerns if appropriate.
- What data science skills exist in the organization?
- How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)?
- What do the team's current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used?
- How are data, experiments and models currently tracked?
- Does the team employ an Agile methodology? How is work tracked?
- Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions?
- Who would be responsible for maintaining a solution produced during this project?
- Are there any restrictions on tooling that must/cannot be used?
To illustrate how the above process can be applied to a tangible problem domain, consider implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3].
Often, the objective may be simply presented, in a form such as "to improve sales". However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we be able to distinguish how much of this was a result of the new recommendation engine, as opposed to the fact that December is a peak buying season?
A better objective, in this case, would be "to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation". Here, the inputs that we can control are the choice of items that are presented to each customer and the order in which they are displayed, taking into account factors such as seasonality and how frequently the items should change.
The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimation of a customer's likelihood of purchasing a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely that this data will be available before a recommendation system has been implemented, so it is likely that we will have to use an alternative data source to build the model.
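One common way to estimate what customers would have bought without a recommendation is to withhold recommendations from a randomly chosen control group and compare purchase rates over the same period. This approach is an assumption introduced here for illustration, not something the retailer example prescribes. In the minimal sketch below, the function names and all figures are hypothetical.

```python
def conversion_rate(purchases: int, customers: int) -> float:
    """Fraction of customers in a group who made a purchase."""
    if customers <= 0:
        raise ValueError("customers must be positive")
    return purchases / customers


def incremental_sales(treated_purchases: int, treated_customers: int,
                      control_purchases: int, control_customers: int) -> float:
    """Estimate purchases attributable to recommendations.

    The control group's rate stands in for how the treated group would
    have behaved without recommendations during the same period, so
    seasonal effects such as a December surge affect both groups alike.
    """
    treated_rate = conversion_rate(treated_purchases, treated_customers)
    control_rate = conversion_rate(control_purchases, control_customers)
    uplift = treated_rate - control_rate
    return uplift * treated_customers


# Made-up figures: 10,000 customers saw recommendations, 2,000 did not.
estimate = incremental_sales(
    treated_purchases=600, treated_customers=10_000,
    control_purchases=100, control_customers=2_000,
)
print(f"Estimated additional purchases: {estimate:.0f}")
```

With these made-up figures the treated group converts at 6% and the control group at 5%, giving roughly 100 additional purchases. A real evaluation would also need to check whether a difference of that size is statistically significant.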
We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following:
- generally popular items
- items similar to those liked/purchased by the customer
- items that were liked/purchased by similar customers
- items which are complementary to those owned by the customer
Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us:
- item sales data
- customer purchase histories
- customer demographics
- item descriptions and tags
- previous outfits, or sets, which have been curated by the stylist
We would then be able to use this data to explore:
- a method of measuring similarity between items
- a method of measuring similarity between customers
- a method of measuring how complementary items are relative to one another
which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be:
- From the descriptions of each item, we can determine a measure of similarity between different items to a degree of accuracy which is specified by a stylist.
- Based on the behavior of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase, with a certainty which is greater than random choice.
- Using sets of items which have previously been sold together, we can formulate rules around the features which determine whether items are complementary or not, which can be verified by a stylist.
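To make the first hypothesis more concrete, the sketch below shows one possible way to score similarity between items from their descriptions: cosine similarity over simple word counts. This technique is an assumption chosen for brevity, not the method the project would necessarily adopt, and the catalogue entries, item identifiers, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lower-case the text and count each alphanumeric word."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target_id: str, descriptions: dict, top_n: int = 3) -> list:
    """Rank the other items by description similarity to the target item."""
    target = tokenize(descriptions[target_id])
    scores = [
        (other_id, cosine_similarity(target, tokenize(text)))
        for other_id, text in descriptions.items()
        if other_id != target_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical catalogue entries, for illustration only.
catalogue = {
    "blue_jeans": "slim fit blue denim jeans",
    "black_jeans": "slim fit black denim jeans",
    "wool_scarf": "grey wool winter scarf",
}

for item_id, score in most_similar("blue_jeans", catalogue):
    print(item_id, round(score, 2))
```

Here the two pairs of jeans share four of their five words and score about 0.8, while the scarf shares none and scores 0. A stylist could review rankings like these to judge whether the measure meets the degree of accuracy they specify.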
To ensure clarity and alignment, it is useful to summarize the envisioning stage findings, focusing on proposed detailed scenarios, assumptions and agreed decisions, as well as next steps.
We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops.
Below are the links to the exit document template and to some questions which may be helpful in confirming resource access.
- Summary of Scope Exit Document Template
- List of Resource Access Questions
- List of Data Exploration Workshop Questions
Many of the ideas presented here - and much more - were inspired by, and can be found in, the following resources, all of which are highly recommended.