
Understanding the big picture for training machine learning models and the different limitations #164

Closed
ALamraniAlaouiScibids opened this issue Jun 23, 2021 · 4 comments

Comments


ALamraniAlaouiScibids commented Jun 23, 2021

Here we will focus on optimizing a campaign towards conversions (a common use case in the industry).
Our aim is to:

  • Validate our understanding of the reporting system (I believe it would help other actors in the industry to have some clarification of the documentation)
  • Give an idea of the impact that carelessly chosen privacy parameters (L1, epsilon) could have

Current state:

Currently, the data available when optimizing an advertising campaign can be aggregated like this:

| campaign_id | site_id | postal_code | device_type | frequency | impressions | conversions |
|---|---|---|---|---|---|---|
| 123 | 789 | 75001 | device_type1 | 10 | 1000 | 1 |
| 456 | 1011 | 75001 | device_type2 | 1 | 2000 | 10 |
| 123 | 1012 | 75002 | device_type1 | 2 | 10000 | 0 |

This allows us to fully train a machine learning model using the conversions column as the label.
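For illustration, a minimal sketch of how such a dataset is typically used (the feature encoding and model choice here are purely illustrative, not part of any API):

```python
# Minimal sketch (our own illustration): training a conversion model on today's
# aggregated log data, using the conversions column as the label.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({
    "campaign_id": [123, 456, 123],
    "site_id": [789, 1011, 1012],
    "postal_code": ["75001", "75001", "75002"],
    "device_type": ["device_type1", "device_type2", "device_type1"],
    "frequency": [10, 1, 2],
    "impressions": [1000, 2000, 10000],
    "conversions": [1, 10, 0],
})

features = ["campaign_id", "site_id", "postal_code", "device_type", "frequency"]
X = pd.get_dummies(df[features].astype(str))   # one-hot encode the contextual features
y = df["conversions"]                          # label: the conversions column
model = GradientBoostingRegressor().fit(X, y)
```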

With the privacy sandbox:

We will try to get a dataset as close as possible to the current state by using the different APIs of the Privacy Sandbox.
As event-level reporting does not provide sufficiently granular information on the conversion and impression data, we will focus on the Aggregate API, which is better suited for this use case.

  1. Getting the impression data:

We suppose here that we have a mechanism such as the one described in this pull request, which allows us to do aggregate reporting on impression data with the bid context transmitted by the generate_bid function.

Let’s first take a look at the simple case of the “mono variable” report.

The DSP would need to query the reporting API on behalf of the advertiser:
SELECT COUNT(*) FROM report_impression GROUP BY site_id (conceptual query)
with an L1 parameter that defines the level of noise in the data.
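To make the questions below concrete, here is a minimal sketch of how we understand the aggregation would behave (our reading only, not the actual Aggregate Service implementation):

```python
# Illustrative sketch only (our reading of the L1 / epsilon semantics, not the
# actual Aggregate Service): each user's total contribution is capped at L1 per
# (source site, destination, time window), and Laplace noise of scale L1/epsilon
# is added to every bucket count.
import numpy as np
from collections import Counter

def noisy_impression_counts(impressions, l1, epsilon, seed=0):
    """impressions: iterable of (user_id, site_id) events within one time window."""
    rng = np.random.default_rng(seed)
    used = Counter()    # contributions already spent by each user in this window
    counts = Counter()  # capped per-site_id impression counts
    for user_id, site_id in impressions:
        if used[user_id] < l1:        # contributions beyond L1 are dropped
            used[user_id] += 1
            counts[site_id] += 1
    scale = l1 / epsilon              # Laplace scale b; std = sqrt(2) * b
    return {site: count + rng.laplace(0, scale) for site, count in counts.items()}

# e.g. noisy_impression_counts(events, l1=2**4, epsilon=7/6)
```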

  • Would the DSP need to keep a list of all the site_ids (of length n) and run n queries of the type SELECT COUNT(*) FROM report_impression WHERE site_id = xxxx?
  • Would the L1 parameter be specified by the DSP in a general config file? By the advertiser? By the publisher? Will it be passed to the Aggregate Service API with each query? Would these different entities (DSP, advertiser, publisher) need to agree on a value to use? How will they share the privacy budget?
  • The L1 parameter is the maximum contribution a user can make per source site, destination and time window. In the case of this query, would records only be dropped if the user has been shown more than L1 impressions on the same site_id?
  • If we have an L1 parameter of 2^4, then a given user would contribute at most 16 impressions for a given website. If we consider an epsilon_7_days value of 7/6 (corresponding to an epsilon_30_days of 5), we would add to each impression count a noise with a standard deviation of roughly 5, which remains usable since we can assume we will often have more than 1000 impressions per website and fewer than 16 contributions from a single user for the same (source site, advertiser, time window) tuple.
  • If the DSP re-runs the same query (for the sake of discussion, imagine there is no cache mechanism), would the “privacy budget” of each user be deducted from the L1 value, i.e. if a user has contributed n times to the buckets of the previous query, could they only contribute L1 - n times to the buckets of the new query?

Now, let’s imagine that the DSP wants to run a more complex query:

SELECT COUNT(*) FROM report_impression GROUP BY site_id, postal_code, device_type, frequency, campaign_id, hour_of_day, day_of_week (conceptual query)

  • If there are more than 2^32 combinations of these variables, the query would not be valid (the site_id is not included in the 2^32 allowed keys usable in the processAggregate function, right? Otherwise 2^32 would be a pretty low threshold). We need to keep in mind that this threshold could be reached very quickly (40k zip codes in the USA, 7 days of the week, 24 hours of the day, 100 campaign_ids, 100 device_types, 10 frequency buckets => roughly 6.7 * 10^11 > 2^32; see the quick calculation after this list).
  • One also needs to pay attention to the noise level here: we would probably need to increase the L1 parameter because a user could appear in multiple buckets of the histogram, so the resulting noise could be more harmful.
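A quick back-of-the-envelope for the key space of the query above (the cardinalities are the rough estimates from the bullet, not real figures):

```python
# Rough key-space estimate for the conceptual GROUP BY above (illustrative numbers).
zip_codes, days_of_week, hours_of_day = 40_000, 7, 24
campaign_ids, device_types, frequency_buckets = 100, 100, 10

combinations = zip_codes * days_of_week * hours_of_day * campaign_ids * device_types * frequency_buckets
print(f"{combinations:,} combinations vs 2^32 = {2**32:,}")   # ~6.7e11, far above 2^32
```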
  2. Getting the conversion data:

We would need here the conversion data joined with the impression-level data (as described again in this pull request).
Let’s imagine the following query:
SELECT COUNT(*), SUM(conversions) FROM report_impression_merged_with_conversions GROUP BY site_id, postal_code, device_type (conceptual query)
Here we could imagine that a user would only contribute 10 conversions over a 7-day period (all types of conversions) and therefore set an L1 parameter of 10. The standard deviation would then be approximately 4.
This may seem reasonable, but in the campaign optimization context we could easily have the following (a very common situation, e.g. a cold-start campaign with a low number of conversions):

| Context (site_id, postal_code, device_type) | Impressions | Conversions |
|---|---|---|
| GoodContext | 1000000 | 10 |
| BadContext | 1000000 | 2 |

The difference in the number of conversions between the good and the bad context is only about 2 times the standard deviation here.
=> Setting a high level of noise could easily have a huge impact on campaign optimization because of the very sparse distribution of conversions.
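To make the point concrete, here is a small simulation of our own, assuming additive Laplace noise with the standard deviation of roughly 4 discussed above:

```python
# Our own simulation: with only 10 vs 2 true conversions, noise with a standard
# deviation of ~4 makes the measured gap between the two contexts very unreliable.
import numpy as np

rng = np.random.default_rng(0)
noise_std = 4.0
scale = noise_std / np.sqrt(2)        # Laplace std = sqrt(2) * scale

true_good, true_bad = 10, 2           # conversions in GoodContext / BadContext
noisy_gaps = np.array([(true_good + rng.laplace(0, scale)) - (true_bad + rng.laplace(0, scale))
                       for _ in range(10_000)])
print(f"true gap = 8, noisy gap std = {noisy_gaps.std():.1f}, "
      f"wrong sign in {(noisy_gaps < 0).mean():.0%} of draws")
```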

Don't hesitate to ask if you want me to reformulate some questions or to split this issue into several specific ones.

@csharrison
Collaborator

Hey Lamrani, I think your example (1) is not in scope for this particular repo which is focused just on measuring conversions. I think some of the problems you are raising may be resolvable by allowing some amount of metadata to live alongside reports (e.g. publisher identity, etc). This would reduce the key space requirements if there is some safe information we think is OK to share, like we do in the current aggregate attribution explainer. Currently we don't have a full specification for how the aggregate reports from the FLEDGE calls will work, including extra in-the-clear information or the scope of the sensitivity budgets. @alexmturner is thinking through some of these issues which would likely end up in another repo.

I think for (2) the biggest problem is just sparsity in the data, because we are measuring campaigns with a conversion rate <= 0.001%, right?

One technique that could work in this situation is to use a larger amount of the L1 sensitivity for cold-start campaigns, and reduce it after some time when you expect users to make more conversions. Would that work?
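Roughly, just a sketch of the idea (the threshold and the fractions are made-up illustrative values, not anything specified):

```python
# Sketch only: spend a larger fraction of the per-user L1 budget on each conversion
# while a campaign is still cold-starting, then shrink it once conversions are more
# frequent. The 14-day threshold and the fractions are illustrative.
def contribution_fraction(campaign_age_days: int) -> float:
    """Fraction of the per-user L1 budget attached to each conversion report."""
    return 0.5 if campaign_age_days < 14 else 0.05
```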

@ALamraniAlaouiScibids
Author

Hello @csharrison,
Thanks for your answers.

Exactly! For (2) the biggest problem is the sparsity of the data.
This concerns cold-start campaigns but also always-on campaigns (e.g. a campaign selling cars, with very few conversions per week).
In fact, the key here is handling the L1 sensitivity well and making sure the chosen levels fit real-world campaigns. Do you have an idea of the range of values that will be allowed for the L1 sensitivity?
We would be really happy to provide data-driven insights if needed.

About (1), the example is indeed not well chosen, but it is great news that a repo will address the specification of how the aggregate reports from the FLEDGE calls will work.
However, the questions remain valid if we replace "impression" with "conversion". They were mainly about the general workflow of aggregate reporting (L1 sensitivity handling).

For instance:

  • Would the L1 parameter be specified by the DSP in a general config file? By the advertiser? By the publisher? Will it be passed to the Aggregate Service API with each query? Would these different entities (DSP, advertiser, publisher) need to agree on a value to use? How will they share the privacy budget?
    etc...

It would be great if you could provide more information and answers to the different questions.

Let me know if you want me to create another issue with a "conversion" example to avoid any confusion.

@csharrison
Collaborator

Do you have an idea of the range of values that will be allowed for the L1 sensitivity?
One idea we had for picking the L1 sensitivity is to not really pick it at all. Since the noise is a function of the L1 sensitivity, you could think of it as a real number from 0 to 1, where 1 is the maximum contribution of any user. Then, each "contribution" / conversion of a user can use some fraction of this budget.

The only reason this is not done in the explainer is because our infrastructure for doing multi-party computation can only handle integers, so we use [0, 2^16 - 1] as a discretization of [0, 1].

For attribution, we were scoping the L1 sensitivity to the partition (publisher site, advertiser site, time window) (link). These parameters are based on where impressions / conversions take place. Notably the DSP is not in this tuple. How much "budget" to spend per conversion is left up to the reporting origin, although we are open to conversations about how to best split the budget (possibly by introducing more limits).
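As a small sketch of how that budget could be used (the even split per conversion below is purely an illustration, not something the explainer mandates):

```python
# The per-user budget for a (publisher site, advertiser site, time window)
# partition is conceptually 1.0, discretized to the integer range [0, 2^16 - 1].
# How to split it across conversions is up to the reporting origin; the even
# split below is purely illustrative.
MAX_CONTRIBUTION = 2**16 - 1

def contribution_per_conversion(expected_conversions_in_window: int) -> int:
    """Integer contribution value to attach to each conversion report."""
    return MAX_CONTRIBUTION // expected_conversions_in_window

print(contribution_per_conversion(10))   # 6553, if we budget for ~10 conversions per user
```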

Happy to discuss this in more detail in the call today.

A few more points:

If we have an L1 parameter of 2^4, then a given user would contribute at most 16 impressions for a given website. If we consider an epsilon_7_days value of 7/6 (corresponding to an epsilon_30_days of 5), we would add to each impression count a noise with a standard deviation of roughly 5

I don't think this math is quite right. You want to look at the Laplace distribution with scale parameter L1/epsilon, which would end up with a standard deviation closer to ~19, I think, with those parameters.
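Spelling that out (Laplace noise with scale b = L1 / epsilon has standard deviation sqrt(2) * b):

```python
import math

l1 = 2**4          # maximum contribution per user, as in the example above
epsilon = 7 / 6    # epsilon_7_days
scale = l1 / epsilon
print(math.sqrt(2) * scale)   # ≈ 19.4
```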

If the DSP re-runs the same query (for the sake of discussion, imagine there is no cache mechanism), would the “privacy budget” of each user be deducted from the L1 value, i.e. if a user has contributed n times to the buckets of the previous query, could they only contribute L1 - n times to the buckets of the new query?

I don't think it is feasible (at least in the current design) to tightly couple the query model with the client-side data budgeting. I think the two should be considered separate (i.e. separate query limitations and client-side limitations). It may be possible to couple them but it would be quite complex, and involve the server having an understanding of "which user" each report came from.

@ALamraniAlaouiScibids
Author

Thanks @csharrison for the explanations and for taking the time to answer the questions during the call!
Will close the issue.
