Understanding the big picture for training machine learning models and the different limitations #164
Hey Lamrani, I think your example (1) is not in scope for this particular repo, which is focused just on measuring conversions. I think some of the problems you are raising may be resolvable by allowing some amount of metadata to live alongside reports (e.g. publisher identity, etc.). This would reduce the key-space requirements if there is some safe information we think is OK to share, as we do in the current aggregate attribution explainer. Currently we don't have a full specification for how the aggregate reports from the FLEDGE calls will work, including extra in-the-clear information or the scope of the sensitivity budgets. @alexmturner is thinking through some of these issues, which would likely end up in another repo. I think for (2) the biggest problem is just sparsity in the data, because we are measuring campaigns with conversion rates <= 0.001%, right? One technique that could work in this situation is to use a larger amount of the L1 sensitivity for cold-start campaigns, and reduce it after some time when you expect users to make more conversions. Would that work?
Hello @csharrison, Exactly! For (2) the biggest problem is the sparsity in the data. About (1), the example is in fact not well chosen, but it is great news that a repo will be addressing the specification for how the aggregate reports from the FLEDGE calls will work. For instance:
It would be great if you could provide more information and answers to the different questions. Let me know if you want me to create another issue with a "conversion" example to avoid any confusion.
The only reason this is not done in the explainer is because our infrastructure for doing multi-party computation can only handle integers, so we use [0, 2^16 - 1] as a discretization of [0, 1]. For attribution, we were scoping the L1 sensitivity to the partition (publisher site, advertiser site, time window) (link). These parameters are based on where impressions / conversions take place. Notably the DSP is not in this tuple. How much "budget" to spend per conversion is left up to the reporting origin, although we are open to conversations about how to best split the budget (possibly by introducing more limits). Happy to discuss this in more detail in the call today. A few more points:
I don't think this math is quite right. You want to look at the Laplace distribution with scale parameter L1/epsilon, which would end up with a standard deviation closer to ~19, I think, with those parameters.
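As a sketch of the math above: for a Laplace distribution with scale `b`, the variance is `2 * b**2`, so with `b = L1 / epsilon` the standard deviation is `sqrt(2) * L1 / epsilon`. The thread does not state the exact parameters, so the values below are illustrative only:

```python
import math

def laplace_stddev(l1_sensitivity: float, epsilon: float) -> float:
    """Standard deviation of Laplace noise with scale b = L1 / epsilon.

    For Laplace(b), the variance is 2 * b**2, so the standard
    deviation is sqrt(2) * b.
    """
    scale = l1_sensitivity / epsilon
    return math.sqrt(2) * scale

# Illustrative parameters only: an L1 sensitivity of 10 with epsilon
# around 0.75 gives a standard deviation near 19.
print(round(laplace_stddev(10, 0.75), 1))
```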
I don't think it is feasible (at least in the current design) to tightly couple the query model with the client-side data budgeting. I think the two should be considered separate (i.e. separate query limitations and client-side limitations). It may be possible to couple them, but it would be quite complex, and involve the server having an understanding of "which user" each report came from.
Thanks @csharrison for the explanations and for taking the time to answer the questions during the call!
Here we will focus on optimizing a campaign towards conversions (a common use case in the industry).
Our aim is to:
Current state:
Currently the data that is available when optimizing an advertising campaign could be aggregated like this:
This allows us to fully train a machine learning model using the conversion column as a label.
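As a minimal sketch of what is possible today with row-level data, here is a plain-Python logistic regression trained on entirely synthetic impression rows, with the conversion column as the label (the features and data are hypothetical, purely for illustration):

```python
import math
import random

random.seed(0)

# Each row: (site_quality, user_frequency) features + a conversion label.
# Synthetic signal: conversions are more likely when site_quality is high.
features = [(random.random(), random.random()) for _ in range(200)]
rows = [
    (x, 1 if x[0] > 0.7 and random.random() < 0.5 else 0)
    for x in features
]

# Plain stochastic gradient descent on the logistic log-loss.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(300):
    for x, y in rows:
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        g = p - y
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
        b -= lr * g

# The weight on the informative feature (site_quality) should come out
# clearly positive, while the uninformative one stays near zero.
print(f"w = {[round(v, 2) for v in w]}, b = {round(b, 2)}")
```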
With the privacy sandbox:
We will try to get a dataset as close as possible to the current state by using the different APIs of the Privacy Sandbox.
As the event-level reporting does not provide sufficiently granular information on the conversion and impression data, we will focus on the Aggregate API, which is better suited for this use case.
We suppose here that we have a mechanism such as the one described in this pull request, which allows us to do aggregate reporting on impression data with the bid context transmitted by the `generate_bid` function.

Let's first take a look at the simple case of the "mono variable" report.
The DSP would need to query the report API on behalf of the advertiser:
```sql
SELECT COUNT(*) FROM report_impression GROUP BY site_id
```

(conceptual query) with an L1 parameter that will define the level of noise in the data.
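As a sketch of what such a noised query could return, assuming the aggregation service adds Laplace noise with scale `L1 / epsilon` to each group count (the site names, counts, and parameters below are hypothetical):

```python
import math
import random

random.seed(42)

def sample_laplace(scale: float) -> float:
    # Inverse-CDF sampling for the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# Hypothetical true impression counts per site_id.
true_counts = {"site_a": 5000, "site_b": 120, "site_c": 7}

# Assumed noise model: Laplace(L1 / epsilon) added to each group count.
l1, epsilon = 10, 1.0
noisy = {s: c + sample_laplace(l1 / epsilon) for s, c in true_counts.items()}

# Note how the relative error explodes for the small groups.
for site, count in noisy.items():
    print(site, round(count))
```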
Would the following query also be possible?

```sql
SELECT COUNT(*) FROM report_impression WHERE site_id = xxxx
```

Now, let's imagine that the DSP wants to run a more complex query:
```sql
SELECT COUNT(*) FROM report_impression GROUP BY site_id, postal_code, device_type, frequency, campaign_id, hour_of_day, day_of_week
```

(conceptual query) I suppose the 2^32 limit applies to the output of the `processAggregate` function, right? Otherwise 2^32 would be a pretty low threshold. We need to keep in mind that this threshold could be reached very quickly: 40k zip codes in the USA × 7 days of the week × 24 hours of the day × 100 campaign_ids × 100 device_types × 10 frequency buckets ≈ 6.7 × 10^11 > 2^32.

We would need here the conversion data joined with the impression-level data (as described again in this pull request).
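The key-space arithmetic above can be checked with a short script (the cardinalities are the rough estimates from this thread, not real data):

```python
import math

# Rough per-dimension cardinalities from the example above.
dims = {
    "postal_code": 40_000,   # ~40k zip codes in the USA
    "day_of_week": 7,
    "hour_of_day": 24,
    "campaign_id": 100,
    "device_type": 100,
    "frequency_bucket": 10,
}

key_space = math.prod(dims.values())
print(f"{key_space:.1e}")          # ~6.7e+11
print(key_space > 2 ** 32)         # far exceeds a 32-bit key space
```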
Let’s imagine the following query:
```sql
SELECT COUNT(*), SUM(conversions) FROM report_impression_merged_with_conversions GROUP BY site_id, postal_code, device_type
```

(conceptual query) Here we could imagine that a user would only contribute 10 conversions over a 7-day period (all types of conversions combined), and therefore set an L1 parameter of 10. The standard deviation would then be approximately 4.
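The thread does not state the epsilon behind this figure, but assuming Laplace noise with scale `b = L1 / epsilon`, a standard deviation of ~4 with L1 = 10 lets us back out the implied scale and epsilon:

```python
import math

l1 = 10            # per-user contribution bound assumed above
target_std = 4.0   # standard deviation quoted above

# For Laplace(b): std = sqrt(2) * b, so b = std / sqrt(2).
scale = target_std / math.sqrt(2)
# With b = L1 / epsilon, the implied epsilon is:
epsilon = l1 / scale
print(round(scale, 2), round(epsilon, 2))  # prints 2.83 3.54
```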
This may seem reasonable, but in the Campaign Optimization context we could have (a very common use case, e.g. a cold-start campaign with a low number of conversions):
The difference in the number of conversions between the good and the bad context is only equal to 2 times the standard deviation here.
=> Setting a high level of noise could easily have a huge impact on Campaign Optimization due to the very sparse distribution of the conversions.
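A quick simulation illustrates why a gap of only 2 standard deviations is fragile: after noising, the "good" and "bad" contexts can swap order a noticeable fraction of the time (all numbers below are hypothetical, assuming Laplace noise with std ≈ 4):

```python
import math
import random

random.seed(7)

def sample_laplace(scale: float) -> float:
    # Inverse-CDF sampling for the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# Hypothetical cold-start numbers: the good context has exactly
# 2 standard deviations more conversions than the bad one.
std = 4.0
scale = std / math.sqrt(2)          # Laplace scale giving that std
good_true, bad_true = 10, 10 - 2 * std

# Count how often noising flips the ordering of the two contexts.
flips = sum(
    good_true + sample_laplace(scale) < bad_true + sample_laplace(scale)
    for _ in range(10_000)
)
print(f"ordering flipped in {flips / 100:.1f}% of trials")
```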
Don't hesitate to let me know if you want me to reformulate some questions or to split this issue into several more specific issues.