Permissions to observe topics in page head and body #224

Open
dmarti opened this issue Jul 24, 2023 · 18 comments

Comments

@dmarti
Contributor

dmarti commented Jul 24, 2023

Add two permissions, default none, for sites to allow training on the HTML head (including title) and body elements.

  • Permissions-Policy: browsing-topics-observe-head
  • Permissions-Policy: browsing-topics-observe-body

This would address the problem of sensitive titles and other page content covered in #118 while still allowing large, general-interest sites to contribute fairly to Topics API audience data collection.

Related: #92 #206
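
As a rough illustration of the opt-in described above, the sketch below serializes a Permissions-Policy header value that enables head observation for the page's own origin and one third party while leaving body observation off. The two feature names come from the proposal and are not shipped browser features; the helper `build_permissions_policy` and the origin `https://adtech.example` are hypothetical, and wildcard allowlists are deliberately not handled.

```python
def build_permissions_policy(allowlists):
    """Serialize a mapping of feature name -> list of allowed origins.

    An empty list serializes to `()`, which disables the feature.
    The keyword "self" is emitted unquoted; other origins are quoted.
    """
    parts = []
    for feature, origins in allowlists.items():
        tokens = [o if o == "self" else f'"{o}"' for o in origins]
        parts.append(f"{feature}=({' '.join(tokens)})")
    return ", ".join(parts)


# Proposed (not shipped) feature names from this issue; the ad-tech
# origin is a made-up placeholder.
header_value = build_permissions_policy({
    "browsing-topics-observe-head": ["self", "https://adtech.example"],
    "browsing-topics-observe-body": [],
})
print("Permissions-Policy:", header_value)
```

A site that has reviewed only its titles could send something like this, keeping body content out of training.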

@jkarlin
Collaborator

jkarlin commented Jul 25, 2023

By default off, do you mean default self or default none? Default self would mean that ad-tech (in the top frame) could still enable it on the page. Default none would mean that the publisher page would have to opt in in its response header.
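
The difference between the two defaults can be sketched with a simplified model of how a top-level document resolves a feature. This is an assumption-laden toy, not the browser's actual algorithm: the function name `enabled_in_top_frame` is hypothetical, iframe delegation is ignored, and "none" is treated as a possible default purely for the sake of this discussion.

```python
def enabled_in_top_frame(default, header_allowlist=None):
    """Return True if the feature is on for the top-level document.

    default: "self" or "none", the feature's assumed default allowlist.
    header_allowlist: None when the page sent no Permissions-Policy entry
    for this feature, otherwise the list of tokens it declared.
    """
    if header_allowlist is not None:
        # An explicit header entry decides the outcome, whatever the default.
        return "self" in header_allowlist or "*" in header_allowlist
    # No header entry: fall back to the default.
    return default == "self"


# Default self: script in the top frame can use the feature with no header.
print(enabled_in_top_frame("self"))
# Default none: nothing happens until the publisher opts in via its header.
print(enabled_in_top_frame("none"))
print(enabled_in_top_frame("none", ["self"]))
```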

@dmarti
Contributor Author

dmarti commented Jul 25, 2023

Thank you @jkarlin, edited to default none. There are many examples of sites that might be willing to have their domain used for training but whose page titles and content would be totally inappropriate for it (such as book titles on a bookstore site, or titles of health advice articles on a general-interest consumer advice site). Any site that intends to allow training on head and body should therefore have to review its pages first and affirmatively turn it on.

@jkarlin
Collaborator

jkarlin commented Jul 25, 2023

Ack, thanks. So what is the incentive for a publisher to opt into this?

@dmarti
Contributor Author

dmarti commented Jul 25, 2023

Three possible reasons: (edited to include use case from @patmmccann)

  • An adtech intermediary compensates sites for providing it with additional data beyond what would be available from the domain alone

  • A large video site that is owned by the same company as a browser vendor chooses to opt in some or all of its pages, in order to avoid competition issues resulting from different treatment by the browser of video site channels and independent sites

  • An advertising service does a human review of sites before working with them and adding its third-party code. As part of the review, a human reviewer checks for sensitivity and privacy concerns in page head and/or body, and if none, adds permission for relevant topics observation by the service's own Topics API caller.

@jkarlin
Collaborator

jkarlin commented Jul 25, 2023

Let's see if some publishers ask for this feature.

@dmarti
Contributor Author

dmarti commented Jul 25, 2023

If YouTube requested opt-in HTML title or head training, are there any obstacles to giving it to them?

@michaelkleber
Collaborator

I do think there are still some questions that would need consideration:

  1. As discussed in "Use topics from a meta tag on Special Topics Provider Sites" (#206), there is still the risk of what you dubbed "the Reddit problem" of sites deliberately corrupting data.

  2. It would be easier for a malicious party to circumvent the per-caller filtering logic (which only allows a party to observe a topic if they were previously on a page about that topic). We lose that protection if it's easy for one site to pretend to be about every topic in the world.

  3. Of course we would need a new topic-assignment ML model that did a good job on this new input data.

None of these seems insurmountable, but each one of them would require new work.
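
To make the second concern concrete, here is a minimal sketch of per-caller filtering as described in that item: a caller only gets back topics it has itself observed for the user. The function and data names are hypothetical, and the model leaves out everything else the real API does (epochs, time windows, random topics).

```python
def topics_visible_to_caller(user_topics, observed_by_caller, caller):
    """Keep only the topics this caller has itself observed for the user."""
    seen = observed_by_caller.get(caller, set())
    return [t for t in user_topics if t in seen]


user_topics = ["Travel", "Cooking", "Autos"]
observed = {
    "adtech.example": {"Fitness", "Travel"},
    # A caller present on one page that claims to be about every topic.
    "claims-everything.example": {"Fitness", "Travel", "Cooking", "Autos"},
}

honest = topics_visible_to_caller(user_topics, observed, "adtech.example")
inflated = topics_visible_to_caller(user_topics, observed, "claims-everything.example")
print(honest)
print(inflated)
```

In this toy model the caller on the "about everything" page passes the filter for every topic, which is the loss of protection the item describes.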

@patmmccann

Let's see if some publishers ask for this feature.

We represent 4000 publishers, we're asking for this.

@michaelkleber
Collaborator

@patmmccann Just in the interest of clarity, are you asking for this API change because you plan to use it on some of your 4000 sites? For example, what "large, general-interest sites" do you run that you want to make "contribute fairly"?

If I understand correctly, you and Don are both from Raptive, and Don has been asking for this so that someone else might be compelled to use it.

@patmmccann

patmmccann commented Aug 9, 2023

I realize just now I am mixing up threads, and I moved my previous comment to the correct thread.

Our goal is absolutely to opt our publishers into using page context to better populate topics, as we have already deployed a topics network operating within all 4000 of them. Mediavine has done something similar, I understand.

For example you can see https://ads.adthrive.com/builds/core/94b7c03/html/topics.html called from https://firstquarterfinance.com/

Page title or other metadata can be very helpful for a more compact network of sites to generate useful topics. For example, suppose there are five large content aggregator sites owned by News Corp; they would be much better able to build a useful network if they could give their own network permission to share headers with their own tech.

@gwhigs at Gannett is working on this in his network.

@patmmccann

For example, what "large, general-interest sites" do you run that you want to make "contribute fairly"?

Encyclopedia Britannica, thoughtcatalog.com, and mediaite.com, to name a few.

@patmmccann

There's another application here: the topics classifier fails to generate any topic at all on many of our sites. This would allow those sites to "contribute fairly" as well, not just general sites.

image

@michaelkleber
Collaborator

Thanks very much! Learning that you "have already deployed a topics network operating within all 4000 of" your sites makes this a compelling feature request.

The concerns that I mentioned above are all things we will need to figure out how to handle, so this is certainly still going to take work. But it's great to have a concrete demonstration that this would indeed be a way to add value to Topics data.

@AramZS

AramZS commented Aug 30, 2023

  1. As discussed in "Use topics from a meta tag on Special Topics Provider Sites" (#206), there is still the risk of what you dubbed "the Reddit problem" of sites deliberately corrupting data.

To be clear, I don't think this is a solvable issue. Tech companies are making it easier every day (literally) for people to generate whole sites with unique domains focused on all sorts of topics or on specific topics. I think there is a broader 'trust' issue about whether Topics should work on specific sites without some signal of trustworthiness, but that is a general problem, not one that is particularly relevant to whether head and body are observed.

If Topics is successful there will be significant monetary incentive to play the model. I don't think that the changes suggested here will make a meaningful difference to the effectiveness of bad actors in doing so. It may make it harder or easier, but not meaningfully enough to dissuade anyone.

@michaelkleber
Collaborator

@AramZS I think the Topics answer to that concern has to be "curation" on the part of the API caller — that is, their deciding whether or not to observe topics on a particular page or site.

My instincts are that this probably becomes harder if the calculation expands beyond domain name. But I fully agree that this is an issue that API callers ought to think about either way.

@patmmccann

patmmccann commented Dec 1, 2023

Interesting update here: instead of getting no topic, the latest classifier gives the wrong topic to each of those sites. Not sure which is the better outcome.

image

cc @leeronisrael

@patmmccann

@leeronisrael @michaelkleber icymi

image

@patmmccann

@michaelkleber @AramZS it occurs to me that the problem of sites deliberately corrupting the data through their choice of site name already exists today. See, for example, https://www.workandmoney.com/s/actor-most-oscar-nominations-no-wins-b89d656968274d51
