CDNs throwing data off? #7

Open
tomByrer opened this issue Jul 14, 2019 · 5 comments
Labels
question Further information is requested

Comments

@tomByrer

tomByrer commented Jul 14, 2019

Third-party CDNs can serve a cached, more local, and thus faster copy of a website. Wouldn't that make your data inaccurate, since it is effectively random which hosted sites are also CDN-cached and which don't use a CDN at all?

For example, for a client I was planning to add CloudFlare to their Weebly site. To do so, I would switch their DNS record from Weebly's IP to CloudFlare, then CloudFlare would request Weebly. IIRC, the first few requests through 1 of CloudFlare's 100+ POPs would have to request a new copy since it wasn't cached, but subsequent visits would be faster via the closer POP.
But not all of Weebly's traffic runs through CloudFlare; likely little of it does. A larger percentage of GoDaddy's WordPress admins would likely use a CDN, though, which could skew the numbers in GoDaddy's favor even if Weebly's raw hosting stats are faster.

You could test for CloudFlare via their cf-cache-status &/or cf-ray headers. But there are many more CDNs out there, a few CDNs a friend tests are here.
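A rough sketch of that header check, in Python. The function name and the sample header values are made up for illustration; only the `cf-ray` and `cf-cache-status` header names come from the suggestion above, and other CDNs would need their own markers.

```python
def looks_like_cloudflare(headers):
    """Return True if a response's headers carry a CloudFlare marker.

    `headers` is a plain dict of header name -> value (hypothetical input shape).
    """
    # HTTP header names are case-insensitive, so normalize before comparing.
    names = {name.lower() for name in headers}
    return "cf-ray" in names or "cf-cache-status" in names


# Illustrative values only, not taken from a real response.
sample = {"Server": "cloudflare", "CF-RAY": "hypothetical-ray-id", "CF-Cache-Status": "HIT"}
print(looks_like_cloudflare(sample))
```

This only says whether a response passed through CloudFlare, not whether it was a cache hit; the value of `cf-cache-status` would have to be inspected for that.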

Note: It could be possible that ALL traffic for a particular host goes through a CDN. Or CDNs are automatically available at higher tiers. Or someone configured their site to use an image CDN, Google Fonts, and jsDelivr for JavaScript plugins...

@rviscomi
Owner

We detect the host a website is using based on the response headers in the HTTP Archive dataset. If the CDN is also forwarding the response headers from the host's origin server then we would detect the website as using that host. We may also have CDN information about each website when available, so we could further segment the performance data by that dimension. Would that address your concern?

For websites that are inconsistently served by a CDN, HTTP Archive would only detect the CDN based on the single test. And to clarify, we're only looking at the response headers for the initial HTML page, not individual resources used by the page like images/JS/fonts. TTFB performance comes from the Chrome UX Report dataset and is limited to the time until the response for the initial HTML request.
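To make the header-based detection concrete, here is a small Python sketch that mirrors the idea of the base query further down in this thread: join a few header fields, lowercase them, and look for a known token. Only a handful of the tokens and labels from that query are copied here, and `detect_platform` is a hypothetical helper, not part of HTTP Archive.

```python
import re

# A few token -> label pairs copied from the base query; the real list is longer.
PLATFORM_LABELS = {
    "x-kinsta-cache": "Kinsta",
    "wpe-backend": "WP Engine",
    "netlify": "Netlify",
    "x-shopify-stage": "Shopify",
    "weebly": "Weebly",
}
PLATFORM_PATTERN = re.compile("|".join(re.escape(token) for token in PLATFORM_LABELS))


def detect_platform(*header_fields):
    """Return a platform label if any known token appears in the headers, else None."""
    # Missing fields are treated as empty strings (a choice made for this sketch).
    haystack = "".join(field or "" for field in header_fields).lower()
    match = PLATFORM_PATTERN.search(haystack)
    return PLATFORM_LABELS[match.group(0)] if match else None


# Hypothetical header strings for the initial HTML response.
print(detect_platform("X-Kinsta-Cache = HIT", None, "", "nginx"))
print(detect_platform("", "", "", "Apache"))
```

If a CDN strips or rewrites the origin's headers, no token matches and the site drops out of the per-host numbers, which is the caveat described above.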

@rviscomi added the "question" (Further information is requested) label Jul 15, 2019
@tomByrer
Author

further segment the performance data by [CDN]

That's a good idea.

@joshkoenig

Some platforms (like ours) bundle a CDN as part of their value, so I appreciate it being included. :)

@rviscomi
Owner

I hacked together this sheet to show what a CDN breakdown might look like. Feel free to make a copy of the sheet if you'd like to play with the filtering/sorting/formatting.

Results:

image

Host/CDN combos with fewer than 1,000 websites were hidden.

Here's the base query:

#standardSQL
SELECT
  CASE
   WHEN platform = 'seravo' THEN 'Seravo'
   WHEN platform = 'automattic.com/jobs' THEN 'Automattic'
   WHEN platform = 'x-ah-environment' THEN 'Acquia'
   WHEN platform = 'x-pantheon-styx-hostname' THEN 'Pantheon'
   WHEN platform = 'wpe-backend' THEN 'WP Engine'
   WHEN platform = 'x-kinsta-cache' THEN 'Kinsta'
   WHEN platform = 'hubspot' THEN 'HubSpot'
   WHEN platform = '192fc2e7e50945beb8231a492d6a8024' THEN 'Siteground'
   WHEN platform = 'x-github-request' THEN 'GitHub'
   WHEN platform = 'alproxy' THEN 'AlwaysData'
   WHEN platform = 'netlify' THEN 'Netlify'
   WHEN platform = 'x-lw-cache' THEN 'Liquid Web'
   WHEN platform = 'squarespace' THEN 'Squarespace'
   WHEN platform = 'x-wix-request-id' THEN 'Wix'
   WHEN platform = 'x-shopify-stage' THEN 'Shopify'
   WHEN platform = 'x-now-cache' THEN 'ZEIT Now'
   WHEN platform = 'flywheel' THEN 'Flywheel'
   WHEN platform = 'weebly' THEN 'Weebly'
   ELSE NULL
  END AS platform,
  client,
  cdn,
  COUNT(DISTINCT origin) AS n,
  -- Share of TTFB experiences by histogram bin start: under 200 ms, 200 to 1000 ms, and 1000 ms or more.
  SUM(IF(ttfb.start < 200, ttfb.density, 0)) / SUM(ttfb.density) AS fast,
  SUM(IF(ttfb.start >= 200 AND ttfb.start < 1000, ttfb.density, 0)) / SUM(ttfb.density) AS avg,
  SUM(IF(ttfb.start >= 1000, ttfb.density, 0)) / SUM(ttfb.density) AS slow
FROM
  `chrome-ux-report.all.201908`,
  UNNEST(experimental.time_to_first_byte.histogram.bin) AS ttfb
JOIN
  (SELECT _TABLE_SUFFIX AS client, url, REGEXP_EXTRACT(LOWER(CONCAT(respOtherHeaders, resp_x_powered_by, resp_via, resp_server)),
      '(seravo|x-kinsta-cache|automattic.com/jobs|x-ah-environment|x-pantheon-styx-hostname|wpe-backend|hubspot|192fc2e7e50945beb8231a492d6a8024|x-github-request|alproxy|netlify|x-lw-cache|squarespace|x-wix-request-id|x-shopify-stage|x-now-cache|flywheel|weebly)')
    AS platform,
  _cdn_provider AS cdn
  FROM `httparchive.summary_requests.2019_08_01_*`
  WHERE firstHtml)
ON
  client = IF(form_factor.name = 'desktop', 'desktop', 'mobile') AND
  CONCAT(origin, '/') = url
WHERE
  platform IS NOT NULL
GROUP BY
  platform,
  client,
  cdn
ORDER BY
  n DESC

@tomByrer
Author

Nice work @rviscomi
