CDNs throwing data off? #7
Comments
We detect the host a website is using based on the response headers in the HTTP Archive dataset. If the CDN is also forwarding the response headers from the host's origin server, then we would detect the website as using that host. We may also have CDN information about each website when available, so we could further segment the performance data by that dimension. Would that address your concern? For websites that are inconsistently served by a CDN, HTTP Archive would only detect the CDN based on the single test. And to clarify, we're only looking at the response headers for the initial HTML page, not individual resources used by the page like images/JS/fonts. TTFB performance comes from the Chrome UX Report dataset and is limited to the time until the response for the initial HTML request.
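As a rough illustration of the header-based detection described above, here is a minimal Python sketch. It assumes the response headers of the initial HTML request are available as a plain dict; the function name `detect_platform` and the shortened signature table are hypothetical, and the signature strings are a subset of those used in the query shared later in this thread. It is not the project's actual implementation.

```python
import re

# Signature substring -> display name. A subset of the signatures that
# appear in the BigQuery query later in this thread.
PLATFORM_SIGNATURES = {
    "seravo": "Seravo",
    "x-kinsta-cache": "Kinsta",
    "wpe-backend": "WP Engine",
    "x-shopify-stage": "Shopify",
    "x-wix-request-id": "Wix",
    "netlify": "Netlify",
    "squarespace": "Squarespace",
    "weebly": "Weebly",
}

# One alternation over all signatures, escaped so they match literally.
_PATTERN = re.compile("|".join(re.escape(s) for s in PLATFORM_SIGNATURES))


def detect_platform(headers):
    """Return the platform whose signature appears in the headers, else None.

    `headers` is assumed to be a dict of header name -> value for the
    initial HTML response only (hypothetical input shape).
    """
    # Lowercase names and values together, mirroring LOWER(CONCAT(...)).
    haystack = " ".join(f"{name}: {value}" for name, value in headers.items()).lower()
    match = _PATTERN.search(haystack)
    return PLATFORM_SIGNATURES[match.group(0)] if match else None
```

Note that if a CDN strips or replaces the origin's headers, a lookup like this would find no signature and the site would simply go undetected.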
That's a good idea.
Some platforms (like ours) bundle a CDN as part of their value, so I appreciate it being included. :)
I hacked together this sheet to show what a CDN breakdown might look like. Feel free to make a copy of the sheet if you'd like to play with the filtering/sorting/formatting. Results: Host/CDN combos with fewer than 1,000 websites were hidden. Here's the base query:
#standardSQL
SELECT
  CASE
    WHEN platform = 'seravo' THEN 'Seravo'
    WHEN platform = 'automattic.com/jobs' THEN 'Automattic'
    WHEN platform = 'x-ah-environment' THEN 'Acquia'
    WHEN platform = 'x-pantheon-styx-hostname' THEN 'Pantheon'
    WHEN platform = 'wpe-backend' THEN 'WP Engine'
    WHEN platform = 'x-kinsta-cache' THEN 'Kinsta'
    WHEN platform = 'hubspot' THEN 'HubSpot'
    WHEN platform = '192fc2e7e50945beb8231a492d6a8024' THEN 'Siteground'
    WHEN platform = 'x-github-request' THEN 'GitHub'
    WHEN platform = 'alproxy' THEN 'AlwaysData'
    WHEN platform = 'netlify' THEN 'Netlify'
    WHEN platform = 'x-lw-cache' THEN 'Liquid Web'
    WHEN platform = 'squarespace' THEN 'Squarespace'
    WHEN platform = 'x-wix-request-id' THEN 'Wix'
    WHEN platform = 'x-shopify-stage' THEN 'Shopify'
    WHEN platform = 'x-now-cache' THEN 'ZEIT Now'
    WHEN platform = 'flywheel' THEN 'Flywheel'
    WHEN platform = 'weebly' THEN 'Weebly'
    ELSE NULL
  END AS platform,
  client,
  cdn,
  COUNT(DISTINCT origin) AS n,
  SUM(IF(ttfb.start < 200, ttfb.density, 0)) / SUM(ttfb.density) AS fast,
  SUM(IF(ttfb.start >= 200 AND ttfb.start < 1000, ttfb.density, 0)) / SUM(ttfb.density) AS avg,
  SUM(IF(ttfb.start >= 1000, ttfb.density, 0)) / SUM(ttfb.density) AS slow
FROM
  `chrome-ux-report.all.201908`,
  UNNEST(experimental.time_to_first_byte.histogram.bin) AS ttfb
JOIN (
  SELECT
    _TABLE_SUFFIX AS client,
    url,
    REGEXP_EXTRACT(
      LOWER(CONCAT(respOtherHeaders, resp_x_powered_by, resp_via, resp_server)),
      '(seravo|x-kinsta-cache|automattic.com/jobs|x-ah-environment|x-pantheon-styx-hostname|wpe-backend|hubspot|192fc2e7e50945beb8231a492d6a8024|x-github-request|alproxy|netlify|x-lw-cache|squarespace|x-wix-request-id|x-shopify-stage|x-now-cache|flywheel|weebly)') AS platform,
    _cdn_provider AS cdn
  FROM
    `httparchive.summary_requests.2019_08_01_*`
  WHERE
    firstHtml)
ON
  client = IF(form_factor.name = 'desktop', 'desktop', 'mobile') AND
  CONCAT(origin, '/') = url
WHERE
  platform IS NOT NULL
GROUP BY
  platform,
  client,
  cdn
ORDER BY
  n DESC
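To make the fast/avg/slow columns in the query concrete, here is a small Python sketch of the same bucketing arithmetic. It assumes the TTFB histogram is given as a list of `(start_ms, density)` pairs; the function name `ttfb_shares` and that input shape are hypothetical, while the 200 ms and 1000 ms thresholds are the ones used in the query.

```python
def ttfb_shares(bins):
    """Split histogram density into fast/avg/slow shares.

    `bins` is a list of (start_ms, density) pairs (hypothetical shape).
    Thresholds follow the query: fast < 200 ms, avg 200-999 ms, slow >= 1000 ms.
    """
    total = sum(density for _, density in bins)
    if total == 0:
        # No data: avoid dividing by zero.
        return {"fast": 0.0, "avg": 0.0, "slow": 0.0}
    fast = sum(d for start, d in bins if start < 200)
    avg = sum(d for start, d in bins if 200 <= start < 1000)
    slow = sum(d for start, d in bins if start >= 1000)
    return {"fast": fast / total, "avg": avg / total, "slow": slow / total}
```

Dividing by the summed density, as the query does, normalizes each group so the three shares add up to 1 regardless of how many origins were aggregated.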
Nice work @rviscomi
Third-party CDNs can serve a cached, more local, and thus faster copy of a website. Wouldn't that make your data inaccurate, since which hosted sites also sit behind a CDN cache is effectively random, while other sites don't use a CDN at all?
For example, for a client I was planning to add Cloudflare to their Weebly site. To do so, I would switch their DNS record from Weebly's IP to Cloudflare, and Cloudflare would then request the site from Weebly. IIRC, the first few requests through any one of Cloudflare's 100+ POPs would have to fetch a fresh copy since nothing was cached yet, but subsequent visits would be faster via the closer POP.
But not all of Weebly's traffic runs through Cloudflare; likely few sites do. A larger percentage of GoDaddy's WordPress admins, however, would likely use a CDN, which could skew the numbers in that direction, even if Weebly's raw hosting stats are faster.
You could test for Cloudflare via their cf-cache-status and/or cf-ray headers. But there are many more CDNs out there; a few CDNs a friend tests are here.
Note: It could be possible that ALL traffic for a particular host runs through a CDN. Or CDNs are automatically available at higher tiers. Or someone configured their site to use an image CDN, Google Fonts, and jsDelivr for JavaScript plugins...