Skip to content

FR: xml_url() to return resolved location, or new xml_url_canonical() #453

Open
@t-kalinowski

Description

@t-kalinowski

I am writing a small function to find all links on a page, and attempting to use url_absolute() to convert relative links to absolute links.

I've run into an issue if the original url to read_html() redirects to a different location, because then links normalized with url_absolute(link, base = xml_url(doc)) are incorrect.

small reprex:

library(xml2)
url <- "https://docs.posit.co/connect/admin"
x <- read_html(url)

links <- "../admin/appendix/branding/index.html"
# in actuality, 
# links <- x |> xml_find_all(".//a[@href]") |> xml_attr("href", default = "") 

# note the "/connect/" is swallowed  
x2 <- url_absolute("../admin/appendix/branding/index.html", base = xml2::xml_url(x))
x2
#> [1] "https://docs.posit.co/admin/appendix/branding/index.html"
read_html(x2)
#> Error in open.connection(x, "rb"): cannot open the connection

# because we need to add a trailing backslash to the base url
x2 <- url_absolute("../admin/appendix/branding/index.html", base = paste0(xml_url(x), "/"))
x2
#> [1] "https://docs.posit.co/connect/admin/appendix/branding/index.html"
read_html(x2)
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="nav-sidebar floating nav-fixed">\n\n<div id="quarto-search-r ...

# because the original request was redirected to a different location
system("curl -I https://docs.posit.co/connect/admin | grep location:", intern = T)
#> [1] "location: /connect/admin/\r"

system('curl -I -L -o /dev/null -s -w "%{url_effective}\n" https://docs.posit.co/connect/admin', intern = T)
#> [1] "https://docs.posit.co/connect/admin/"

Created on 2025-02-27 with reprex v2.1.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions