feat: add Public regex #77
Is this concept of "public URL" well understood or defined somewhere, or did you come up with it here? Personally, I think the better approach would be to expose more information with the existing regular expressions, like we did in 09d66fb, and then you can use Relaxed and do any filtering that you see fit. You could, for example, discard any matches with a scheme other than This approach wouldn't be significantly faster or slower, I think, but what matters is that it would be more configurable to one's needs. Unless "public URL" is a very well-defined and well-understood concept, I think that would be the way to go.
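The "match loosely, then filter" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `candidateRx` is a deliberately crude stand-in for a relaxed matcher (it is not the xurls pattern), and `keepSchemes` is a hypothetical helper name, not something the library provides.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// candidateRx is a crude stand-in for a relaxed URL matcher (assumption:
// it is NOT the xurls pattern, it only keeps this sketch self-contained).
var candidateRx = regexp.MustCompile(`[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>"]+`)

// keepSchemes (hypothetical helper) finds loose matches in text and keeps
// only those whose scheme is one of the allowed ones.
func keepSchemes(text string, allowed ...string) []string {
	ok := make(map[string]bool, len(allowed))
	for _, s := range allowed {
		ok[strings.ToLower(s)] = true
	}
	var out []string
	for _, m := range candidateRx.FindAllString(text, -1) {
		u, err := url.Parse(m)
		if err != nil {
			continue // not parseable as a URL, so drop it
		}
		if ok[strings.ToLower(u.Scheme)] {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	text := "see https://example.com/a and ftp://files.example.org/x or http://foo.bar"
	fmt.Printf("%q\n", keepSchemes(text, "http", "https"))
}
```

The filtering step is ordinary Go code, so the caller decides which schemes (or hosts) are acceptable without the library needing a dedicated regex per use case.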
I was considering forking and opening a PR for a very similar thing. I agree that there may be a lack of consensus on what a Public URL actually is, but the function could be renamed to WebURL or something like that, filtering out all schemes other than http(s). Then you could have a RelaxedWebURL that would allow either http(s) or schemeless URLs. Alternatively, it would be nice to have a function that accepts a slice of schemes as its input, along with a relaxed boolean to accept schemeless URLs as well. I can see in the commit you referenced that schemeless URLs were suggested to be filtered in the same manner. Maybe this is the right way, but it seems a bit user-unfriendly.
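For illustration, a function taking a slice of schemes plus a relaxed flag might look something like the sketch below. Everything here is an assumption: `schemeRegexp` is a hypothetical name, and the host and path patterns are simplified placeholders, far looser than what xurls generates from its TLD list.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// schemeRegexp (hypothetical, not part of xurls) builds a regex that matches
// URLs with one of the given schemes. If relaxed is true, the scheme prefix
// becomes optional, so schemeless hosts match as well.
func schemeRegexp(schemes []string, relaxed bool) (*regexp.Regexp, error) {
	if len(schemes) == 0 {
		return nil, errors.New("at least one scheme is required")
	}
	quoted := make([]string, len(schemes))
	for i, s := range schemes {
		quoted[i] = regexp.QuoteMeta(s)
	}
	prefix := `(?:` + strings.Join(quoted, "|") + `)://`
	if relaxed {
		prefix = `(?:` + prefix + `)?`
	}
	// Simplified placeholders: dotted host labels and an optional path.
	const host = `[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+`
	const path = `(?:/\S*)?`
	return regexp.Compile(`(?i)\b` + prefix + host + path)
}

func main() {
	text := "go to https://example.com/x or ftp://example.org or example.net"

	strict, err := schemeRegexp([]string{"http", "https"}, false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", strict.FindAllString(text, -1))

	relaxed, err := schemeRegexp([]string{"http", "https"}, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", relaxed.FindAllString(text, -1))
}
```

One caveat with this sketch: in relaxed mode the host of a URL with a disallowed scheme (example.org in ftp://example.org) still matches as a schemeless URL, because Go's regexp package has no lookbehind to rule it out. Filtering the matches afterwards avoids that problem.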
The public suffix list initiative from Mozilla has defined a public suffix (aka effective TLD or eTLD). A good description here:
@cspeidel we already use the publicsuffix list for TLDs: xurls/generate/tldsgen/main.go Line 80 in 09d66fb
It also occurs to me that this is almost exactly
This is slightly more code, but it gives the end user a lot more flexibility in choosing what schemes, TLDs, or hostnames are acceptable. For example, I would argue that All the above said, I agree that there should be top-level funcs for common patterns, and that's why I added
This PR adds a new Public function that returns a regex that matches public URLs. Such URLs are defined as: http or https as their protocol (all other protocols will not be matched).
Probably the Public name is not the best one there could be (as IP addresses are also public), so if anyone has any suggestion please feel free to chime in.