filter_external_links truncates colon at the end of URL #333

Open
harej opened this issue Sep 11, 2024 · 2 comments

harej commented Sep 11, 2024

Test case:

import mwparserfromhell

wikitext = """
<ref>{{cite news | first=109th Congress, 1st Session | last=U.S. Senate |  title= S. 1033, Secure America and Orderly Immigration Act | date=[[May 12]] [[2005]] |  url =http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: | work =Thomas |  accessdate = 2007-09-30 | }}</ref>
"""

parsed = mwparserfromhell.parse(wikitext)
print(parsed.filter_external_links())

What I get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']
What I should get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:'] with the colon at the end

harej (Author) commented Sep 11, 2024

Yes, that's a valid URL, or at least it was nearly 20 years ago. https://web.archive.org/web/20080918055001/http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:

(You may need to copy that URL with the colon into the address bar manually)

lahwaacz (Contributor) commented Nov 15, 2024

See the difference here:

import mwparserfromhell

wikitext1 = "http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"
wikitext2 = "[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]"

parsed1 = mwparserfromhell.parse(wikitext1)
parsed2 = mwparserfromhell.parse(wikitext2)
print(parsed1.filter_external_links())
print(parsed2.filter_external_links())

Which gives

['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']
['[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]']

Note that this is consistent with how MediaWiki behaves 🤷
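
For readers unfamiliar with that behavior: for bare (unbracketed) URLs, MediaWiki treats trailing punctuation as belonging to the surrounding sentence rather than to the link. The snippet below is a simplified, standard-library-only sketch of that rule, not the actual MediaWiki or mwparserfromhell code. The function name `split_trailing_punctuation` is hypothetical, and special cases such as trailing HTML entities are ignored.

```python
def split_trailing_punctuation(url):
    """Split a bare URL into (link, trail), roughly as a wiki parser would.

    Simplified sketch: trailing , ; . : ! ? are moved out of the link.
    A trailing ")" is also moved out, but only if the URL has no "(".
    """
    separators = ",;.:!?"
    if "(" not in url:
        separators += ")"
    link = url.rstrip(separators)
    trail = url[len(link):]
    return link, trail


link, trail = split_trailing_punctuation(
    "http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"
)
print(repr(link), repr(trail))
```

Under this rule the final colon ends up in the trail, which matches the bare-URL result above. Inside square brackets the URL is delimited explicitly, so no such stripping applies.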

In your snippet, the problem is that mwparserfromhell does not expand templates, so it cannot know that the url parameter ends up inside square brackets.
