Article Viewer displays raw wikicode for "Health effects of electronic cigarettes" #5957

Open
ragesoss opened this issue Sep 10, 2024 · 15 comments

@ragesoss
Member

Visit here and wait for the authorship highlighting to load: https://dashboard.wikiedu.org/courses/UCSF/Foundations_II_(Summer_2024)/articles/edited?showArticle=52260526

Once it loads, the rendered article is replaced by highlighted wikicode:

Screenshot from 2024-09-10 12-12-23

Additional context

The ArticleViewer initially loads the parsed version of the current article and requests the authorship data from the wikiwho server. Once received, the wikiwho data (which is annotated wikicode) is processed by Dashboard code to add CSS classes on a per-author basis, and is then sent to mediawiki to parse. No explicit errors occur in this example, either in the JS console or in network requests, but the call to the mediawiki API parse action returns unparsed wikicode. One possible explanation is that the Dashboard code that operates on the wikiwho data is mishandling some particular aspect of this page's wikicode, resulting in a version that mediawiki can't parse properly.

@ragesoss ragesoss added the bug label Sep 10, 2024
@empty-codes
Contributor

Hello @ragesoss, I would like to try working on this!

@ragesoss
Member Author

ragesoss commented Oct 3, 2024

@empty-codes go for it. This one may be a challenge, as I'm not sure which codebase is ultimately responsible for the error. There's almost certainly something about the wikicode for these example articles that is triggering the bug, so just knowing precisely what triggers it would be helpful.

@Abishekcs
Contributor

Abishekcs commented Oct 3, 2024

Hi @ragesoss and @empty-codes,

I hope you're doing well! I wanted to share an observation I made while looking into this bug earlier today.

It seems that for the articles where this issue occurs most frequently, WikiWho is rendering the sup tag as shown below:

a) Opening tag: &lt;ref&gt; (where it should be <sup>)
b) Closing tag: &lt;/ref&gt; (where it should be </sup>)
c) Self-closing tag: &lt;ref /&gt; (where it should be <sup />)

For example, in the actual HTML output this appears as: <ref /> .

Interestingly, in articles where this bug does not occur, WikiWho outputs the <sup> tag correctly as <sup>.

Additionally, I've noticed that the bug tends to happen in articles where contributors have also added the citations.
However, I did encounter at least one article where a citation didn't cause this issue, though I can't recall which article it was.
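
A quick way to check whether a given article shows this symptom is to count the escaped tag forms (a, b, c above) in the returned HTML. The following is a minimal sketch only; `count_escaped_ref_tags` is a hypothetical helper and is not part of the Dashboard or WikiWho code.

```python
import re
from collections import Counter

# Matches an HTML-escaped <ref>, </ref> or <ref ... /> tag, e.g. "&lt;ref /&gt;".
# \b keeps it from matching other tags that start with "ref", such as <references>.
ESCAPED_REF = re.compile(r"&lt;(/?)ref\b(.*?)&gt;", re.DOTALL)


def count_escaped_ref_tags(html: str) -> Counter:
    """Count escaped ref tags by form: opening, closing, self_closing."""
    counts = Counter()
    for match in ESCAPED_REF.finditer(html):
        slash, attributes = match.groups()
        if slash:
            counts["closing"] += 1
        elif attributes.rstrip().endswith("/"):
            counts["self_closing"] += 1
        else:
            counts["opening"] += 1
    return counts
```

A page whose HTML yields all-zero counts is probably unaffected; any non-zero count means `<ref>` markup reached the browser as escaped text.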

Please treat these as initial thoughts, as I'm still trying to understand how the WikiWho algorithm works. I hope my explanation was clear.

Good luck with solving the bug @empty-codes!

@empty-codes
Contributor

empty-codes commented Oct 4, 2024

@Abishekcs Thank you so much for your insights; they were really helpful for getting started.
I still have not solved the bug, but these are my findings so far @ragesoss : (sorry it's a bit long)

Firstly, I found that there are actually two different APIs involved:

  1. The WikiWhoAPI which is designed to parse historical revisions of Wikipedia articles, providing detailed provenance of each token (word) in terms of who added, removed, or reintroduced it across different revisions.

  2. The WhoColorAPI which is built on top of the WikiWho API and allows for the visualization of authorship data by color-coding tokens in the text based on their original authors. Wiki Edu Foundation employs this to show authorship data on its dashboard for students.

For this issue, the WhoColorAPI is the one we're concerned with.

Flow:

  1. Initially, the ArticleViewer component loads the parsed version of the article:

    <div id="article-scrollbox-id" className="article-scrollbox">
      {
        fetched ? <ParsedArticle highlightedHtml={highlightedHtml} whocolorHtml={whoColorHtml} parsedArticle={parsedArticle} /> : <Loading />
      }
    </div>

    The ParsedArticle component is defined in ParsedArticle.jsx:

    export const ParsedArticle = ({ highlightedHtml, whocolorHtml, parsedArticle }) => {
      let articleHTML = highlightedHtml || whocolorHtml || parsedArticle;

    The ParsedArticle component accepts highlightedHtml, whocolorHtml, and parsedArticle as props and displays one of them based on what is available.

  2. It then fetches authorship data from the WikiWho server.

  3. Once the authorship data is available, it replaces the initially rendered parsed article with the highlighted HTML (from whoColorHtml).
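
The precedence in step 1 explains why the rendered article gets swapped out: as soon as whocolorHtml is set, it wins over parsedArticle, whether or not it is well-formed. Below is a minimal Python analogue of the `highlightedHtml || whocolorHtml || parsedArticle` chain; the function name is hypothetical.

```python
from typing import Optional


def choose_article_html(highlighted_html: Optional[str],
                        whocolor_html: Optional[str],
                        parsed_article: Optional[str]) -> Optional[str]:
    """Mirror `highlightedHtml || whocolorHtml || parsedArticle`."""
    # Like JavaScript's ||, Python's `or` skips None and empty strings,
    # so the first available value in this order is displayed.
    return highlighted_html or whocolor_html or parsed_article
```

No validity check happens at this point, so a whocolorHtml value full of unparsed wikicode still replaces a perfectly good parsedArticle.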

Here's the difference between the three props:

  • parsedArticle: This is the basic version of the article fetched by the fetchParsedArticle method that is initially rendered. This is just the plain article HTML without any authorship highlighting obtained from the MediaWiki API call using the parsedArticleURL(lastRevisionId) method in the URLBuilder, which returns a URL of this format:

    `${base}/w/api.php?action=parse&oldid=${lastRevisionId}&disableeditsection=true&format=json`;
  • whoColorHtml: This is the raw HTML returned directly by the WhoColor API. It includes the token-level spans that identify which editor added or modified specific parts of the text. ArticleViewer.jsx obtains it by calling fetchWhocolorHtml(), which in turn calls __wikiwhoColorURLTimedRequestPromise(timeout, lastRevisionId); that function uses the wikiwhoColorURL URLBuilder method to build the API URL in the format below:

    const url = `${WIKIWHO_DOMAIN}/${language}/whocolor/v1.0.0-beta/${encodeURIComponent(title)}/${revisionId}/`;
  • highlightedHtml: This is the processed version of whoColorHtml, where additional formatting and styling are applied to correctly display the authorship data (e.g., wrapping spans for tokens with additional attributes for styling). It is populated by the highlightAuthors function, which uses the whoColorHtml state:

    // This takes the extended_html from the whoColor API, and replaces the span
    // annotations with ones that are more convenient to style in React.
    // The matching and replacing of spans is tightly coupled to the span format
    // provided by the whoColor API: https://github.com/wikimedia/wikiwho_api
    const highlightAuthors = () => {
      let html = whoColorHtml;
      // ... (excerpt: `user`, `i` and `colors` are defined by surrounding
      // code that is elided here) ...
      // Replace each editor span for this user with one that includes their
      // username and color class.
      const prevHtml = html;
      const colorClass = colors[i];
      const styledAuthorSpan = `<span title="${user.name}" class="editor-token token-editor-${user.userid} ${colorClass}`;
      const authorSpanMatcher = new RegExp(`<span class="editor-token token-editor-${user.userid}`, 'g');
      html = html.replace(authorSpanMatcher, styledAuthorSpan);
      // ... (remaining logic elided) ...
      setHighlightedHtml(html); // highlightedHtml state variable populated here
      setPendingRequest(false);
    };
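
To make the span rewrite easier to experiment with outside React, here is a rough Python analogue of the replace step above. It is a sketch under assumptions: `style_author_spans`, the sample user name, and the `user-highlight-1` class are made up for illustration, and the `(?!\d)` guard is an addition that the excerpt above does not contain.

```python
import re
from html import escape


def style_author_spans(html: str, user_id: str, user_name: str, color_class: str) -> str:
    """Add a title and a color class to every WhoColor span of one editor."""
    # Match the opening of each span for this editor, up to (not including)
    # the closing quote of the class attribute. (?!\d) keeps editor "12"
    # from also matching editor "123"; this guard is an assumption of the sketch.
    matcher = re.compile(
        rf'<span class="editor-token token-editor-{re.escape(user_id)}(?!\d)'
    )
    styled = (
        f'<span title="{escape(user_name, quote=True)}" '
        f'class="editor-token token-editor-{user_id} {color_class}'
    )
    # A function replacement avoids backslash processing of the replacement text.
    return matcher.sub(lambda _match: styled, html)
```

Because only the opening of the tag is rewritten, anything WhoColor emitted inside the span (including escaped `&lt;ref&gt;` text) passes through unchanged.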

@Abishekcs identified that the problem seems to occur because WikiWho is incorrectly outputting <ref> tags instead of <sup> tags. This misrepresentation leads to HTML errors when the browser attempts to render the content since <ref> is not a recognized standard HTML tag.

Note that the MediaWiki parse action returns a parse.text property that correctly contains all the <sup> tags for both the affected and unaffected articles, which is why the page renders fine initially using the parsedArticle prop/state variable. So the problem is likely not from the MediaWiki API.

The fact that some articles display correctly while others do not suggests that there may be inconsistencies in how the WhoColor API processes certain revisions of articles.

So the questions are:

  1. Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?

  2. What exactly is it about the affected articles that is triggering the bug?

If it is the whoColorHtml replacing it, it is likely an issue with the WhoColor API; if it is the highlightedHtml, it is likely an issue with the highlightAuthors function logic in ArticleViewer.jsx.
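
One way to answer question 1 mechanically is to capture the three strings and report the first stage in which escaped ref tags show up. This is an illustrative sketch; the function name and the idea of passing the stages as a dict are assumptions, not existing Dashboard code.

```python
import re
from typing import Dict, Optional

# An HTML-escaped opening or closing ref tag, e.g. "&lt;ref" or "&lt;/ref".
ESCAPED_REF_START = re.compile(r"&lt;/?ref\b")


def first_stage_with_escaped_refs(stages: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the name of the first stage whose HTML contains escaped ref tags.

    `stages` maps a stage name to its HTML in pipeline order, for example
    parsedArticle, then whoColorHtml, then highlightedHtml.
    """
    for name, html in stages.items():  # dicts preserve insertion order
        if html and ESCAPED_REF_START.search(html):
            return name
    return None
```

A result of `"whoColorHtml"` points at the WhoColor API response, while `"highlightedHtml"` would point at the `highlightAuthors` logic.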

Additionally, @Abishekcs also noticed that the bug tends to happen in articles where contributors have also added citations.

In this specific article: UCSF Foundations II, in the parse action response, there were parse warnings:

parsewarnings[ 
  "Script warning: <span style=\"color:#3a3\">One or more <code style=\"color: inherit; background: inherit; border: none; padding: inherit;\">&#123;{[[Template:cite journal|cite journal]]}}</code> templates have maintenance messages</span>; messages may be hidden ([[Help:CS1_errors#Controlling_error_message_display|help]])."
]

Since the bug appears more frequently in articles with numerous citations, it’s possible that these templates are not being parsed correctly. However, these parse warnings are inconsistent (and probably irrelevant) because they were present in another article that does not have this bug and also absent in another article that has this bug.

Keeping all of this in mind, I will continue investigating. Hopefully I can pinpoint a cause soon 😅

@Abishekcs
Contributor

> So the questions are:
>
>   1. Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?

@empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.

Screenshot from 2024-10-04 17-27-46

Screenshot from 2024-10-04 17-27-21

@empty-codes
Contributor

empty-codes commented Oct 4, 2024

> @empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.

@Abishekcs That answers the question, thank you! I'll also crosscheck from my end.

@empty-codes
Contributor

empty-codes commented Oct 5, 2024

I was stuck trying to find a lead for a while 😅 but I finally got somewhere (I think).

For the Health effects of electronic cigarettes article:

The NewPP limit report for the parsed version

image

The NewPP limit report for the highlighted authorship version

image

For the Hispanic and Latino Americans article:

The NewPP limit report for the parsed version

image

The NewPP limit report for the highlighted authorship version

image

The key issue seems to be that the templates are not being expanded, as indicated by the 0 bytes in the Post-expand include size and Template argument size fields, as well as the minimal expansion depth.

This page provides more context about the meaning of the terms.
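
Since the reports above were shared as screenshots, a small script can make the same comparison repeatable. The sketch below pulls the `used/limit` pairs out of the NewPP limit report comment embedded in parsed HTML and flags the "nothing was expanded" pattern. The function names are hypothetical, and the zero-size heuristic is only an assumption drawn from the observation above.

```python
import re
from typing import Dict, Tuple

# Matches report lines such as "Highest expansion depth: 12/100" or
# "Template argument size: 0/2097152 bytes".
LIMIT_LINE = re.compile(
    r"^\s*(?P<name>[^:\n]+):\s*(?P<used>\d+)/(?P<limit>\d+)", re.MULTILINE
)


def parse_limit_report(html: str) -> Dict[str, Tuple[int, int]]:
    """Return {field name: (used, limit)} from the NewPP limit report comment."""
    start = html.find("NewPP limit report")
    if start == -1:
        return {}
    end = html.find("-->", start)
    block = html[start:end if end != -1 else len(html)]
    return {
        m.group("name").strip(): (int(m.group("used")), int(m.group("limit")))
        for m in LIMIT_LINE.finditer(block)
    }


def looks_unexpanded(report: Dict[str, Tuple[int, int]]) -> bool:
    """Heuristic: both template-related size fields report zero bytes."""
    sizes = [
        used for name, (used, _limit) in report.items()
        if "include size" in name or "argument size" in name
    ]
    return bool(sizes) and all(used == 0 for used in sizes)
```

Running this on both the parsedArticle HTML and the WhoColor HTML for the same revision should show the difference directly, without reading the comment by eye.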

At this point, I would like to ask for further guidance @Abishekcs @ragesoss. What steps should I take from here, please?
Also, I found this file that seems to contain the parsing logic: markuppreparser.inc.php. Is this still in use, and would it be relevant to this issue?

Thank you in advance!

@ragesoss
Member Author

ragesoss commented Oct 7, 2024

That's interesting. The template expansion seems like a good clue, though it's not obvious to me whether an expansion limit is involved or whether it's being misparsed for some other reason. The unparsed ref tags seem likely to be relevant.

I can't tell whether that whoCOLOR repository is indirectly used for this. The main repo for the wikiwho-api servers is https://github.com/wikimedia/wikiwho_api

@empty-codes
Contributor

Noted! I will update you on any new findings @ragesoss

@empty-codes
Contributor

empty-codes commented Oct 14, 2024

@ragesoss I successfully set up the wikiwho_api locally by importing XML dumps and generating pickles for the relevant articles. While examining the Hispanic article, I noticed a discrepancy between a template in the wikitext outputs:

Correct:

{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}Latina|and|Latino (disambiguation){{!}}Latino}}

In extended_html:

{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}<span class="editor-token token-editor-1152308" id="token-53">Latina</span><span class="editor-token token-editor-22831189" id="token-54">|</span>...}}

I traced this issue to the parser logic in ~/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/parser.py and ~/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/special_markups.py. The parser fails to recognize nested templates, prematurely closing templates after encountering {{!}}.

To fix this, I modified the __parse_wiki_text method in parser.py by introducing a template depth counter self.template_depth = 0 and a template stack self.template_stack = [] to track which templates are currently open and ensure they are only closed once all nested templates are closed.

While this change successfully eliminated the unwanted <span> tags in the templates ✔️, the <ref> tag bug persists.
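
For reference, the depth-counter idea can be shown in isolation. The sketch below is not the WhoColor parser and not the modification described above; it is a standalone, hypothetical function that finds where a template really ends when nested `{{...}}` pairs such as `{{!}}` appear inside it.

```python
def find_template_end(text: str, start: int) -> int:
    """Return the index just past the `}}` closing the template that opens at `start`.

    Returns -1 if the template is never closed. Braces are paired naively,
    so this is only an illustration of depth counting, not a wikitext parser.
    """
    if not text.startswith("{{", start):
        raise ValueError("start must point at '{{'")
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1  # a template (possibly nested) opens
            i += 2
        elif pair == "}}":
            depth -= 1  # the innermost open template closes
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1
```

On the Redirect-multi example, stopping at the first `}}` would cut the template off right after the first `{{!}}`, whereas the depth counter only returns at the final `}}`.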

Both the rev_text and wiki_text values contain correctly formatted <ref> tags, but the bug appears at the following point, because of the parser.extended_wiki_text that is generated. When I changed the argument of the convert_wiki_text_to_html function to wiki_text itself, the templates were properly expanded.

parser = WikiMarkupParser(wiki_text, whocolor_data['tokens'])
parser.generate_extended_wiki_markup()
extended_html = wp_rev_text_obj.convert_wiki_text_to_html(parser.extended_wiki_text)

I am currently investigating whether it is caused by the parser logic or token insertions or anything else.
A small note: I have been editing this comment instead of creating a new one each time. Thank you for your patience!

@empty-codes
Contributor

empty-codes commented Oct 21, 2024

Hello @ragesoss, here is my current update.

Initially, I attempted to resolve the issue of the parser prematurely closing templates upon encountering {{!}} by modifying the __parse_wiki_text method in parser.py and introducing a template depth counter (self.template_depth = 0) and a template stack (self.template_stack = []), along with additional logic to track open templates.

However, this approach resulted in another bug involving duplicate {{ and }} template tags, prompting me to revert the changes. Instead, I added new markup in special_markups.py to specifically address template delimiters like {{!}}:

{
    'type': 'single',
    'start_regex': re.compile(r'{{!}}'),
    'end_regex': None,
    'no_spans': True,
    'no_jump': False
},

Note: I am aware the changes here are not permanent because the parser and special_markups py files are actually site packages/dependencies in a path like so: /home/emptycodes/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/parser.py.

This modification effectively eliminated the unwanted <span> tags injected within templates; however, the <ref> tag bug persists.
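
To illustrate what such an entry is meant to achieve, here is a standalone sketch that splits wikitext around `{{!}}` so that the magic word can be kept out of any editor spans. It assumes `no_spans` means "do not wrap this markup in token spans"; `split_protected` is a hypothetical helper and does not reflect how WhoColor applies its markup table internally.

```python
import re

# Same literal as the start_regex above, with the braces escaped for clarity
# and wrapped in a group so that re.split keeps the delimiter.
PIPE_MAGIC_WORD = re.compile(r"(\{\{!\}\})")


def split_protected(wikitext: str):
    """Split wikitext into (segment, protected) pairs around `{{!}}`."""
    parts = PIPE_MAGIC_WORD.split(wikitext)
    # With a capturing group, re.split puts the delimiters at odd indices.
    return [(part, index % 2 == 1) for index, part in enumerate(parts) if part]
```

Segments flagged as protected would be emitted verbatim, while the others remain eligible for per-token spans.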

Both the rev_text and wiki_text values utilize correctly formatted <ref> tags. However, the bug manifests when generating extended HTML:

parser = WikiMarkupParser(wiki_text, whocolor_data['tokens'])
parser.generate_extended_wiki_markup()
extended_html = wp_rev_text_obj.convert_wiki_text_to_html(parser.extended_wiki_text)

By changing the argument of the convert_wiki_text_to_html function to wiki_text, the templates were expanded correctly. This suggests a potential issue with the extended_wiki_text generated by the parser.
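
The same comparison can be reproduced against the MediaWiki API directly, without going through WikiWho, by submitting both strings to `action=parse` and comparing the results. The sketch below only builds the request parameters; `build_parse_request` and `encode_body` are hypothetical helpers, and sending the request (for example as a POST to the wiki's `/w/api.php`) is left out.

```python
from urllib.parse import urlencode


def build_parse_request(wikitext: str, title: str) -> dict:
    """Form parameters for a MediaWiki `action=parse` call on raw wikitext."""
    return {
        "action": "parse",
        "format": "json",
        "title": title,  # page context used while parsing the text
        "text": wikitext,
        "contentmodel": "wikitext",
        # Ask for the rendered HTML plus warnings and the structured limit report.
        "prop": "text|parsewarnings|limitreportdata",
        "disableeditsection": "true",
    }


def encode_body(params: dict) -> bytes:
    """URL-encode the parameters for use as a POST body."""
    return urlencode(params).encode("utf-8")
```

Calling it once with wiki_text and once with parser.extended_wiki_text, keeping every other parameter identical, isolates the injected markup as the only variable.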

The following are steps I have taken in investigating the bug:

  1. I used the WikiTemplate UDL tool to compare the wikitexts (both the wikitext and the parser.extended_wiki_text) of affected and unaffected pages, confirming no significant differences in template formats.

  2. I attempted the following actions without success:

    • Adding an action=expandtemplates request to the handler and utils .py files, so that the MediaWiki expandtemplates action first expands all the templates in the wikitext and the result is then passed to action=parse. This did not resolve the issue.
    • Increasing the request header timeout from 0 to 180 seconds, which also yielded no improvements.
    • Modifying the default_task_soft_time_limit in deployment/celery_config.py from 120 to 300 seconds, yet no change was observed.
    • Uncommenting redundant markup for <ref> tags in the markup file, which had no impact.
  3. I also utilized the Wikipedia Special:ExpandTemplates tool with both the wikitext and the expanded wikitext (including injected editor tokens) of a buggy page. The templates in the response expanded correctly with both of the input wikitexts, indicating that the injected editor tokens are not likely to be responsible for the problem.

In conclusion, I cannot pinpoint a cause because:

  1. If the cause is attributed to parsing logic, I found no evidence supporting this claim, as the same dashboard logic works on other pages, and templates expand properly using the Special:ExpandTemplates tool.
  2. If the cause is attributed to excessive template usage and citations, the same page with the same templates and citations renders correctly on Wikipedia and also the initial parsed article view.
  3. If the cause is attributed to timeouts, increasing request timeout and task limits did not yield any improvements.

Would you recommend I continue working on this issue? I'm sure there's something I am missing but I cannot pinpoint exactly what it is. Thank you for your patience!

@ragesoss
Member Author

@empty-codes thanks! This is really useful documentation of your debugging work. I suggest leaving this one; hopefully we can find the next clue at a later time, but it's a relatively rare bug.

I just checked the second example with the Who Wrote That? tool on Wikipedia, and it also displays this buggy behavior (which makes sense based on your debugging, as it's clearly a problem with the WikiWho processing). So we can be pretty confident now that it's not a bug in our codebase.

wwt wikicode

One really useful way to wrap this up would be to open an issue on Phabricator against the Who-Wrote-That project, summarizing what you've learned about the likely source of the bug within the WikiWho parser. There are some other issues there already related to pages that don't work as expected, but I don't see any that are clearly the same issue as this one, and I didn't spot anything along the lines of what you've done here to narrow down the source.

@empty-codes
Contributor

@ragesoss I've created the Phabricator issue here: https://phabricator.wikimedia.org/T377898

While I wasn't able to completely pinpoint the source of the bug, I learned a lot throughout the process and I'm glad this documentation will be useful. Thanks for your guidance throughout this process! 🙏
