Support `thumb.php` thumbnails (was: All images are missing in ZIM) #2088

benoit74 · 2024-09-23T07:08:13Z

This week-end we ran the dev Docker image on a MediaWiki and resulting ZIM seems to have no image at all.

ZIM online: https://dev.library.kiwix.org/viewer#psychonautwiki_en_all_maxi_2024-09/

Zimfarm recipe: https://farm.openzim.org/recipes/psychonautwiki_en_all

Sorry if this is a duplicate (I didn't found such a precise issue in the backlog), but this might be an interesting issue for you since MediaWiki is quite small and all images seems to be broken.

The text was updated successfully, but these errors were encountered:

audiodude · 2024-09-24T02:00:40Z

Sorry if this is a naive observation, but when I go to the config tab of that Zimfarm recipe I see:

format | [ "nopic:nopic", "novid:maxi" ]

Does that make a difference?

kelson42 · 2024-09-24T04:38:28Z

There is a problem indeed

Very interesting at https://dev.library.kiwix.org/viewer#psychonautwiki_en_all_maxi_2024-09/ because first picture is there but not the others.

My best guess:

There a particularism on this wiki
Shoukd be easy to fix.

benoit74 · 2024-09-24T08:45:46Z

format | [ "nopic:nopic", "novid:maxi" ]
Does that make a difference?

This means (sorry if you already know that, but your question is unclear so I prefer to give too much info) that we want two ZIMs, nopic without pictures (this one is OK) and novid (aliased to maxi) without videos (this is the broken one).

Having both formats in a single run is supposed to work well (but could be the source of the issue, didn't tried to reproduce with only novid)

audiodude · 2024-09-25T00:11:43Z

Running the command from zimfarm on 1.14-dev, I get a lot of:

[info] [2024-09-25T00:08:47.941Z] Getting JSON from [https://psychonautwiki.org/w/api.php?action=query&format=json&prop=redirects%7Crevisions%7Ccoordinates&rdlimit=max&rdnamespace=0&formatversion=2&colimit=max&rawcontinue=true&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&gapnamespace=0&gapcontinue=MiPLA&rdcontinue=Stomach_cramp%7C16614]
[warn] [2024-09-25T00:08:49.108Z] Got warning from MW Query {
	"main": {
		"warnings": "Unrecognized parameter: colimit."
	},
	"query": {
		"warnings": "Unrecognized value for parameter \"prop\": coordinates"
	}
}

Not sure if this is a problem (it's in the logs as just a warning).

kelson42 · 2024-09-25T04:31:50Z

@audiodude unrelated but we should open a dedicated issue about that. It has no impact beside flooding the logs with not relevant stuff.

audiodude · 2025-01-03T17:07:03Z

This looks like a clue: https://psychonautwiki.org/w/thumb.php?f=Map-of-world-cannabis-laws.svg.png&width=801

audiodude · 2025-01-03T17:54:44Z

Nevermind the above, I think I've figured it out.

The renderer embeds <img src=> of ../I/401691563f05f4e2d6f02e8261df0cfc.php.

But the downloader creates the file in the ZIM with the filename 401691563f05f4e2d6f02e8261df0cfc.php.webp.

My guess is that this is because mwoffliner doesn't expect image paths to end in .php.

audiodude · 2025-01-03T18:06:01Z

Well I was mostly right.

At article render time, we guess the mime type based on the file extension. Types that include image get a .webp extension added. In other ZIMs, this results in URLs like ArushaElephants.jpg.webp. Since we cannot guess an image type from the extension .php, no .webp extension is added to the article HTML. When we go to download the image however, the HTTP headers indicate that it is in fact an image and the .webp extension gets added to the downloaded file.

audiodude · 2025-01-03T18:13:11Z

So my first theory on how to fix this is to store the images in the ZIM as 401691563f05f4e2d6f02e8261df0cfc.php, which I imagine we don't want because it's "a lie" (the file is webp).

Is it true that we can only convert jpg/png to webp? Otherwise, if a URL appears in the resource="" block of a <figure> or otherwise in the src of an image, can't we assume it's an image and always add .webp if we're in webp mode?

@kelson42

kelson42 · 2025-01-04T14:04:04Z

@audiodude Thank you for having run the analysis. I don't see anything in your comments I could disagree with. Let me try to maybe give a few additional information, the fix might not be that complicated.

First of all I have to apologize to have part of MWoffliner relying on the filename/URL to identify if we deal with an "image" or not. It's not the right way of doing. The right way of doing is to deal with the HTTP Response mime-type (what we do to some extend) and the idea of doing this depending of the (HTML) context is also an interesting one. But at the time we needed this (introduction of the .webp optimisation?) we had not the resources to do the massive re-engineering to do that in the "right" way. I'm not sure that these days, we have more resources to do that properly, so we will probably have to keep current architecture.

Yes, "images" potentially converted to webp are jpg and png. The good news is that all these identifications are based on regexps which are all in src/util/misc.ts... like WEBP_CANDIDATE_IMAGE_MIME_TYPE. If the problem is "just" identifying a string as an image, this could be fixed by modifying a regex (and eventually a bit of the code dealing with it).

What I suspect here is that we face two bugs:

The standard extension thump.php of Mediawiki is not handled properly in function getMediaBase(). Considering that this is standard extension, we could simply build the support in less than 5 lines with the properly dedicated regex and handling. That would probably fix this issue.
But, it seems that the fallback scenario (which implies "inventing" a name based on the path md5 checksum) is incompatible under certain conditions (the one where the filename extension is not correctly found) with the --webp option. I would recommend here:
We could maybe better identify the extension in general for this fallback?
if such a scenario happens, at the time try convert to webp and/or convert to a webp URL, we would display a warning in the log and avoid to convert.

All of this should be doable in less than 20 lines, tests included.

@audiodude Does that make sense? Considering that this is a kind of serious bug (but not a regression AFAIK), should we keep it in the milestone or postponed it?

audiodude · 2025-01-04T14:28:29Z

First of all I have to apologize to have part of MWoffliner relying on the filename/URL to identify if we deal with an "image" or not. It's not the right way of doing. The right way of doing is to deal with the HTTP Response mime-type (what we do to some extend) and the idea of doing this depending of the (HTML) context is also an interesting one. But at the time we needed this (introduction of the .webp optimisation?) we had not the resources to do the massive re-engineering to do that in the "right" way. I'm not sure that these days, we have more resources to do that properly, so we will probably have to keep current architecture.

Yes, this would be a major undertaking, because right now the code that writes the article HTML (with the src="../I/401691563f05f4e2d6f02e8261df0cfc.php" attribute) is completely decoupled from the code that downloads the actual image and processes it to a .webp.

Yes, "images" potentially converted to webp are jpg and png. The good news is that all these identifications are based on regexps which are all in src/util/misc.ts... like WEBP_CANDIDATE_IMAGE_MIME_TYPE. If the problem is "just" identifying a string as an image, this could be fixed by modifying a regex (and eventually a bit of the code dealing with it).

What I suspect here is that we face two bugs:

The standard extension thump.php of Mediawiki is not handled properly in function getMediaBase(). Considering that this is standard extension, we could simply build the support in less than 5 lines with the properly dedicated regex and handling. That would probably fix this issue.

I think this is the same thing that I suggested. If the path includes thumb.php, and we are in webp mode, always put ../I/401691563f05f4e2d6f02e8261df0cfc.php.webp. However, this means that we are assuming thumb.php always returns a png or jpg, which I'm not sure is correct.

But, it seems that the fallback scenario (which implies "inventing" a name based on the path md5 checksum) is incompatible under certain conditions (the one where the filename extension is not correctly found) with the --webp option. I would recommend here:

We could maybe better identify the extension in general for this fallback?

if such a scenario happens, at the time try convert to webp and/or convert to a webp URL, we would display a warning in the log and avoid to convert.

Does this mean re-running the filename mime type detecting (regex based on path) at the time that we are downloading the image/converting to webp, and refusing to do the conversion if the detection doesn't produce png or jpg?

@audiodude Does that make sense? Considering that this is a kind of serious bug (but not a regression AFAIK), should we keep it in the milestone or postponed it?

Given how delayed 1.14 is, and considering that as you pointed out it's not a regression, I would recommend postponing.

kelson42 · 2025-01-04T14:37:28Z

However, this means that we are assuming thumb.php always returns a png or jpg, which I'm not sure is correct.

We should extract proper filename (with proper extension) from the whole URL.

kelson42 · 2025-01-04T14:38:14Z

Does this mean re-running the filename mime type detecting (regex based on path) at the time that we are downloading the image/converting to webp, and refusing to do the conversion if the detection doesn't produce png or jpg?

Yes this what i suggest.

benoit74 added bug question labels Sep 23, 2024

kelson42 added this to the 1.14.0 milestone Sep 24, 2024

kelson42 removed the question label Sep 24, 2024

kelson42 self-assigned this Dec 3, 2024

kelson42 changed the title ~~All images are missing in ZIM~~ Support thumb.php thunmails (was: All images are missing in ZIM) Jan 4, 2025

kelson42 assigned audiodude Jan 4, 2025

kelson42 changed the title ~~Support thumb.php thunmails (was: All images are missing in ZIM)~~ Support thumb.php thumbnails (was: All images are missing in ZIM) Jan 4, 2025

kelson42 modified the milestones: 1.14.0, 1.15.0 Jan 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `thumb.php` thumbnails (was: All images are missing in ZIM) #2088

Support `thumb.php` thumbnails (was: All images are missing in ZIM) #2088

benoit74 commented Sep 23, 2024

audiodude commented Sep 24, 2024

kelson42 commented Sep 24, 2024 •

edited

Loading

benoit74 commented Sep 24, 2024

audiodude commented Sep 25, 2024

kelson42 commented Sep 25, 2024 •

edited

Loading

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025 •

edited

Loading

kelson42 commented Jan 4, 2025 •

edited

Loading

audiodude commented Jan 4, 2025

kelson42 commented Jan 4, 2025

kelson42 commented Jan 4, 2025

Support thumb.php thumbnails (was: All images are missing in ZIM) #2088

Support thumb.php thumbnails (was: All images are missing in ZIM) #2088

Comments

benoit74 commented Sep 23, 2024

audiodude commented Sep 24, 2024

kelson42 commented Sep 24, 2024 • edited Loading

benoit74 commented Sep 24, 2024

audiodude commented Sep 25, 2024

kelson42 commented Sep 25, 2024 • edited Loading

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025

audiodude commented Jan 3, 2025 • edited Loading

kelson42 commented Jan 4, 2025 • edited Loading

audiodude commented Jan 4, 2025

kelson42 commented Jan 4, 2025

kelson42 commented Jan 4, 2025

Support `thumb.php` thumbnails (was: All images are missing in ZIM) #2088

Support `thumb.php` thumbnails (was: All images are missing in ZIM) #2088

kelson42 commented Sep 24, 2024 •

edited

Loading

kelson42 commented Sep 25, 2024 •

edited

Loading

audiodude commented Jan 3, 2025 •

edited

Loading

kelson42 commented Jan 4, 2025 •

edited

Loading