Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support thumb.php thumbnails (was: All images are missing in ZIM) #2088

Open
benoit74 opened this issue Sep 23, 2024 · 13 comments
Open

Support thumb.php thumbnails (was: All images are missing in ZIM) #2088

benoit74 opened this issue Sep 23, 2024 · 13 comments
Assignees
Labels
Milestone

Comments

@benoit74
Copy link
Contributor

This week-end we ran the dev Docker image on a MediaWiki and resulting ZIM seems to have no image at all.

ZIM online: https://dev.library.kiwix.org/viewer#psychonautwiki_en_all_maxi_2024-09/

Zimfarm recipe: https://farm.openzim.org/recipes/psychonautwiki_en_all

ZIM request: openzim/zim-requests#634

Sorry if this is a duplicate (I didn't found such a precise issue in the backlog), but this might be an interesting issue for you since MediaWiki is quite small and all images seems to be broken.

@audiodude
Copy link
Member

Sorry if this is a naive observation, but when I go to the config tab of that Zimfarm recipe I see:

format | [ "nopic:nopic", "novid:maxi" ]

Does that make a difference?

@kelson42 kelson42 added this to the 1.14.0 milestone Sep 24, 2024
@kelson42
Copy link
Collaborator

kelson42 commented Sep 24, 2024

There is a problem indeed
image

Very interesting at https://dev.library.kiwix.org/viewer#psychonautwiki_en_all_maxi_2024-09/ because first picture is there but not the others.

My best guess:

  • There a particularism on this wiki
  • Shoukd be easy to fix.

@kelson42 kelson42 removed the question label Sep 24, 2024
@benoit74
Copy link
Contributor Author

format | [ "nopic:nopic", "novid:maxi" ]
Does that make a difference?

This means (sorry if you already know that, but your question is unclear so I prefer to give too much info) that we want two ZIMs, nopic without pictures (this one is OK) and novid (aliased to maxi) without videos (this is the broken one).

Having both formats in a single run is supposed to work well (but could be the source of the issue, didn't tried to reproduce with only novid)

@audiodude
Copy link
Member

Running the command from zimfarm on 1.14-dev, I get a lot of:

[info] [2024-09-25T00:08:47.941Z] Getting JSON from [https://psychonautwiki.org/w/api.php?action=query&format=json&prop=redirects%7Crevisions%7Ccoordinates&rdlimit=max&rdnamespace=0&formatversion=2&colimit=max&rawcontinue=true&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&gapnamespace=0&gapcontinue=MiPLA&rdcontinue=Stomach_cramp%7C16614]
[warn] [2024-09-25T00:08:49.108Z] Got warning from MW Query {
	"main": {
		"warnings": "Unrecognized parameter: colimit."
	},
	"query": {
		"warnings": "Unrecognized value for parameter \"prop\": coordinates"
	}
}

Not sure if this is a problem (it's in the logs as just a warning).

@kelson42
Copy link
Collaborator

kelson42 commented Sep 25, 2024

@audiodude unrelated but we should open a dedicated issue about that. It has no impact beside flooding the logs with not relevant stuff.

@kelson42 kelson42 self-assigned this Dec 3, 2024
@audiodude
Copy link
Member

@audiodude
Copy link
Member

Nevermind the above, I think I've figured it out.

The renderer embeds <img src=> of ../I/401691563f05f4e2d6f02e8261df0cfc.php.

But the downloader creates the file in the ZIM with the filename 401691563f05f4e2d6f02e8261df0cfc.php.webp.

My guess is that this is because mwoffliner doesn't expect image paths to end in .php.

@audiodude
Copy link
Member

Well I was mostly right.

At article render time, we guess the mime type based on the file extension. Types that include image get a .webp extension added. In other ZIMs, this results in URLs like ArushaElephants.jpg.webp. Since we cannot guess an image type from the extension .php, no .webp extension is added to the article HTML. When we go to download the image however, the HTTP headers indicate that it is in fact an image and the .webp extension gets added to the downloaded file.

@audiodude
Copy link
Member

audiodude commented Jan 3, 2025

So my first theory on how to fix this is to store the images in the ZIM as 401691563f05f4e2d6f02e8261df0cfc.php, which I imagine we don't want because it's "a lie" (the file is webp).

Is it true that we can only convert jpg/png to webp? Otherwise, if a URL appears in the resource="" block of a <figure> or otherwise in the src of an image, can't we assume it's an image and always add .webp if we're in webp mode?

@kelson42

@kelson42 kelson42 changed the title All images are missing in ZIM Support thumb.php thunmails (was: All images are missing in ZIM) Jan 4, 2025
@kelson42 kelson42 changed the title Support thumb.php thunmails (was: All images are missing in ZIM) Support thumb.php thumbnails (was: All images are missing in ZIM) Jan 4, 2025
@kelson42
Copy link
Collaborator

kelson42 commented Jan 4, 2025

@audiodude Thank you for having run the analysis. I don't see anything in your comments I could disagree with. Let me try to maybe give a few additional information, the fix might not be that complicated.

First of all I have to apologize to have part of MWoffliner relying on the filename/URL to identify if we deal with an "image" or not. It's not the right way of doing. The right way of doing is to deal with the HTTP Response mime-type (what we do to some extend) and the idea of doing this depending of the (HTML) context is also an interesting one. But at the time we needed this (introduction of the .webp optimisation?) we had not the resources to do the massive re-engineering to do that in the "right" way. I'm not sure that these days, we have more resources to do that properly, so we will probably have to keep current architecture.

Yes, "images" potentially converted to webp are jpg and png. The good news is that all these identifications are based on regexps which are all in src/util/misc.ts... like WEBP_CANDIDATE_IMAGE_MIME_TYPE. If the problem is "just" identifying a string as an image, this could be fixed by modifying a regex (and eventually a bit of the code dealing with it).

What I suspect here is that we face two bugs:

  • The standard extension thump.php of Mediawiki is not handled properly in function getMediaBase(). Considering that this is standard extension, we could simply build the support in less than 5 lines with the properly dedicated regex and handling. That would probably fix this issue.
  • But, it seems that the fallback scenario (which implies "inventing" a name based on the path md5 checksum) is incompatible under certain conditions (the one where the filename extension is not correctly found) with the --webp option. I would recommend here:
  • We could maybe better identify the extension in general for this fallback?
  • if such a scenario happens, at the time try convert to webp and/or convert to a webp URL, we would display a warning in the log and avoid to convert.

All of this should be doable in less than 20 lines, tests included.

@audiodude Does that make sense? Considering that this is a kind of serious bug (but not a regression AFAIK), should we keep it in the milestone or postponed it?

@audiodude
Copy link
Member

First of all I have to apologize to have part of MWoffliner relying on the filename/URL to identify if we deal with an "image" or not. It's not the right way of doing. The right way of doing is to deal with the HTTP Response mime-type (what we do to some extend) and the idea of doing this depending of the (HTML) context is also an interesting one. But at the time we needed this (introduction of the .webp optimisation?) we had not the resources to do the massive re-engineering to do that in the "right" way. I'm not sure that these days, we have more resources to do that properly, so we will probably have to keep current architecture.

Yes, this would be a major undertaking, because right now the code that writes the article HTML (with the src="../I/401691563f05f4e2d6f02e8261df0cfc.php" attribute) is completely decoupled from the code that downloads the actual image and processes it to a .webp.

Yes, "images" potentially converted to webp are jpg and png. The good news is that all these identifications are based on regexps which are all in src/util/misc.ts... like WEBP_CANDIDATE_IMAGE_MIME_TYPE. If the problem is "just" identifying a string as an image, this could be fixed by modifying a regex (and eventually a bit of the code dealing with it).

What I suspect here is that we face two bugs:

  • The standard extension thump.php of Mediawiki is not handled properly in function getMediaBase(). Considering that this is standard extension, we could simply build the support in less than 5 lines with the properly dedicated regex and handling. That would probably fix this issue.

I think this is the same thing that I suggested. If the path includes thumb.php, and we are in webp mode, always put ../I/401691563f05f4e2d6f02e8261df0cfc.php.webp. However, this means that we are assuming thumb.php always returns a png or jpg, which I'm not sure is correct.

  • But, it seems that the fallback scenario (which implies "inventing" a name based on the path md5 checksum) is incompatible under certain conditions (the one where the filename extension is not correctly found) with the --webp option. I would recommend here:
  • We could maybe better identify the extension in general for this fallback?
  • if such a scenario happens, at the time try convert to webp and/or convert to a webp URL, we would display a warning in the log and avoid to convert.

Does this mean re-running the filename mime type detecting (regex based on path) at the time that we are downloading the image/converting to webp, and refusing to do the conversion if the detection doesn't produce png or jpg?

@audiodude Does that make sense? Considering that this is a kind of serious bug (but not a regression AFAIK), should we keep it in the milestone or postponed it?

Given how delayed 1.14 is, and considering that as you pointed out it's not a regression, I would recommend postponing.

@kelson42
Copy link
Collaborator

kelson42 commented Jan 4, 2025

However, this means that we are assuming thumb.php always returns a png or jpg, which I'm not sure is correct.

We should extract proper filename (with proper extension) from the whole URL.

@kelson42
Copy link
Collaborator

kelson42 commented Jan 4, 2025

Does this mean re-running the filename mime type detecting (regex based on path) at the time that we are downloading the image/converting to webp, and refusing to do the conversion if the detection doesn't produce png or jpg?

Yes this what i suggest.

@kelson42 kelson42 modified the milestones: 1.14.0, 1.15.0 Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants