
MediaWiki images and image pages are not being indexed properly by external search engines
Open, Low, Public

Assigned To
None
Authored By
bzimport
Aug 8 2013, 11:06 AM

Description

Author: visitor

Description:

Problem:

The conventions used by MediaWiki for dealing with uploaded images seem to result in the uploaded images and their description pages not being indexed by Google by default.

Suggested fix:

Adding an optional configuration switch that can force the default link for every thumbnail to be the URL of the original file rather than its description page.

Thumbnails created using "File:" already include a small additional icon that always points to the description page, so there would still be description page links alongside each thumbnail ... however, we'd need to apply these description-page icons to the auto-generated thumbnails that appear on Category pages, too.

Not all wiki owners would want this change, so it'd need to be "opt-in".

Presumed (speculated) reason for failure:

Normal default behaviour on a website is for a clickable thumbnail image to link directly to the full version of the file.

MediaWiki breaks this convention in order to have an intermediate page to hold additional metadata and history information about the image. Unfortunately, the URL for this additional page ends in an image file identifier (''e.g.'' ".jpg"), which means that search engines may have an understandable tendency to assume that the resource being linked to is an image file rather than HTML/XML.

It seems that the default behaviour for the search engine is then NOT to attempt to explore the innards of the "faux .jpg" file but to pass the URL to its image-indexing routines. These then attempt to load the file that corresponds to the description page, recognise immediately from the header that this is ''not'' an image file - and discard it.
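The speculated failure mode above can be sketched in a few lines. This is only an illustration, assuming a crawler that classifies URLs by path suffix before fetching them; the suffix list and both function names are hypothetical and say nothing about how Google actually works.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of suffixes that a crawler might route to image indexing.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def guess_kind_from_url(url: str) -> str:
    """Guess 'image' or 'page' from the URL path alone, without any request."""
    path = urlparse(url).path
    suffix = posixpath.splitext(path)[1].lower()
    return "image" if suffix in IMAGE_SUFFIXES else "page"


def looks_mislabelled(url: str, content_type: str) -> bool:
    """True when the URL suggests an image but the response is something else."""
    is_image_response = content_type.lower().startswith("image/")
    return guess_kind_from_url(url) == "image" and not is_image_response
```

Under this assumed heuristic, a description page at `/wiki/File:Fish.jpg` served as `text/html` would be classified as an image by its URL and then rejected once the response type is seen.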

This can result in a three-way failure: (1) the nice description page with image preview and copies of all the metadata as text, and with additional written descriptions, is ignored by the search engine because it appears to be a malformed (and potentially malicious) image file; (2) the original full-size image file with embedded metadata is also not indexed because Google never gets to read a page that links directly to it; and (3) the article thumbnail ''is'' indexed, but is low-quality and low resolution, inherits no embedded metadata, and might be flagged by the search engine as being associated with a bad (and potentially broken or malicious) link, so it gets assigned a poor ranking.

In the normal course of affairs, Google will never get to find out that the original image files exist. Google also can’t read the wiki’s thumbnail image listings (which ''do'' contain direct links to the images), because these are automatically given a NOINDEX tag, which specifically tells Google not to index them, and this flag doesn't seem to be overridable.

Presumed (speculated) reason for Google's behaviour:

We can argue that this problem is not down to a bug with MediaWiki, and is instead Google's fault - shouldn't Google analyse pages based on content rather than on apparent filename suffixes?

However, Google can counter-argue that ignoring apparent filenames would make their search routines less efficient, that authors should be encouraged to use appropriate filetype suffixes for their files, and that since maliciously-constructed JPG files are a known vector for malware, there's perhaps even an argument that Google ''should'' be deliberately boycotting URLs that suggest that they lead to image files (but don't), on principle.

In any case, search engine optimisation is the job of a webpage author not Google, and if we decide to make our web-pages operate in a way that is misleading and results in pages not being crawled, then that's our problem rather than Google's.

See also T6421: Image file extension should not be part of the name

Partial temporary workarounds:

A wiki’s owner can add direct (Google-followable) links to point to the original image files themselves, either (1) by manually compiling a separate listings page with the direct links (which includes images but is missing any surrounding referential context), (2) by manually using the LINK= property for each individual manually-embedded thumbnail (which can involve a lot of extra work), or (3) by replacing MediaWiki’s "File:" link syntax with a custom thumbnail template that includes both a link to the image description page, and a direct link to the original image.

However, the "Link=" override method still doesn't solve the problem of creating corresponding direct links for the Category-page thumbnails generated by MediaWiki.

If a wiki is used partly as a storage system for a lot of large high-quality images, then it's quite possible that many of those images will mainly be accessed via category page thumbnails and may not have separate additional embedded thumbnails that the "Link=" override can be applied to - we still need some way of telling MediaWiki that we want it to create Google-followable paths from the category page thumbnails to the full image files.

Implementation

The suggested fix would be to have a switch that makes image thumbnails link directly to the original files, regardless of whether they were created within the body of an article using the File: syntax, or were automatically generated near the end of a "Category" page.

A secondary link would then be provided either hanging below, on or by the thumbnail to point to the description page. This secondary link already exists for thumbnails generated by "File:", but if the new global override feature was implemented, a similar "info page" link icon would need to be added to Category page thumbnails.

Possible enhanced implementation

If the bug-fixer wanted to be especially creative, the flag could support multiple options, for instance, to allow a choice of icon and icon placement – a wiki owner could then choose to specify, say, that an "INFO" strip icon sits below every thumbnail, or that a red dot icon or a "page corner curl" icon floats superimposed on top of the bottom right corner of every thumbnail image to link to the description page, while the rest of the exposed thumbnail links to the image itself.

Although the priority for this fix would be to allow a wiki’s administrator to solve the current problem with Google not indexing full images (without really changing the look of the pages), an "enhanced" implementation with choice of infopage icon and position would give MediaWiki additional visual customisation options.


Version: 1.22.0
Severity: normal

Details

Reference
bz52647

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:00 AM
bzimport set Reference to bz52647.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

> Presumed (speculated) reason for failure:
>
> We can argue that this problem is not down to a bug with MediaWiki, and is
> instead Google's fault - shouldn't Google analyse pages based on content
> rather than on apparent filename suffixes?

I agree. Plus the "page in between" is required in order to show license information plus history/revisions of a file. Hence WONTFIXing.
If you have clearer proof that search engines are to blame, please contact the corresponding search engines.

> In any case, search engine optimisation is the job of a webpage author not
> Google

MediaWiki is not meant to concentrate on SEO, but nobody will stop webmasters from enhancing MediaWiki (e.g. via extensions or code changes) to perform better when it comes to SEO.
It's just not something that is planned for the upstream codebase.

Apparently, MediaWiki is already doing its job for images. But there's something users should also do on their part.

Google's "Image publishing guidelines" [1] suggest, as a best practice, using "detailed, informative filenames". Failing to do that may cause Google to index the thumbnails because of their context in the article, not the filename.

It also mentions "Even if your image appears on several pages on your site, consider creating a standalone landing page for each image". That's where image description pages come into play.

Other actions that can be taken are using an image sitemap [2] [3] or even preventing the indexing of thumbnails with robots.txt rules. Since thumbs are grouped in a separate directory, this can be done easily.
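As a rough sketch of the robots.txt idea, the snippet below checks a candidate rule set with Python's standard `urllib.robotparser`. It assumes the default upload layout in which thumbnails live under `/images/thumb/`; the host name and file paths are made up for illustration, and a wiki with a custom upload path would need a different rule.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: block the thumbnail directory, leave originals crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/thumb/
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    base = "https://wiki.example.org"  # hypothetical wiki
    print(can_crawl(ROBOTS_TXT, base + "/images/thumb/a/ab/Fish.jpg/220px-Fish.jpg"))
    print(can_crawl(ROBOTS_TXT, base + "/images/a/ab/Fish.jpg"))
```

Blocking thumbnails this way only stops them from being crawled; by itself it does not give a crawler a path to the original files.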

MediaWiki already has a tool for generating sitemaps [4], maybe it can be improved to generate image sitemaps as well (I don't know if it generates them). Maybe create an enhancement bug for this one?


[1] https://support.google.com/webmasters/answer/114016?hl=en

[2] https://support.google.com/webmasters/answer/178636?hl=en

[3] http://webmasters.stackexchange.com/questions/27176/make-google-index-the-actual-image-not-the-thumbnail

[4] https://www.mediawiki.org/wiki/Manual:GenerateSitemap.php
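For reference, an image sitemap of the kind described in [2] can be generated with a few lines of standard-library Python. This is a minimal sketch rather than anything MediaWiki ships: the function name and the example URLs are hypothetical, and the mapping from pages to images would have to come from the wiki's own data.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build a sitemap that lists, per page URL, the images used on that page."""
    # Use the sitemap namespace as default and a readable prefix for images.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_image_sitemap({
        "https://wiki.example.org/wiki/Fish": [
            "https://wiki.example.org/images/a/ab/Fish.jpg",
        ],
    }))
```

Each `<url>` entry pairs a page with the original image files it uses, which is the association that a crawler would otherwise have to discover through links.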

visitor wrote:

Hi Jesús!
This isn’t a conventional SEO "optimisation" problem, where one has to "do one's part" to help a search engine to find pages – in this case Google already knows all about the links that point to the description pages, it simply refuses point-blank to follow those links and index the contents, presumably on the basis that the apparent file type suffix doesn’t match the content, which is a classic mechanism for delivering malware.

Another interpretation is that since obfuscated links can be used for security purposes, perhaps Google shouldn’t be trying to index files that people are trying to conceal.

Anyhow, whatever the reason - it seems that no matter how many times you tell Google that your MW description pages exist using sitemaps, links, etc, it won’t look at them.

And unfortunately, since the MW designers decided to hide all of a wiki’s high-res images behind the description pages (instead of allowing thumbnail code blocks to link to both the description page AND the source image), by default, your source files won’t show up on Google either.

You can override this behaviour for manually-placed images using the LINK= workaround (which means training all your wiki’s contributors to add an extra editing stage every time that they add a link), but that fix isn’t available for category thumbnails.

Yes, bypassing MediaWiki and creating your own external sitemap or file-list page will tell Google that your high-resolution files exist, but when those images then turn up on an image search, and the user clicks on Google's button to visit the page where the image is used, then there either won’t be a page to go to (if you used the sitemap method, because no crawled page actually links to the image), or there’ll be a link from Google to your "image list" page, rather than to the pages where the image is actually used. The image has no context unless there’s actually a link on the page where the thumbnails appear, and the easiest way to do this is to add the direct link to the thumbnail code.

When we write a script to generate the "image list" page with all the direct image links, we could include links to the info pages that can help human visitors that land on the page via Google, but again, Google itself won’t follow the link. We can also include an auto-generated link to the picture’s info page’s "What links here" page to provide the missing context to those humans, but once again, Google won’t read that link to create the network information, this time because the "what links here" page has its robots flags helpfully set to NOFOLLOW NOINDEX.

I suppose that one could try to recreate the "crosslink" info by writing your own custom spider, but then you’re practically writing your own image content management system in parallel to MediaWiki, just to get external search to work properly.

WIKIPEDIA:
So how does Wikipedia manage it?
It doesn’t. Wikipedia’s standard info pages are ignored by Google too. Looking at the WP page for "Fish", and image searching for the "Giant Grouper" fish file, Google will report copies of the full-size image (and the info page preview) on multiple external sites, but it won’t give you a link to any of the ones that are lodged on Wikipedia (other than their thumbnails). If we then Google a chunk of text from the standard info page, Google doesn’t seem to have that page in its indexes.
http://en.wikipedia.org/wiki/File:Georgia_Aquarium_-_Giant_Grouper_edit.jpg isn’t on Google. What _is_ showing up on Google (for some reason) is the text content for what I’m guessing is the mobile version of the page, at https://en.m.wikipedia.org/wiki/File:Georgia_Aquarium_-_Giant_Grouper_edit.jpg

What Wikipedia _have_ done that seems to be successful in increasing their ranking for embedded images is to use the "srcset" attribute to embed direct links to larger versions of the image ($wgResponsiveImages), which Google can presumably read and follow. Those are intended to be higher-res versions of the thumbnails for use with retina displays, etc, but it’s a way to at least create a logical link between an image’s use and a version that’s twice as large as the default embedded thumbnail. It’s just a bit of a shame if your thumbnailing engine strips out all the original image’s metadata. And apparently the feature doesn’t yet work for category images.
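To make the srcset mechanism concrete, here is a small sketch of the kind of markup involved. It is an approximation, not MediaWiki's actual code: the thumbnail URL layout, the 1.5x and 2x densities, and both function names are assumptions made for illustration.

```python
from html import escape


def thumb_url(upload_base: str, hash_path: str, name: str, width: int) -> str:
    """Thumbnail URL in the assumed layout <base>/thumb/<hash>/<name>/<width>px-<name>."""
    return f"{upload_base}/thumb/{hash_path}/{name}/{width}px-{name}"


def responsive_img(upload_base: str, hash_path: str, name: str, width: int,
                   densities=(1.5, 2)) -> str:
    """Build an <img> tag whose srcset points at larger renderings of the thumbnail."""
    src = thumb_url(upload_base, hash_path, name, width)
    srcset = ", ".join(
        f"{thumb_url(upload_base, hash_path, name, round(width * d))} {d:g}x"
        for d in densities
    )
    return f'<img src="{escape(src)}" srcset="{escape(srcset)}" width="{width}">'


if __name__ == "__main__":
    print(responsive_img("/images", "a/ab", "Fish.jpg", 220))
```

Note that every candidate in the srcset is still a scaled thumbnail, so none of these links leads to the original file or its embedded metadata.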

So, currently, MediaWiki’s image handling, as far as search engine visibility is concerned, seems to be badly broken. Assuming that getting search engines to read the dedicated info page is a lost cause, we can still easily fix the main problem by giving site admins the option of globally overriding thumbnail behaviour so that there’s some sort of direct link from the thumbnail image block to the full file. It doesn’t have to be an obtrusive link, and it doesn’t have to be the main link that’s used when the thumbnail is clicked, but the information needs to be associated with that thumbnail region somehow, even if it’s only as an additional attribute value.

MediaWiki is a great system for storing and making accessible text content, but for most organisations interested in setting up a publishing system for image files, if that system doesn’t make the files and metadata visible to Google, then it’s not yet a working system. If you’re a museum or art gallery hoping to put your image libraries online to increase your organisation’s visibility, then a default Google ranking of zero is not acceptable, and until there’s a better workaround, MW can’t be recommended to those organisations (unless they have a damned good PHP hacker on their staff). I think that this is probably one of the reasons why so many of these organisations are still ignoring MW and using outside services like Flickr – it’s because, for their purposes, the image section of MW isn’t yet compatible with the main tool that people use for finding images online.

Needs fixing. Seriously

Reopening. I think you have some good points about that, and at least someone may take a look into what's wrong with search engines (all of them or just Google?) and whether there's something fixable on our side.

According to this blog from a year ago, URLs have something to do with this problem [1]


[1] http://wirthi.blogspot.com.es/2012/08/mediawiki-why-are-my-image-description.html

I agree that this issue is at least problematic enough that we should actively look into it.

This issue has been brought up again in support desk: Images only indexed as thumbnails by search engines

Well, something needs to be done here, so I'll start proposing an idea at least. An RFC may be needed

  • Default link for embedded images should be the original version (high res)
  • Use the [[ http://www.w3.org/TR/html-longdesc/ | longdesc ]] attribute to point to the file description page
  • With JavaScript, place a small icon over the image (only visible when hovering over the image), so that clicking on the icon opens the file description page. Clicking on the image will open the original image.
  • Next to the image add a link (hidden by default, only visible for text browsers) to the file description page. On images embedded using frame or thumb options it won't be rendered (maybe add the normal link on frame same as on thumb, on the container box)

Our wiki has a significant amount of images and a decent amount of organic traffic, but the original resolution images and file description pages are almost never indexed.

Take a popular png file on our wiki for example:

  • Google indicates that the URL is not on Google, even though it is linked on both the article page and the sitemap that we submitted:

image.png (1×2 px, 669 KB)

  • When we do a live test on the same URL, Google indicates that it can be indexed:

image.png (1×2 px, 587 KB)

  • Curiously, only webp file description pages are indexed properly with the right image metadata:

image.png (1×1 px, 757 KB)

That leads me to believe that Google is not indexing file description pages because those URLs are ending with certain file formats (e.g. png, jpg, etc.), as mentioned above. I have also compared the search results of a few other independent wikis, and it seems that this only affects file description pages with Short URL enabled. URLs with index.php can be indexed properly by Google Images.

alistair3149 renamed this task from MediaWiki images not being indexed properly by external search engines to MediaWiki images and image pages are not being indexed properly by external search engines. Nov 7 2023, 7:06 AM

> This issue has been brought up again in support desk: Images only indexed as thumbnails by search engines
>
> Well, something needs to be done here, so I'll start proposing an idea at least. An RFC may be needed
>
>   • Default link for embedded images should be the original version (high res)
>   • Use the [[ http://www.w3.org/TR/html-longdesc/ | longdesc ]] attribute to point to the file description page
>   • With JavaScript, place a small icon over the image (only visible when hovering over the image), so that clicking on the icon opens the file description page. Clicking on the image will open the original image.
>   • Next to the image add a link (hidden by default, only visible for text browsers) to the file description page. On images embedded using frame or thumb options it won't be rendered (maybe add the normal link on frame same as on thumb, on the container box)
>
> Please feel free to chime in :)

  1. That is what Fandom has been doing, as their original resolution images are always indexed. But that won't solve the issue where the file description page isn't being indexed. Google Images does use the structured data from the description page for its image metadata (e.g. providing author and license information in Google Images).
  2. longdesc is deprecated. There are no direct replacements for it as of 2023.
  3. It can be done with just CSS as well, though I am not sure about the a11y implications. The file description URL should be in the initial HTML so that it still gets indexed.
  4. That might work. Not sure if visibility of the link affects how search engines index it though.

> Our wiki has a significant amount of images and a decent amount of organic traffic, but the original resolution images and file description pages are almost never indexed.
> URLs with index.php can be indexed properly by Google Images.

Well in that case... you could just try adapting the image linker.php class and swap links to images from the shortened form to using the index.php form ?
If they only filter out urls pre-accessing them (which I suspect is what they are doing), then that might just be enough.
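The URL swap suggested above could look roughly like the following. It is a sketch of the rewriting step only, assuming short URLs of the form `/wiki/<title>` and a script at `/index.php`; both are parameters because wikis configure them differently, and whether the rewritten links actually change how Google treats the pages is untested.

```python
from urllib.parse import unquote, urlencode, urlsplit, urlunsplit


def to_index_php_url(short_url: str, article_prefix: str = "/wiki/",
                     script_path: str = "/index.php") -> str:
    """Rewrite /wiki/<title> into /index.php?title=<title>; leave other URLs alone."""
    parts = urlsplit(short_url)
    if not parts.path.startswith(article_prefix):
        return short_url
    title = unquote(parts.path[len(article_prefix):])
    # Keep ':' and '/' readable, since they are common in page titles.
    query = urlencode({"title": title}, safe=":/")
    return urlunsplit((parts.scheme, parts.netloc, script_path, query, ""))


if __name__ == "__main__":
    print(to_index_php_url("https://wiki.example.org/wiki/File:Fish.jpg"))
```

The rewritten URL no longer has a path ending in an image suffix, which is the property the comment above hopes will matter.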

Another option may be to add metadata to the pages that use the image. See https://developers.google.com/search/docs/appearance/structured-data/image-license-metadata

Using this structure, you can use the original URL and let the browser pick the optimal thumbnail from srcset. And also mark the link to the license page as such.

This seems to be somewhat the idea of T250317 but it seems to be stuck...

Also, Readers Web will probably block the addition of metadata because "every byte counts in mobile"
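For illustration, the image license metadata mentioned above can be expressed as a JSON-LD `ImageObject`. The sketch below uses property names from Google's image license documentation (`contentUrl`, `license`, `acquireLicensePage`, `creditText`, `creator`, `copyrightNotice`); the function name and all example values are hypothetical, and this is not something MediaWiki core emits.

```python
import json


def image_license_jsonld(content_url: str, license_url: str,
                         acquire_license_page: str, credit_text: str,
                         creator_name: str, copyright_notice: str) -> str:
    """Return a <script> element holding ImageObject metadata as JSON-LD."""
    data = {
        "@context": "https://schema.org/",
        "@type": "ImageObject",
        "contentUrl": content_url,  # URL of the original file, not a thumbnail
        "license": license_url,
        "acquireLicensePage": acquire_license_page,  # e.g. the file description page
        "creditText": credit_text,
        "creator": {"@type": "Person", "name": creator_name},
        "copyrightNotice": copyright_notice,
    }
    # Escape "</" so a value can never terminate the surrounding script element.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    print(image_license_jsonld(
        "https://wiki.example.org/images/a/ab/Fish.jpg",
        "https://creativecommons.org/licenses/by-sa/4.0/",
        "https://wiki.example.org/wiki/File:Fish.jpg",
        "Example Wiki",
        "Example Author",
        "Example Author",
    ))
```

Pointing `acquireLicensePage` at the file description page is one way to tie the original file and its description page together in the metadata of the article that uses the image.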

>> Our wiki has a significant amount of images and a decent amount of organic traffic, but the original resolution images and file description pages are almost never indexed.
>> URLs with index.php can be indexed properly by Google Images.
>
> Well in that case... you could just try adapting the image linker.php class and swap links to images from the shortened form to using the index.php form ?
> If they only filter out urls pre-accessing them (which I suspect is what they are doing), then that might just be enough.

That could be an interesting idea that should be tested. I am not sure whether Google uses the initial URL from the referred page or it uses the canonical URL after it is resolved. Unfortunately we don't have a testing environment at the moment and can't test that on production for now. But if anyone is interested, please give it a try and see if it works.

> Another option may be to add metadata to the pages that use the image. See https://developers.google.com/search/docs/appearance/structured-data/image-license-metadata
>
> Using this structure, you can use the original URL and let the browser pick the optimal thumbnail from srcset. And also mark the link to the license page as such.
>
> This seems to be somewhat the idea of T250317 but it seems to be stuck...
>
> Also, Readers Web will probably block the addition of metadata because "every byte counts in mobile"

License page is a bit complicated since metadata isn't written to JSON-LD in MediaWiki core, you need Extension:CommonMetadata and specific templates to have some of them written in JSON-LD on the file page (T323739). That solution involves many moving parts though so I am not sure that it is feasible in short to mid term.

As for Reader Web, it shouldn't affect the mobile footprint as they serve pages through MobileFrontend, which can strip the metadata out if needed.

> As for Reader Web, it shouldn't affect the mobile footprint as they serve pages through MobileFrontend, which can strip the metadata out if needed.

But incidentally Google uses "mobile first", indexing mobile pages (only?) if there's a different mobile view than desktop, which would strip all metadata...

TL;DR: Google Images is able to pick up source images with an invisible anchor tag linked to the source image URL.

We ran an experiment in which we appended the source image URL in an invisible anchor tag, after the file description HTML of an image on an article page, like this:

<figure class="mw-halign-center" typeof="mw:File/Frameless">
    <a href="/File:A1_landed_in_hangar_-_Cut.jpg" class="mw-file-description" title="">...</a>
    <a href="https://media.starcitizen.tools/7/74/A1_landed_in_hangar_-_Cut.jpg" class="mw-file-source"><!-- Image link for Crawlers --></a>
</figure>

Before the change, Google only picked up the thumbnail (and its srcset counterparts) inside the img tag, which has a low resolution (128x72px).

After the change, Google was able to pick up the source image of the thumbnail, which has a higher resolution (4208x2367px).

image.png (570×625 px, 351 KB)

Previously, many of the images were either missing from Google Images search results or appeared only as low-resolution thumbnails.
This change allowed images on the wiki to be indexed at their source resolution instead, and they now appear in Google Images significantly more often:

image.png (593×604 px, 326 KB)

With Google being the market leader in search engines, it should result in a significant improvement in SEO for MediaWiki.
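For anyone wanting to reproduce the experiment, the markup shown earlier in this comment can be generated with a helper along these lines. This is a sketch under assumptions, not the code used on the wiki: the function name is hypothetical, and the class names and comment text are copied from the HTML sample in this comment.

```python
from html import escape


def figure_with_source_link(description_href: str, source_href: str,
                            inner_html: str = "...") -> str:
    """Wrap a thumbnail in a figure that also carries an empty link to the original."""
    description = escape(description_href, quote=True)
    source = escape(source_href, quote=True)
    return (
        '<figure class="mw-halign-center" typeof="mw:File/Frameless">\n'
        f'    <a href="{description}" class="mw-file-description" title="">{inner_html}</a>\n'
        f'    <a href="{source}" class="mw-file-source"><!-- Image link for Crawlers --></a>\n'
        '</figure>'
    )


if __name__ == "__main__":
    print(figure_with_source_link(
        "/File:A1_landed_in_hangar_-_Cut.jpg",
        "https://media.starcitizen.tools/7/74/A1_landed_in_hangar_-_Cut.jpg",
    ))
```

The second anchor has no visible content, so the rendered page looks the same while the original file URL is present in the initial HTML.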