
sitemap-index doesn't include full location path
Closed, ResolvedPublic

Description

Author: rootmj_konf

Description:
Google sitemap validation reports errors.
maintenance/generateSitemap.php doesn't write the full location (absolute URL) of each sitemap into the sitemap index.


Version: 1.16.x
Severity: major
URL: http://transgender-taiwan.org/jidanni_sitemap.makefile
See Also:
http://bugs.debian.org/460831

Details

Reference
bz9675

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:38 PM
bzimport set Reference to bz9675.
bzimport added a subscriber: Unknown Object (MLST).

rootmj_konf wrote:

generateSitemap.php.patch

Fixes the bug for my site, http://perl6.cz;
see http://perl6.cz/sitemap-index-perl6.xml

attachment generateSitemap.php.patch ignored as obsolete

Is that guaranteed to be a correct path? That seems to assume that all output
files will be in the root URL directory at the wiki's $wgServer path.

rootmj_konf wrote:

There is

  --server=<server>	The protocol and server name to use in URLs, e.g.
		http://en.wikipedia.org. This is sometimes necessary because
		server name detection may fail in command line scripts.

and

$wgServer = $options['server'];

And more generally, how do we avoid breaking the links to files if files and
content are in different places?

rootmj_konf wrote:

The sitemap index and the sitemap files should be in the same directory. $wgServer
should be only the server name, without a path. Nothing like http://en.wikipedia.org/wiki,
because this breaks content links.

lcarsdata wrote:

I have tested this patch using the latest version of MediaWiki from SVN. It works in my configuration (which is very complex), but all my sitemaps are at the root of the website, so I was unable to test for the problem that Brion said might occur. When can this be added to the trunk? It would be very useful.

alxndr wrote:

I fixed this with a sed one-liner, if namespace sitemaps and index are in sitemaps/:

sed 's/>sitemap-/>http\:\/\/domain.tld\/sitemaps\/sitemap-/' $BASEDIR/sitemaps/sitemap-index.xml > $BASEDIR/sitemap.xml

Ugly but it works.

michal wrote:

I also faced this issue and created similar patch for it: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=generateSitemap.php.patch;att=1;bug=460831

Any chance it will get fixed soon?

robert wrote:

Allow using full path via specification of path to web root

This patch adds a new command line option --path allowing you to specify the path to the sitemap file relative to the system root.

e.g. --server=http://en.wikipedia.org --path=/w/

would generate

http://en.wikipedia.org/w/sitemapname....xml

It doesn't seem the cleanest way, but it is the only feasible one I could find.

attachment diff.diff ignored as obsolete

Here in 1.13 (I just changed it above too, hope that's OK),
I don't see why generateSitemap.php's --server option
only affects the sitemaps, and not the sitemap indexes.

Using

php generateSitemap.php --server=http://taizhongbus.jidanni.org --fspath=../

I get sitemap-taizhongbus-wiki_-NS_0-0.xml.gz ...
with absolute URIs, but sitemap-index-taizhongbus-wiki_.xml
with relative URIs.

So the chain robots.txt -> sitemap-index-taizhongbus-wiki_.xml ->
sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
ends up being absolute -> relative -> absolute

As we see in http://www.sitemaps.org/protocol.php at "Sample XML
Sitemap Index", they use absolute URIs and not relative URIs, so
wouldn't it be best, given the murky nature of all this, to follow
their example?

One knows that according to the rules, both robots.txt "Sitemap:" entries,
and the sitemaps themselves must contain absolute URIs, so one wonders how the
middle link in the chain can take the risk of containing relative URIs.

In http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd:
"The URI must conform to RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt)."
(but in that RFC there are also relative URIs, so who knows.)
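
To see which index entries are affected, a minimal sketch along these lines can list the relative ones. It assumes each <loc> element sits on its own line, as in the generated files; relative_locs is a hypothetical helper (not part of MediaWiki) and the sample file names are made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of MediaWiki): print every <loc> value in a
# sitemap index that does not start with http:// or https://.
relative_locs() {
    sed -n 's/.*<loc>\([^<]*\)<\/loc>.*/\1/p' "$1" | grep -Ev '^https?://'
}

# Sample index mixing one relative and one absolute entry (made-up names).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>sitemap-wiki-NS_0-0.xml.gz</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-wiki-NS_1-0.xml.gz</loc>
  </sitemap>
</sitemapindex>
EOF

relative_locs "$sample"   # prints: sitemap-wiki-NS_0-0.xml.gz
```

An index that prints nothing here already uses absolute URIs throughout.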

robert wrote:

(In reply to comment #11)

Here in 1.13 (I just changed it above too, hope that's OK),
I don't see why generateSitemap.php's --server option
only affects the sitemaps, and not the sitemap indexes.

This is because the path to the sitemaps is not known and therefore cannot reliably be placed in the sitemap index file without an additional parameter to specify this being added to the script.

e.g. if your sitemaps were located in the directory

/var/www/sitemaps/

as opposed to

/var/www/

or

/var/www/w/

MediaWiki would have no way of knowing this and your --server parameter would cause the index file to list them as

http://www.example.com/<sitemap>

or

http://www.example.com/w/<sitemap>

(depending on implementation) rather than

http://www.example.com/sitemaps/<sitemap>
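
The mapping described above can be sketched as follows. This is only an illustration under the assumption that the web server's document root is known; that is exactly the extra piece of information the script cannot detect. sitemap_url and its arguments are hypothetical names, not anything in generateSitemap.php.

```shell
#!/bin/sh
# Hypothetical helper: derive the public URL of a sitemap file from the
# server name, the document root (assumed to be known), the directory the
# file was written to, and the file name.
sitemap_url() {
    server=$1 docroot=$2 dir=$3 file=$4
    rel=${dir#"$docroot"}   # strip the document root prefix
    rel=${rel%/}            # drop a trailing slash, if any
    printf '%s%s/%s\n' "$server" "$rel" "$file"
}

sitemap_url http://www.example.com /var/www /var/www/sitemaps/ sitemap.xml
# prints: http://www.example.com/sitemaps/sitemap.xml
sitemap_url http://www.example.com /var/www /var/www/ sitemap.xml
# prints: http://www.example.com/sitemap.xml
sitemap_url http://www.example.com /var/www /var/www/w/ sitemap.xml
# prints: http://www.example.com/w/sitemap.xml
```

Without the document root argument, all three directories would be indistinguishable to the script, which is why an additional parameter is needed.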

OK, I sure hope there will be a final way to get a http:// URL in the indexes.

For now I will just use my wacko Makefile:

# Make sitemaps for my wikis, that all live in the same tree
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson http://jidanni.org/
# Created On : Thu Mar 27 04:11:10 2008
# Last Modified On: Fri Aug 22 07:39:26 2008
# Update Count : 80
# https://bugzilla.wikimedia.org/show_bug.cgi?id=9675
T=transgender-taiwan.org
R=radioscanningtw.jidanni.org
B=taizhongbus.jidanni.org
S=$T $R $B
all:$(addsuffix .SITEMAPS,$S)
%.SITEMAPS:
	cd ../$*/maintenance && \
	php generateSitemap.php --server=http://$* --fspath=../
	perl -wpi -e 'use strict; use warnings FATAL => q(all);$(\
	)s@(<loc>)(sitemap)@$$1http://$*/$$2@' `ls -t sitemap-index-*.xml|sed q`
	sleep 2

mediawiki wrote:

New here, not too sure of the form, but I want this fixed so I don't have to remember to swap in my own code *again* after the next update. Here is how I fix this problem.

The main point to note is that I sidestep the whole issue of finding out the correct path by asking the human. This thing has to be run by a sysadmin from the command line, so we are not talking monkeys here. This is documented with these lines:

+ --webpath=<dir> If you are placing the sitemap files in a sub folder
+ i.e. using the --fspath option and specify somewhere other than root
+ you need to place here the directory name e.g:
+
+ if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+ then --webpath = /mediawiki_sitemaps *Note, no trailing / needed

I hope this helps.

$ svn diff generateSitemap.php

Index: generateSitemap.php
--- generateSitemap.php (revision 38559)
+++ generateSitemap.php (working copy)
@@ -1,4 +1,4 @@
-<?php
+x<?php
 define( 'GS_MAIN', -2 );
 define( 'GS_TALK', -1 );
 /**
@@ -367,9 +367,11 @@
  * @return string
  */
 function indexEntry( $filename ) {
+	global $wgServer;
+	global $wgWebpath;
 	return
 		"\t<sitemap>\n" .
-		"\t\t<loc>$filename</loc>\n" .
+		"\t\t<loc>$wgServer$wgWebpath/$filename</loc>\n" .
 		"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
 		"\t</sitemap>\n";
 }
@@ -457,18 +459,30 @@
 	server name detection may fail in command line scripts.
 
 --compress=[yes|no]     compress the sitemap files, default yes
+
+ --webpath=<dir> If you are placing the sitemap files in a sub folder
+ i.e. using the --fspath option and specify somewhere other than root
+ you need to place here the directory name e.g:
+
+ if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+ then --webpath = /mediawiki_sitemaps *Note, no trailing / needed
+
 EOT;
 	die( -1 );
 }
-$optionsWithArgs = array( 'fspath', 'server', 'compress' );
+$optionsWithArgs = array( 'fspath', 'server', 'compress', 'webpath' );
 require_once( dirname( __FILE__ ) . '/commandLine.inc' );
 if ( isset( $options['server'] ) ) {
 	$wgServer = $options['server'];
 }
+if ( isset( $options['webpath'] ) ) {
+	$wgWebpath = $options['webpath'];
+}
+
 $gs = new GenerateSitemap( @$options['fspath'], @$options['compress'] !== 'no' );
 $gs->main();

mediawiki wrote:

Proposed replacement for generateSitemap.php

Attached:

Here is a very short patch for this problem:

--- generateSitemap.php 2008-11-03 11:37:53.000000000 +0100
+++ /srv/www/htdocs/mw/esl/maintenance/generateSitemap.php 2008-11-03 11:40:40.000000000 +0100
@@ -392,9 +392,13 @@
  * @return string
  */
 function indexEntry( $filename ) {
+	global $wgServer;
+	$title = Title::makeTitle( '', '' );
+	$location = $wgServer . $title->getLocalUrl() . $filename;
+
 	return
 		"\t<sitemap>\n" .
-		"\t\t<loc>$filename</loc>\n" .
+		"\t\t<loc>$location</loc>\n" .
 		"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
 		"\t</sitemap>\n";
 }

dasch wrote:

*** Bug 14397 has been marked as a duplicate of this bug. ***

zrhwiki wrote:

Any news on this issue?

I'll stick the current workaround Makefile I'm "forced to use" in the URL box above.

Also noting inconsistent use of --server in maintenance scripts, just in case one day somebody
wants to unify them:
maintenance/generateSitemap.php:481: --server=<server> The protocol and server name to use in URLs, e.g.
maintenance/dumpBackup.php:87: --server=h Force reading from MySQL server h
maintenance/dumpTextPass.php:518: --server=h Force reading from MySQL server h

zrhwiki wrote:

The problem seems pretty easy to resolve, i.e. just add the full path (as most MediaWiki users have to do themselves if they want to use sitemaps with Google without an error message). What's causing the apparent delay in resolving this bug?

Fixed in r77176. I took a KISS approach to fixing this issue and simply added a new --urlpath parameter, which can be used to specify the URL path corresponding to --fspath. For most installations, this will probably equal or begin with the server name, but I figured the minor redundancy should be worth the flexibility and simplicity.
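
As a rough illustration of the idea (not the code from r77176), an index entry can be produced by prefixing each file name with a caller-supplied URL path. index_entry is a hypothetical function; the file name and timestamp are made up, and stripping the trailing slash is an assumption about how the value would be passed.

```shell
#!/bin/sh
# Hypothetical illustration, not the actual MediaWiki implementation:
# emit one sitemap index entry whose <loc> is the URL path plus file name.
index_entry() {
    urlpath=$1 file=$2 lastmod=$3
    # ${urlpath%/} drops one trailing slash so the join never doubles it.
    printf '\t<sitemap>\n\t\t<loc>%s/%s</loc>\n\t\t<lastmod>%s</lastmod>\n\t</sitemap>\n' \
        "${urlpath%/}" "$file" "$lastmod"
}

# Made-up example values.
index_entry http://www.example.com/sitemaps/ sitemap-wiki-NS_0-0.xml.gz 2010-11-23T00:00:00Z
```

Because the caller supplies the whole prefix, the script needs no knowledge of how the filesystem directory maps to a URL.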