
Respect $wgNoFollowLinks and $wgNoFollowDomainExceptions
Open, Medium, Public

Description

Parsoid doesn't add rel='nofollow' on links. Someday maybe it should, if Parsoid HTML is going to be crawled directly.

If we add this, we should also read the $wgNoFollowLinks and $wgNoFollowDomainExceptions configuration settings from MediaWiki and honor them. There are already parser tests for these.


Version: unspecified
Severity: normal

Details

Reference
bz52617

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:58 AM
bzimport added a project: Parsoid-DOM.
bzimport set Reference to bz52617.

(See also the discussion on https://gerrit.wikimedia.org/r/77984 for this issue on [[Image:...|link=...]] links.)

We can always add nofollow for mw:ExtLink links, but should check with the VE folks whether this is handled properly. Compression should keep the additional overhead minimal.
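Since rel is a space-separated token list in HTML, nofollow can sit alongside the existing mw:ExtLink value rather than replacing it. A minimal sketch of that idea follows; the helper name `add_rel_token` is hypothetical and this is not Parsoid code.

```python
def add_rel_token(rel, token):
    """Return the rel value with `token` appended, unless already present.

    Hypothetical helper for illustration. `rel` may be None or an empty
    string when the link has no rel attribute yet.
    """
    tokens = (rel or "").split()
    # Link-type keywords are compared case-insensitively.
    if token.lower() not in (t.lower() for t in tokens):
        tokens.append(token)
    return " ".join(tokens)


print(add_rel_token("mw:ExtLink", "nofollow"))
```

Calling it a second time on the result leaves the value unchanged, so the step is safe to repeat during a round trip.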

In practice nofollow won't matter to anybody until our HTML is used for regular page views (hence the low priority). The Google KG team will crawl our HTML, but uses a custom pipeline in any case.

Yesterday here at Wikimania Yong-Gang Wang of Google mentioned that their general crawling pipeline has a rule that disregards rel="nofollow" on all MediaWiki-powered sites. I would not be surprised if other engines had similar rules.

This suggests that adding rel="nofollow" has become largely pointless in MediaWiki. Blame all those high-quality external links that are hard to pass up for search engines.

Lowering priority to 'lowest' to reflect this.

That explains all the link spam I get on my small mediawikis. :(

Since the spam-prevention effect is unlikely to still be significant, we should instead deprecate $wgNoFollowLinks and $wgNoFollowDomainExceptions in core. Repurposing this bug to track that.

What's the argument for this bug? rel=nofollow still serves a purpose, and helps to deter spam. I would imagine it'd be quite useful depending on the circumstances.

(In reply to comment #6)

> What's the argument for this bug?

It seems its scope was reversed by comment 5.

> rel=nofollow still serves a purpose, and
> helps to deter spam. I would imagine it'd be quite useful depending on the
> circumstances.

Indeed; unless it was generally deprecated in standards and in all real world uses. As for the specific case of Google, we're waiting for clarifications: http://lists.wikimedia.org/pipermail/wikitech-l/2013-November/073175.html

(In reply to comment #6)

> What's the argument for this bug? rel=nofollow still serves a purpose, and
> helps to deter spam. I would imagine it'd be quite useful depending on the
> circumstances.

Read comment 3, which claims that (a) no it doesn't, (b) no it doesn't, and (c) no it isn't. :-)

I have now received an answer from my contact at Google:

> Google will not follow rel=nofollow links, and not flow pagerank
> through them.  That includes Wiki{m,p}edia sources.

So the information I got at Wikimania was either not correct or the
result of a misunderstanding on my part. Another possibility is that
this detail of how pagerank works is considered too sensitive for
publication.

It should not be too hard to verify this independently by setting up a
fresh page with an unguessable URL and linking it from a wiki page with
rel=nofollow. If googlebot visits that page (or it turns up in search
results), then rel=nofollow was ignored.
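A minimal sketch of preparing such a probe, assuming a host you control. The base URL `https://probe.example.org/` and the helper name `make_probe` are placeholders, not anything used in this task.

```python
import secrets
from html import escape


def make_probe(base_url):
    """Build an unguessable probe URL and a rel=nofollow link to it.

    Hypothetical helper: returns (token, url, link_html).
    """
    # 32 random bytes, base64url-encoded; infeasible to guess or enumerate.
    token = secrets.token_urlsafe(32)
    url = f"{base_url.rstrip('/')}/{token}.html"
    link = f'<a rel="nofollow" href="{escape(url, quote=True)}">probe</a>'
    return token, url, link


token, url, link = make_probe("https://probe.example.org/")
print(link)
```

The token doubles as the string to search for later in the access log of the probe host.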

Moving this bug back to Parsoid for further investigation.

(In reply to comment #9)

> It should not be too hard to verify this independently by setting up a
> fresh page with an unguessable URL and linking it from a wiki page with
> rel=nofollow. If googlebot visits that page (or it turns up in search
> results), then rel=nofollow was ignored.

But you should also ensure the link is not included anywhere else and that nobody accesses it faking their user-agent (i.e. you need to check it's a Google IP and hope it's not a Google employee trying to deceive you ;) ).
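One way to do the IP check is a reverse DNS lookup followed by a confirming forward lookup, which is the approach Google documents for verifying its crawlers. The sketch below assumes IPv4 and treats only the `googlebot.com` and `google.com` suffixes as valid; the function name and the injectable `reverse`/`forward` resolvers are illustrative choices so the logic can be exercised without network access.

```python
import socket

# Assumption: only these reverse-DNS suffixes are accepted as Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_crawler_ip(ip, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to a Google host that resolves back to `ip`."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    # gethostbyname_ex is IPv4-only; element [2] is the list of addresses.
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup guards against spoofed PTR records.
        return ip in forward(host)
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

A spoofed user-agent fails the first check, and a forged PTR record fails the second. Neither check rules out the "Google employee" case mentioned above.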

Production doesn't add nofollow for links to these domains:

'wgNoFollowDomainExceptions' => array(
	'default' => array(
		# Original list 20111110 - bug 32309
		'mediawiki.org',
		'wikibooks.org',
		'wikimediafoundation.org',
		'wikimedia.org',
		'wikinews.org',
		'wikipedia.org',
		'wikiquote.org',
		'wikisource.org',
		'wikiversity.org',
		'wiktionary.org',
		'wikivoyage.org',
		'wikidata.org',
		'tools.wmflabs.org',
		'etherpad.wmflabs.org',
	),
),

This is done in Parser::getExternalLinkRel.
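A rough sketch of the decision Parsoid would need to reproduce, written from the two settings named in this task rather than copied from core. The function names and the exact suffix-matching rule are assumptions; core may consult further settings that are not modeled here.

```python
from urllib.parse import urlsplit


def matches_domain_list(url, domains):
    """True if the URL's host equals a listed domain or is a subdomain of one."""
    host = urlsplit(url).hostname
    if not host:
        return False
    # Prefixing a dot makes the suffix test respect label boundaries,
    # so "notwikipedia.org" does not match "wikipedia.org".
    dotted = "." + host.lower()
    return any(dotted.endswith("." + d.lower()) for d in domains)


def external_link_rel(url, no_follow_links=True, domain_exceptions=()):
    """Return "nofollow" or None, mirroring $wgNoFollowLinks and
    $wgNoFollowDomainExceptions (assumed semantics)."""
    if no_follow_links and not matches_domain_list(url, domain_exceptions):
        return "nofollow"
    return None


exceptions = ["wikipedia.org", "tools.wmflabs.org"]
print(external_link_rel("https://en.wikipedia.org/wiki/Foo", True, exceptions))
print(external_link_rel("https://example.com/", True, exceptions))
```

Matching on a dot-prefixed suffix keeps the exception list short: one entry such as `wikipedia.org` covers every language subdomain.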

Flow may choose to do this itself in its Parsoid content fixing, so I filed bug 66289.

If this blocks bug 66289, it's not unconfirmed.

(In reply to Nemo from comment #13)

> If this blocks bug 66289, it's not unconfirmed.

The unconfirmed bit referred to whether rel=nofollow still works or not. We still haven't tested this.

(In reply to Gabriel Wicke from comment #14)

> The unconfirmed bit referred to whether rel=nofollow still works or not. We
> still haven't tested this.

Until bug 66289 is marked as a valid bug, Parsoid should support that expectation.
Flow probably wants to do what core does though.

marcoil added a project: Parsoid.
marcoil set Security to None.