
Implement efficient way to select random page from specified category on Wikimedia wikis
Closed, Resolved (Public)

Description

Author: jlatta6

Description:
Hi,
I was hoping that there would be a way to implement "Categories" into the Random Page function. Often I would like to be able to be able to stumble around Wikipedia using the Random Page link, but focusing on a certain subject, such as Computer Science, or Technology, etc. Using the Random Page function of Wikipedia is fun, but being able to focus randomly on a category will help me learn about information I never knew to search for in the first place.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46918
https://bugzilla.wikimedia.org/show_bug.cgi?id=61840
https://bugzilla.wikimedia.org/show_bug.cgi?id=5589

Details

Reference
bz25931

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:21 PM
bzimport set Reference to bz25931.

That's bug 2170. I'm not sure if this bug should be duped to that, as your request is more to make it work on Wikipedia, and the result of bug 2170 was an extension that is not currently (and not likely to be in the future) enabled on Wikimedia.

We have to review and then enable the extension
http://www.mediawiki.org/wiki/Extension:RandomInCategory

Rephrased title and changed component

(In reply to comment #2)

We have to review and then enable the extension
http://www.mediawiki.org/wiki/Extension:RandomInCategory

Rephrased title and changed component

The comments on bug 2170 seem to indicate the extension is not efficient enough, at least for enwikipedia.

mysql> describe select page_title, page_namespace FROM page JOIN categorylinks ON (page_id=cl_from) AND cl_to="Test" AND page_random >= 0.2265\G
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
| id | select_type | table         | type   | possible_keys                   | key          | key_len | ref                          | rows | Extra                    |
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
|  1 | SIMPLE      | categorylinks | ref    | cl_from,cl_timestamp,cl_sortkey | cl_timestamp | 257     | const                        |    3 | Using where; Using index |
|  1 | SIMPLE      | page          | eq_ref | PRIMARY,page_random             | PRIMARY      | 4       | enwiki.categorylinks.cl_from |    1 | Using where              |
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
2 rows in set (0.00 sec)

That's against enwiki. It doesn't seem '''that''' bad...

The worst case scenario I could think of was:
mysql> select page_title, page_namespace FROM page JOIN categorylinks ON (page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 limit 1;
1 row in set (0.15 sec)

That's not very fast, but it's much faster than I feared it would be.

(In reply to comment #5)

The worst case scenario I could think of was:
mysql> select page_title, page_namespace FROM page JOIN categorylinks ON
(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 limit 1;
1 row in set (0.15 sec)

That's not very fast, but it's much faster than I feared it would be.

The worst case is pretty realistic here, since there's not much point in picking a random article from a small category.

I'd like to hear what Domas thinks about it. What would happen if it was linked to from every category page?

(In reply to comment #6)

The worst case is pretty realistic here, since there's not much point in
picking a random article from a small category.

Note that my worst case also includes a very high page_random value. Even setting it to >= 0.99 makes the query run in 0.00 seconds. With 0.995 it took 0.10 seconds. So roughly speaking, this query may take up to 100-150 ms, but only if run on a large category, and only in 1% of those cases. In other cases it seems to run in <10 ms.

this random isn't really random

I mean this random isn't anywhere close to any idea of 'random'

He means that because there is no "ORDER BY page_random", the query just fetches the first page from the category that satisfies the page_random condition. So for page_random >= 0.99, it will scan 100 pages on average, even if the category is 100k pages. Comment #4 shows that it uses the cl_timestamp index, so it's very likely to return a page from the first 10 or so pages that were added to the category.

Yes, that's what I mean, thanks Tim!
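Tim's explanation can be made concrete with a small pure-Python simulation (all names hypothetical, not MediaWiki code): scanning in index order and returning the first row that clears the threshold collapses the "random" picks onto a handful of pages, no matter how large the category is.

```python
import random

# Simulate the bias: without "ORDER BY page_random", the query returns
# the FIRST row (in cl_timestamp index order) whose page_random clears
# the threshold, not a uniformly random row.
random.seed(42)

# 100k pages in the order they were added to the category, each with a
# precomputed page_random value.
pages = [(page_id, random.random()) for page_id in range(100_000)]

def first_match(threshold):
    # What the unordered query effectively does: scan in index order
    # and stop at the first satisfying row.
    for page_id, page_random in pages:
        if page_random >= threshold:
            return page_id
    return None  # threshold above every page_random in the category

# 1000 "random" picks yield only a handful of distinct pages.
picks = {first_match(random.random()) for _ in range(1000)}
print(len(picks))
```

Adding `ORDER BY page_random` restores uniformity, but forces the range scan that the rest of this discussion is about.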

Hmm. Using the Toolserver's enwiki_p:

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.34 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.00 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.00 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.989 ORDER BY page_random ASC limit 1;
+---------------+----------------+
| page_title    | page_namespace |
+---------------+----------------+
| Pavol_Baláž   |              0 |
+---------------+----------------+
1 row in set (0.19 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.19 ORDER BY page_random ASC limit 1;
+---------------+----------------+
| page_title    | page_namespace |
+---------------+----------------+
| Anthony_Tupou |              0 |
+---------------+----------------+
1 row in set (29.98 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.28 ORDER BY page_random ASC limit 1;
+--------------+----------------+
| page_title   | page_namespace |
+--------------+----------------+
| Chris_Lehane |              0 |
+--------------+----------------+
1 row in set (3.80 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.432901 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Civard_Sprockel |              0 |
+-----------------+----------------+
1 row in set (2.96 sec)

This seems unacceptably slow. I think it'd be fairly trivial to disable this for categories with greater than X members until a better solution is implemented, however.

"better solution" needs more disk space, memory and IOPS, even if we have a decent index for that - is it worth it for such a feature, or is this just another "would be nice" eyecandy?

(In reply to comment #13)

"better solution" needs more disk space, memory and IOPS, even if we have a decent index for that - is it worth it for such a feature, or is this just another "would be nice" eyecandy?

I think it's the latter - "would be nice" eyecandy.

Unless the initial requester can tell us otherwise.

Bug 29373 has been marked as a duplicate of this bug.

steve.sperandeo wrote:

While I'm not the original requester, I did post a duplicate bug (Bug 29373).

I can assure you such a feature wouldn't be just eyecandy or a "would be nice" feature. It would be a serious learning tool.

Most people use wikipedia as a reference tool, linked from google's results. However, there are cases when you don't want to use it as a book, but a study guide. For example, when someone is new to a field, they'll want to immerse themselves in the subject and learn as much vocabulary about the subject as possible. Even people who have been out of the subject for a while would benefit from brushing up from time to time.

I personally use the Random Article feature every day. I have a link on my bookmarks toolbar in chrome that I click to learn about random things to spread my breadth of knowledge. However, most of the articles that I land on are biographies of athletes, which is totally useless to me.

I've been a computer scientist for about 8 years now. And I can say with certainty that if I had a "Random Computer Science Article" button on my toolbar, I'd use it every day too. Just look at how huge the subject is: http://en.wikipedia.org/wiki/Computer_science

I'm sure other people would like to immerse themselves in a subject, like Biology, Finance or some other large field.

I think that was the reasoning for this request.

Hope that helps!

PS. Thanks for working on wikimedia and wikipedia. It's really appreciated by so many people. Cheers!

If we had a way to restrict the random pages to a specific category, the feature could be used to get a random chapter of a book from Wikibooks, such as a random recipe from [[b:Cookbook]], or a random animal from [[Wikijunior:Animal Alphabet]] or a random sonnet from [[s:Category:Sonnets]].

I marked this as a blocker for bug 25931.

jlatta6 wrote:

Steve et al.,

I am now using StumbleUpon to do this. It works pretty well, except you can't just search a category off the top of your head; StumbleUpon will "stumble" through your pre-defined interests on Wikipedia. Aside from that it works OK. Below is the link on how to do it.

http://getsatisfaction.com/stumbleupon/topics/why_cant_i_specifically_stumble_wikipedia_with_my_chrome_plugin


sumanah wrote:

Victor, could you put this extension into https://www.mediawiki.org/wiki/Git/Conversion/Extensions_queue so it can be moved to Git, which is a prerequisite for deployment on Wikimedia Foundation sites? Thanks.

sumanah wrote:

Asher Feldman has agreed to do a database administration review of this extension.

sumanah wrote:

Some discussion in IRC just now (MediaWiki-General):

<sumanah> "Waiting for database administration review by Asher Feldman, waiting for author Victor Vasiliev to move extension to Git." bug https://bugzilla.wikimedia.org/show_bug.cgi?id=25931 in https://www.mediawiki.org/wiki/Review_queue#Extensions
<vvv> Well

<vvv> I think it would fail the first stage
<vvv> Because it uses an unindexed query IIRC

<RoanKattouw> It's also not really indexable
<RoanKattouw> Unless you add a cl_random field to categorylinks
<vvv> Yes, this is the problem we found back in 2007
<RoanKattouw> As written it'd have to fetch all categorylinks rows for a given category, join them against page, then do a range scan on page_random

<RoanKattouw> The range scan may or may not be indexed, I couldn't say offhand, but joining an entire category against the page table is a problem when you consider stuff like [[Category:Living people]]
<vvv> Well, it was not even me who split it as an extension
<RoanKattouw> (I'm assuming the query is something like SELECT stuff FROM categorylinks, page WHERE cl_to='Category_name' AND page_id=cl_from AND page_random > 0.123 ORDER BY page_random LIMIT 1; )
<vvv> I believe someone made it after the original version was reverted
<vvv> And cl_random would sound something you would want to have in core
<RoanKattouw> That would probably want to be in core yeah
<RoanKattouw> But adding cl_random is an expensive operation

<vvv> RoanKattouw: I remember nobody wanted to do it because switching masters was done manually back then
<RoanKattouw> Still, switching masters is not what people do for fun
<RoanKattouw> To me it just seems like a lot of effort for a minor feature

<RoanKattouw> OTOH on smaller wikis it might work, but when Sumana and I talked to Asher about this last night he said he didn't want to assume that small wikis will stay small forever
<RoanKattouw> It would be particularly ironic if we enabled it on some Indic language wiki because it's small, while 3 floors above me there's people whose job it is to try and get that wiki to grow

<binasher> vvv: RoanKattouw: if the query is like what roan mentioned above, -1. this sort of thing should be done with a search engine and is probably even doable with the one we have

(In reply to comment #21)

<binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
-1. this sort of thing should be done with a search engine and is probably
even doable with the one we have

What do you mean? Which search engines return random pages?

(In reply to comment #22)

(In reply to comment #21)

<binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
-1. this sort of thing should be done with a search engine and is probably
even doable with the one we have

What do you mean? Which search engines return random pages?

Search engines have pregenerated document lists stored in an efficient format for various criteria. Usually the presence or absence of a given keyword is the criterion of interest, but membership in a category can be handled in the same way. Since the list is pregenerated, the length is known, so you can choose a random offset into the category and perhaps even skip to that offset efficiently. Asher probably means that if Lucene doesn't have such a feature already, it could be patched in.
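The approach Tim describes can be sketched in a few lines of illustrative Python, with a plain list standing in for the pregenerated posting list (the category contents here are made up):

```python
import random

# A pregenerated per-category document list, as a search index would
# store it. Its length is known up front, so a uniformly random member
# can be fetched by offset instead of by scanning.
posting_list = ["Fancy_pigeon", "Genomics_of_domestication", "Dingo", "Llama"]

def random_member(docs):
    offset = random.randrange(len(docs))  # uniform offset into the list
    return docs[offset]                   # skip straight to that offset

print(random_member(posting_list))
```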

For those wondering, this bug is not a duplicate of bug 2170. Bug 2170 is about having the feature generally available in MediaWiki (which was implemented as the "RandomInCategory" extension). This bug is about having the feature available on exceptional MediaWiki installations, namely those that run Wikimedia wikis.

afeldman wrote:

(In reply to comment #23)

(In reply to comment #22)

(In reply to comment #21)

<binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
-1. this sort of thing should be done with a search engine and is probably
even doable with the one we have

What do you mean? Which search engines return random pages?

Search engines have pregenerated document lists stored in an efficient format
for various criteria. Usually the presence or absence of a given keyword is the
criterion of interest, but membership in a category can be handled in the same
way. Since the list is pregenerated, the length is known, so you can choose a
random offset into the category and perhaps even skip to that offset
efficiently. Asher probably means that if Lucene doesn't have such a feature
already, it could be patched in.

Indeed, the method Tim outlines would let you grab a random result from any search engine that supports pagination.

You can also get randomized output directly from a search engine given control over sorting, which would normally be in descending order on an IR score. Solr has a random result module and it's implementable in Lucene, including version 2 which we run in production.

See the section "Bonus! For those of you trapped in Lucene 2" at the bottom of:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order

afeldman wrote:

How to implement a category based random feature in wikipedia without touching the database:

  1. Send lucene an incategory query with a limit of 1 to cheaply get the total number of articles indexed in the given category. Stuff this in memcache with a reasonable TTL (a couple of hours?) and try to grab it from there next time so lucene is only called once.
  2. Send the same category query with an offset of rand(0, $doc_count - 1).
  3. Redirect to the article returned in step 2.
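In outline, the three steps might look like this (Python sketch; `search_count`, `search_offset`, and the plain dict cache are hypothetical stand-ins for the lsearchd queries and memcache):

```python
import random

# Step-by-step sketch of the scheme above. A real implementation would
# set a TTL on the cached count; the dict here never expires.
cache = {}

def random_article(category, search_count, search_offset):
    # Step 1: get (and cache) the number of indexed articles, so the
    # search backend is only asked for the count once per TTL window.
    if category not in cache:
        cache[category] = search_count(category)  # incategory query, limit=1
    total = cache[category]
    # Step 2: fetch the article at a uniformly random offset,
    # rand(0, $doc_count - 1).
    offset = random.randrange(total)
    return search_offset(category, offset)
    # Step 3, redirecting the reader to the returned article, is left
    # to the caller.
```

A caller would wire `search_count` and `search_offset` to HTTP queries like the curl examples below.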

Command line example to get a random "Domesticated animals" article.

Step one - the very first item in the response is the match count (36 in this case):

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Domesticated%20animals%22?limit=1'
36
#info search=[search1001,search1001], highlight=[search1004] in 4 ms
#no suggestion
#interwiki 0 0
#results 1
12.586081 0 Genomics_of_domestication
#h.text [] [] [+] date+November+2011
#h.text [] [] [] Genomics+is+the+study+of+the+structure%2C+content%2C+and+evolution++of+genomes+%2C+or+the+entire+genetic+information+of+
#h.date 2012-04-04T12:25:13Z
#h.wordcount 2252
#h.size 15955

Step two - pick a random number between 0 and 35. Let's go with 16.

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Domesticated%20animals%22?offset=16&limit=1'
36
#info search=[search1001,search1001], highlight=[search1005] in 3 ms
#no suggestion
#interwiki 0 0
#results 1
6.2930403 0 Fancy_pigeon
#h.text [] [] [+] Fancy+pigeons+are+domesticated++varieties+of+the+Rock+Pigeon++%28Columba+livia%29.+
#h.text [] [] [] They+are+bred+by+pigeon+fanciers++for+various+traits+
#h.date 2012-02-21T12:23:54Z
#h.wordcount 903
#h.size 7354

Hi, Fancy Pigeon!

afeldman wrote:

Our current build of lsearchd won't go deeper than an offset of 100000 (SearchEngine.java:protected static int maxoffset = 100000;) so for categories like Living People, we wouldn't be able to provide random results over the full set, just the first 100k as they appear in the index, which appears to be ordered on create time.

Actually getting the 100kth result (upper latency bound) takes ~280ms

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Living%20people%22?limit=1&offset=99999&searchall=0'
567274
#info search=[search1001,search1001], highlight=[search1005] in 283 ms
#no suggestion
#interwiki 0 0
#results 1
1.4743276 0 Boris_Boillon

If you ditch the join and take the same approach with mysql, it's several times faster than lucene:

mysql> select cl_from from categorylinks where cl_to='Living_people' limit 1 offset 99999;
+----------+
| cl_from  |
+----------+
| 13546433 |
+----------+
1 row in set (0.06 sec)

The worst case for Living_people isn't great (~350ms), but still faster than lucene would be if we upped lsearchd's max offset:

mysql> select cl_from from categorylinks where cl_to='Living_people' limit 1 offset 560000;
+----------+
| cl_from  |
+----------+
| 27345638 |
+----------+
1 row in set (0.35 sec)

We need more features that scan full datasets to return a single row.

afeldman wrote:

Domas is of course right. Adding a precomputed cl_random column+index is needed to make this feature acceptable via mysql. Doing so incurs a permanent cost. Alternatively, we could add the existing page_random field to the lucene index and make it searchable to eliminate offset scanning there. The latter may be cheaper.
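For comparison, the precomputed cl_random idea could look roughly like this (illustrative sketch using Python's sqlite3 as a stand-in for MySQL; the column name, index, and wrap-around handling are assumptions, not the actual schema change):

```python
import random
import sqlite3

# Each categorylinks row carries its own precomputed random value,
# indexed together with cl_to, so a single index range scan returns a
# uniform pick with no offset scanning.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_random REAL)")
db.execute("CREATE INDEX cl_to_random ON categorylinks (cl_to, cl_random)")
db.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [(page_id, "Living_people", random.random()) for page_id in range(1000)],
)

def random_in_category(cat):
    r = random.random()
    # If no row sorts above r, wrap around and retry from 0.
    for bound in (r, 0.0):
        row = db.execute(
            "SELECT cl_from FROM categorylinks"
            " WHERE cl_to = ? AND cl_random >= ?"
            " ORDER BY cl_random LIMIT 1",
            (cat, bound),
        ).fetchone()
        if row is not None:
            return row[0]
    return None  # empty category

print(random_in_category("Living_people"))
```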

  1. Send lucene an incategory query with a limit of 1 to cheaply get the total number of articles indexed in the given category. Stuff this in memcache with a reasonable ttl (couple hours?) and try to grab there next time so lucene is only called once.

I was under the impression that lucene's incategory only worked for categories directly listed on a page (i.e. not inherited from a template). That would be a major negative point for using a lucene-based solution (unless that issue could be fixed).

(In reply to comment #30)

  1. Send lucene an incategory query with a limit of 1 to cheaply get the total number of articles indexed in the given category. Stuff this in memcache with a reasonable ttl (couple hours?) and try to grab there next time so lucene is only called once.

I was under the impression that lucene's incategory only worked for categories directly listed on a page (i.e. not inherited from a template). That would be a major negative point for using a lucene-based solution (unless that issue could be fixed).

True, but tangential. The relevant bug is bug 18861.

afeldman wrote:

There's no reason why the indexer couldn't pull in categorylinks instead of whatever it's doing now (parsing wikitext?), but we are currently short on resources when it comes to developing around lucene. An upgraded search infrastructure with real-time indexing and greater accessibility around index definitions could open the door to all sorts of features that aren't currently practical in MediaWiki at Wikipedia scale.


sumanah wrote:

Asher says that the easiest path to implementing this in a way that performs suitably is the precomputed cl_random column+index solution mentioned in comment 29 -- though it has a real cost in terms of hardware utilization. Assigning to Victor to see whether he would like to follow up on this.

Update: The e3 team implemented a limited version of this in https://gerrit.wikimedia.org/r/#/c/52468/ and https://gerrit.wikimedia.org/r/#/c/51881/ - it only makes it available for a pre-configured small set of categories.

Most of the code required for this is written, it just needs extracting out to its own extension (RedisRandomCategory?), and perhaps an API module. And there shouldn't be *too* many performance issues in getting this on the cluster, considering that it is already deployed (albeit in a limited way) by E3.

(In reply to comment #34)

Update: The e3 team implemented a limited version of this in
https://gerrit.wikimedia.org/r/#/c/52468/ and
https://gerrit.wikimedia.org/r/#/c/51881/ - it only makes it available for a
pre-configured small set of categories.

Most of the code required for this is written, it just needs extracting out to its own extension (RedisRandomCategory?), and perhaps an API module. And there shouldn't be *too* many performance issues in getting this on the cluster, considering that it is already deployed (albeit in a limited way) by E3.

I was reading up on redis, and it sounds really cool. However, what I've gathered from my brief look is that it stores all data in memory(?). I can't imagine that would scale to all cats on enwiki (let alone all cats everywhere).

(In reply to comment #35)

I was reading up on redis, and it sounds really cool. However, what I've gathered from my brief look is that it stores all data in memory(?). I can't imagine that would scale to all cats on enwiki (let alone all cats everywhere).

Ah, you're right! Though with appropriate swapping, I suppose you could use it indefinitely (as it swaps out unused pages). But yes, Redis doesn't seem to meet our exact requirements as-is.

Change 71997 had a related patch set uploaded by Brian Wolff:
Add Special:RandomInCategory.

https://gerrit.wikimedia.org/r/71997

(In reply to comment #37)

Change 71997 had a related patch set uploaded by Brian Wolff:
Add Special:RandomInCategory.

https://gerrit.wikimedia.org/r/71997

I had an idea for an efficient method that doesn't need a schema change. It does, however, give quite biased results in some cases. (You can have two of cheap [in terms of the ops work a schema change would need], fast, and good; this one is cheap and fast.)

I think this is good enough for the common use case of people just wanting an entry from a category that is different from last time they hit the random button. (For example to get a random thing out of articles for cleanup or whatever). To do something better would need a schema change, or some other more exotic solution. I think this method could be "good" enough for now.

Algorithm is:
*Get earliest and newest cl_timestamp in a category
*Pick a date in between
*Pick an offset between 0 and 30
*Get the page that is offset number of pages after the date picked.

Thoughts?

A possible tweak would be to randomly choose whether we use cl_timestamp > random_timestamp or cl_timestamp < random_timestamp (along with ASC vs DESC), which might even things out if a category had mostly old entries from very long ago and a few outlier entries from very recently.
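A sketch of the algorithm, including the direction tweak, against an in-memory list standing in for categorylinks rows sorted by cl_timestamp (illustrative Python, not the merged PHP code):

```python
import random

def random_in_category(rows):
    """rows: non-empty list of (cl_timestamp, page_title), sorted by timestamp."""
    earliest, newest = rows[0][0], rows[-1][0]
    t = random.uniform(earliest, newest)  # pick a date in between
    offset = random.randrange(31)         # pick an offset between 0 and 30
    if random.random() < 0.5:
        # cl_timestamp >= t, ascending
        pool = [page for ts, page in rows if ts >= t]
    else:
        # the tweak: cl_timestamp <= t, descending
        pool = [page for ts, page in reversed(rows) if ts <= t]
    # t lies between earliest and newest, so pool is never empty; clamp
    # the offset for categories smaller than the offset range.
    return pool[min(offset, len(pool) - 1)]
```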

(In reply to comment #38)

Algorithm is:
*Get earliest and newest cl_timestamp in a category
*Pick a date in between
*Pick an offset between 0 and 30
*Get the page that is offset number of pages after the date picked.

Thoughts?

So the downside of this is that bulks of pages added to the category at similar times would be consistently underrepresented, if I understand correctly. Those might be category renames, bot additions, new templates including the category... it may make it very hard to clear such big backlogs, but it gives a better representation of more "human" (slow) additions to the category.

Personally, I'd rather see a schema change or Lucene/Solr improvements cover this.

Change 71997 merged by Brion VIBBER:
Add Special:RandomInCategory.

https://gerrit.wikimedia.org/r/71997

(In reply to comment #41)

Personally, I'd rather see a schema change or Lucene/Solr improvements cover
this.

Perhaps open a separate bug for that, since this patch has been merged.

pierre.beaudouin wrote:

Thx for this new special page.

It doesn't work on [[Catégorie:Portail:Hélicoptères/Articles liés]]

https://fr.wikipedia.org/wiki/Sp%C3%A9cial:RandomInCategory/Portail:H%C3%A9licopt%C3%A8res/Articles_li%C3%A9s

(In reply to comment #44)

Thx for this new special page.

It doesn't work on [[Catégorie:Portail:Hélicoptères/Articles liés]]

https://fr.wikipedia.org/wiki/Sp%C3%A9cial:RandomInCategory/Portail:H%C3%A9licopt%C3%A8res/Articles_li%C3%A9s

The Portail: prefix is being eaten. Can you check if it happens with any prefix matching the name of a namespace, or just with any : prefix, and file a bug?

(In reply to comment #44)

Thx for this new special page.

It doesn't work on [[Catégorie:Portail:Hélicoptères/Articles liés]]

https://fr.wikipedia.org/wiki/Sp%C3%A9cial:RandomInCategory/Portail:H%C3%A9licopt%C3%A8res/Articles_li%C3%A9s

This is fixed on master. The next time Wikimedia sites are updated (Thursday), it should be fixed there as well.

Until then, include the Category: prefix with the page name and it should work.

So the downside of this is that bulks of pages added to the category at similar times would be consistently underrepresented, if I understand correctly. Those might be category renames, bot additions, new templates including the category...

Of course Wiktionary provides an extreme example. Discussion continues at https://en.wiktionary.org/wiki/Category_talk:English_plurals#Special:RandomInCategory.2FEnglish_plurals