Page MenuHomePhabricator

Wikibits patch #4 - more date formats
Closed, ResolvedPublic

Description

Author: mhorvath2161

Description:
unified diff file

Detect more date formats when sorting tables.

Currently only 3 formats are detected. The changes increase this number to about two dozen.


Version: unspecified
Severity: enhancement

attachment wikibits_more_date_formats.diff ignored as obsolete

Details

Reference
bz15401

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:21 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz15401.

Just a note to anyone who decides to apply the patch.

This patch also includes a change to the currency which is a different bug, not relevant to this bug.
And it also contains a change to numeric matching which merely adds some useless escape characters.

mhorvath2161 wrote:

unified diff file

Update, minus the change to currency and 'useles' escape sequences.

attachment wikibits_more_date_formats.diff ignored as obsolete

ayg wrote:

You're still removing semicolons. On another stylistic note, try to keep lines down to 80 chars or so if it's reasonable. Those comments in the first chunk should go on different lines, not the end of the current line.

I'll review this at some point, if no one else does, but not this second.

mhorvath2161 wrote:

unified diff file

Woops. Sorry about that. Here's a fixed version.

attachment wikibits_more_date_formats.diff ignored as obsolete

ayg wrote:

+ // matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY)
+ if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/))
+ sortfn = ts_sort_date_1;

Maybe we want to narrow this a bit. Something like "1/3.14" would match, even though it's clearly not a date. I'd enforce that the two delimiters need to be the same. Making sure that neither of the first two parts is greater than 31 might also be a good idea, but on the other hand it might be too complicated.

+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
+ sortfn = ts_sort_date_2;

This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The old code seems to require that month abbreviations be three letters, which is less prone to error (although not as precise as a list).

-function ts_dateToSortKey(date) {
+function ts_dateToSortKey_1(date) {

Could you think of a more descriptive way to distinguish between their names than appending _1 and _2? Same for ts_sort_date_*.

+ date = date.replace(/[\/\.\s]/g, '-');
+ var dateLen = date.length;
+ var firstIdx = date.indexOf('-');
+ var secndIdx = date.indexOf('-', firstIdx + 1);
+ var yearIdx = secndIdx + 1;
+ var yearLen = dateLen - yearIdx;
+ var secndLen = dateLen - yearLen - firstIdx - 2;
+ var year = parseInt(date.substr(yearIdx, yearLen));
...
+ if (ts_europeandate) {
+ var month = date.substr(firstIdx + 1, secndLen);
+ var day = date.substr(0, firstIdx);
+ } else {
+ var month = date.substr(0, firstIdx);
+ var day = date.substr(firstIdx + 1, secndLen);

	}
  • return "00000000";

+ month = parseInt(month);
+ day = parseInt(day);

Couldn't you do this a lot more simply? Like (untested)

var day, month, year;
if( ts_europeandate ) {

		[day, month, year] = date.split( "-" );

} else {

		[month, day, year] = date.split( "-" );

}

Followed by parseInt().

+ var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec";
+ var monthLen = 3;
+ var monthNam = date.match(/[^\d\s\/\.\-]+/)[0];
+ var monthAbv = monthNam.substr(0, monthLen);
+ date = date.replace(/[^\d\s\/\.\-]+/, monthAbv);
+ date = date.replace(/[\s\/\.]/g, '-');
+ date = date.replace(/,/g, '');
+ var dateLen = date.length;
+ var firstIdx = date.indexOf('-');
+ var secndIdx = date.indexOf('-', firstIdx + 1);
+ var yearIdx = secndIdx + 1;
+ var yearLen = dateLen - yearIdx;
+ var secndLen = dateLen - yearLen - firstIdx - 2;
+ var year = parseInt(date.substr(yearIdx, yearLen));

Again, this seems to be way more intricate than it needs to be. How about something like (untested):

var monthLen = 3;
var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen );
var month = monthStr.indexOf( monthName )/monthLen;
var day = parseInt( date.match(/\d\d?/)[1] );
var year = parseInt( date.match(/\d\d?/)[2] );

You're writing like this is C rather than JavaScript, with all those indexes all over the place.

Also, it might be that this function would best be a wrapper around the other function. There seems to be some code duplication here. How about replacing the month abbreviation with the month number and then just passing through to the first date parsing variant?

+ if (date.charCodeAt(0) < 58) {

What's this supposed to mean? Don't use incomprehensible constructs like this, please, *especially* not without comments. Most people don't have the ASCII code tables memorized.

mhorvath2161 wrote:

(In reply to comment #5)

+ // matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY)
+ if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/))
+ sortfn = ts_sort_date_1;

Maybe we want to narrow this a bit. Something like "1/3.14" would match, even
though it's clearly not a date. I'd enforce that the two delimiters need to be
the same. Making sure that neither of the first two parts is greater than 31
might also be a good idea, but on the other hand it might be too complicated.

Good idea. I'm not sure off the top of my head how to ensure the deliminator is the same. I'll have to look it up.

+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
+ sortfn = ts_sort_date_2;

This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The
old code seems to require that month abbreviations be three letters, which is
less prone to error (although not as precise as a list).

It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the search, and the last set of digits must be 2 or 4 digits long.

-function ts_dateToSortKey(date) {
+function ts_dateToSortKey_1(date) {

Could you think of a more descriptive way to distinguish between their names
than appending _1 and _2? Same for ts_sort_date_*.

+ date = date.replace(/[\/\.\s]/g, '-');
+ var dateLen = date.length;
+ var firstIdx = date.indexOf('-');
+ var secndIdx = date.indexOf('-', firstIdx + 1);
+ var yearIdx = secndIdx + 1;
+ var yearLen = dateLen - yearIdx;
+ var secndLen = dateLen - yearLen - firstIdx - 2;
+ var year = parseInt(date.substr(yearIdx, yearLen));
...
+ if (ts_europeandate) {
+ var month = date.substr(firstIdx + 1, secndLen);
+ var day = date.substr(0, firstIdx);
+ } else {
+ var month = date.substr(0, firstIdx);
+ var day = date.substr(firstIdx + 1, secndLen);

	}
  • return "00000000";

+ month = parseInt(month);
+ day = parseInt(day);

Couldn't you do this a lot more simply? Like (untested)

var day, month, year;
if( ts_europeandate ) {
        [day, month, year] = date.split( "-" );
} else {
        [month, day, year] = date.split( "-" );
}

I haven't looked too closely, but that looks like a good alternative.

Followed by parseInt().

+ var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec";
+ var monthLen = 3;
+ var monthNam = date.match(/[^\d\s\/\.\-]+/)[0];
+ var monthAbv = monthNam.substr(0, monthLen);
+ date = date.replace(/[^\d\s\/\.\-]+/, monthAbv);
+ date = date.replace(/[\s\/\.]/g, '-');
+ date = date.replace(/,/g, '');
+ var dateLen = date.length;
+ var firstIdx = date.indexOf('-');
+ var secndIdx = date.indexOf('-', firstIdx + 1);
+ var yearIdx = secndIdx + 1;
+ var yearLen = dateLen - yearIdx;
+ var secndLen = dateLen - yearLen - firstIdx - 2;
+ var year = parseInt(date.substr(yearIdx, yearLen));

Again, this seems to be way more intricate than it needs to be. How about
something like (untested):

var monthLen = 3;
var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen );
var month = monthStr.indexOf( monthName )/monthLen;
var day = parseInt( date.match(/\d\d?/)[1] );
var year = parseInt( date.match(/\d\d?/)[2] );

You're writing like this is C rather than JavaScript, with all those indexes
all over the place.

Ditto.

Also, it might be that this function would best be a wrapper around the other
function. There seems to be some code duplication here. How about replacing
the month abbreviation with the month number and then just passing through to
the first date parsing variant?

Ditto.

+ if (date.charCodeAt(0) < 58) {

What's this supposed to mean? Don't use incomprehensible constructs like this,
please, *especially* not without comments. Most people don't have the ASCII
code tables memorized.

This just tests whether the first character is a digit or not in order to differentiate between the month-day-year (American) and day-month-year (European) formats. It could be replaced with a regexp. Or, the variable "ts_europeandate" could be used instead. However, I'd rather not use "ts_europeandate" if at all possible.

mhorvath2161 wrote:

(In reply to comment #6)

Couldn't you do this a lot more simply? Like (untested)

var day, month, year;
if( ts_europeandate ) {
        [day, month, year] = date.split( "-" );
} else {
        [month, day, year] = date.split( "-" );
}

I haven't looked too closely, but that looks like a good alternative.

Errr, woops.

[day, month, year] is not valid syntax. However, storing the values in an array would work (which is your general idea I think).

mhorvath2161 wrote:

unified diff file

I tried to implement as many of your suggestions as possible. I also added support (and found a bug) for incomplete dates consisting of only a month and a day.

attachment wikibits_more_date_formats_b.diff ignored as obsolete

mhorvath2161 wrote:

unified diff file

Update. I discovered how to collapse the multiple, similar regexps to a single line.

attachment wikibits_more_date_formats_b.diff ignored as obsolete

mhorvath2161 wrote:

unified diff file

Tiny tweak.

attachment wikibits_more_date_formats_b.diff ignored as obsolete

ayg wrote:

(In reply to comment #6)

Good idea. I'm not sure off the top of my head how to ensure the deliminator is
the same. I'll have to look it up.

Back-references should work.

+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
+ sortfn = ts_sort_date_2;

This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The
old code seems to require that month abbreviations be three letters, which is
less prone to error (although not as precise as a list).

It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the
search, and the last set of digits must be 2 or 4 digits long.

[^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash, dot, whitespace, or hyphen. You're right about the other part, but "Score: 12-10" will be picked up AFAICT.

Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen to the end so you don't have to escape it, and don't escape the dot (it has no special meaning in character classes).

+ if (date.charCodeAt(0) < 58) {

What's this supposed to mean? Don't use incomprehensible constructs like this,
please, *especially* not without comments. Most people don't have the ASCII
code tables memorized.

This just tests whether the first character is a digit or not in order to
differentiate between the month-day-year (American) and day-month-year
(European) formats. It could be replaced with a regexp. Or, the variable
"ts_europeandate" could be used instead. However, I'd rather not use
"ts_europeandate" if at all possible.

Find something else for that than comparing raw ASCII codes, please. :)

(In reply to comment #7)

(In reply to comment #6)

Couldn't you do this a lot more simply? Like (untested)

var day, month, year;
if( ts_europeandate ) {
        [day, month, year] = date.split( "-" );
} else {
        [month, day, year] = date.split( "-" );
}

I haven't looked too closely, but that looks like a good alternative.

Errr, woops.

[day, month, year] is not valid syntax. However, storing the values in an array
would work (which is your general idea I think).

It was supposed to mean

day = date.split("-")[0]
month = date.split("-")[1]
year = date.split("-")[2]

Syntax like this works in Python, and in PHP as well with the list() function. I was told the syntax I gave worked, but maybe not.

Will review new patch at some point.

mhorvath2161 wrote:

First of all, I changed my display name from my real name to my Wikipedia account.

(In reply to comment #11)

(In reply to comment #6)

Good idea. I'm not sure off the top of my head how to ensure the deliminator is
the same. I'll have to look it up.

Back-references should work.

Yes, I figured that out and already incorporated them.

It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the
search, and the last set of digits must be 2 or 4 digits long.

[^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash,
dot, whitespace, or hyphen. You're right about the other part, but "Score:
12-10" will be picked up AFAICT.

You're right. What I wanted was \w, except without the digits. The problem with [a-zA-Z] is that it is limited to certain locales. Using unicode ranges instead would be lengthy and complicated.

Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen
to the end so you don't have to escape it, and don't escape the dot (it has no
special meaning in character classes).

I have to disagree with you on this one. I like to keep all special characters escaped. It makes the code easier to read, and I don't have to worry about treating certain characters differently in special cases (i.e., the dot and dash).

+ if (date.charCodeAt(0) < 58) {

What's this supposed to mean? Don't use incomprehensible constructs like this,
please, *especially* not without comments. Most people don't have the ASCII
code tables memorized.

This just tests whether the first character is a digit or not in order to
differentiate between the month-day-year (American) and day-month-year
(European) formats. It could be replaced with a regexp. Or, the variable
"ts_europeandate" could be used instead. However, I'd rather not use
"ts_europeandate" if at all possible.

Find something else for that than comparing raw ASCII codes, please. :)

OK.

(In reply to comment #7)

[day, month, year] is not valid syntax. However, storing the values in an array
would work (which is your general idea I think).

It was supposed to mean

day = date.split("-")[0]
month = date.split("-")[1]
year = date.split("-")[2]

Syntax like this works in Python, and in PHP as well with the list() function.
I was told the syntax I gave worked, but maybe not.
Will review new patch at some point.

Yes, this is what I've done. Check the most recent version I uploaded.

mhorvath2161 wrote:

Crap! I keep getting logged out. I've disabled the "restrict to a single IP" option in the Login screen.

mhorvath2161 wrote:

unified diff file

Update. Finding a unicode range didn't end up being so difficult after all. The code now uses [a-zA-Z\u00c0-\uffff] instead of [^\d\/\.\-\s].

Attached:

bluehairedlawyer wrote:

*** This bug has been marked as a duplicate of bug 8226 ***