
replace.py does not recognize "\r\n" pattern
Closed, Declined, Public

Description

In compat:
replace.py -regex -nocase -file:aa.log "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat" -summary:"řádková verze {{Commonscat}}"

Getting 60 pages from wikipedia:cs...
...
No changes were necessary in [[Roman Polák (lední hokejista)]]

>>> Roman Polanski <<<
- {{Commonscat|Roman Polanski}}
+ * {{Commonscat|Roman Polanski}}

In core, the same command:
pwb.py replace -regex -nocase -file:aa.log "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat" -summary:"řádková verze {{Commonscat}}"

Retrieving 50 pages from wikipedia:cs.
...
No changes were necessary in [[Roman Polanski]]
No changes were necessary in [[Roman Polák (lední hokejista)]]
No changes were necessary in [[Roman Romaněnko]]

Why?


Version: core-(2.0)
Severity: enhancement

Details

Reference
bz70607

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 3:45 AM
bzimport set Reference to bz70607.

After some testing: core does not recognize \r\n, only \n.

There is a bug in compat's PreloadingpageGenerator which makes it (incorrectly) return page content with '\r\n' instead of '\n'. compat's page.get() /does/ return '\n' by default.

I think using \n makes much more sense (and note that this works for both \n *and* \r\n due to Python's universal newlines system), so I'm not even sure whether we should support \r\n at all.

Marking it as a low-priority feature request for now.

Is that a problem of the bot, then? Shouldn't it suffice to edit the regex? If you want to be sure, you could use (?:\r|\r\n|\n) instead of exactly \r\n.
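A minimal sketch of that suggestion, using only Python's re module. The name NEWLINE and the shortened sample strings are illustrative assumptions, not taken from the bot; the point is only that the alternation group accepts a bare \r, a \r\n pair, or a bare \n.

```python
import re

# Hypothetical helper pattern: matches one line ending of any common style.
NEWLINE = r'(?:\r|\r\n|\n)'
pattern = re.compile(r'odkazy' + NEWLINE + r'\{\{Commonscat')

# Illustrative sample texts, one per line-ending convention.
samples = {
    'unix': 'odkazy\n{{Commonscat',
    'windows': 'odkazy\r\n{{Commonscat',
    'old_mac': 'odkazy\r{{Commonscat',
}

# For the CRLF sample the engine first tries the bare \r alternative,
# fails on the following \n, and backtracks to the \r\n alternative.
results = {name: bool(pattern.search(text)) for name, text in samples.items()}
print(results)
```

Listing \r\n before \r in the group would save that one backtracking step, but the set of strings matched is the same.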

Okay, I'm a bit confused about the newlines now, as

re.match(r'\n', '\r\n')

does not work. However,

python replace.py -lang:cs -regex -nocase -page:"Roman Polanski" '==\s*Externí odkazy(.*?)\n\{\{Commonscat' '== Externí odkazy\1\n* {{Commonscat' -summary:"řádková verze {{Commonscat}}"

*did* work in compat (i.e. the variant without \r in it). I'm not sure why, though.
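One possible explanation, offered as an assumption rather than a confirmed diagnosis of what compat did: without re.DOTALL, '.' matches every character except '\n', so the (.*?) group in the pattern can absorb a trailing '\r' before the literal \n. The sketch below uses a made-up page text to show this at the re level only.

```python
import re

# re.match only tries position 0, and '\r\n' begins with '\r', so no match.
direct = re.match(r'\n', '\r\n')

# Hypothetical page text with CRLF line endings.
crlf_text = '== Externí odkazy ==\r\n{{Commonscat|Roman Polanski}}'

# Without re.DOTALL, '.' can consume '\r', so the lazy group grows
# until it ends just before the '\n'.
with_group = re.search(r'odkazy(.*?)\n\{\{Commonscat', crlf_text)
captured = with_group.group(1)

# With nothing in the pattern able to consume the '\r', the search fails.
without_group = re.search(r'odkazy ==\n\{\{Commonscat', crlf_text)

print(direct, repr(captured), without_group)
```

If this is what happened, the captured group would carry the stray '\r' into the replacement text via \1.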

compat retrieves \r\n as the line ending via Special:Export, whereas core always gets \n. See also the config.line_separator variable.

You may use \r?\n in the regex; it works for both framework branches.
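As a rough check of that advice outside the bot, the pattern from the report can be tried with \r?\n against both kinds of text. This is a sketch using plain re.sub; the sample page texts are invented, and mapping -nocase to re.IGNORECASE is an assumption about how the option is applied.

```python
import re

# Pattern and replacement from the report, with \r\n relaxed to \r?\n.
PATTERN = re.compile(r'==\s*Externí odkazy(.*?)\r?\n\{\{Commonscat',
                     re.IGNORECASE)
REPLACEMENT = r'== Externí odkazy\1\n* {{Commonscat'

# Invented sample texts: one with LF endings, one with CRLF endings.
lf_text = '== Externí odkazy ==\n{{Commonscat|Roman Polanski}}\n'
crlf_text = '== Externí odkazy ==\r\n{{Commonscat|Roman Polanski}}\r\n'

lf_result = PATTERN.sub(REPLACEMENT, lf_text)
crlf_result = PATTERN.sub(REPLACEMENT, crlf_text)

print(repr(lf_result))
print(repr(crlf_result))
```

Note that on CRLF input the replacement writes a bare \n at the matched line break, so the result has mixed line endings unless the text is normalized separately.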