
OpenSearchXml first sentences extraction produces bad results
Closed, Resolved (Public)

Description

The current regex, roughly "end capture after the second dot followed by whitespace", produces wildly inaccurate results for sentences with dots in the middle, for example when the article title contains dots:

https://en.wikipedia.org/w/api.php?action=opensearch&format=xmlfm&search=.s.p.%20v&limit=10

<Item>
  <Text xml:space="preserve">S. P. Venkatesh</Text>
  <Description xml:space="preserve">S. P. </Description>
  <Url xml:space="preserve">https://en.wikipedia.org/wiki/S._P._Venkatesh</Url>
</Item>
<Item>
  <Text xml:space="preserve">S. P. Velumani</Text>
  <Description xml:space="preserve">S. P. </Description>
  <Url xml:space="preserve">https://en.wikipedia.org/wiki/S._P._Velumani</Url>
</Item>

It should be something like "first dot followed by whitespace after a certain number of characters".
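
A minimal sketch of that heuristic, in Python for illustration only. The function name `first_sentence`, the `min_chars` default of 20, and the sample text are all assumptions, not the actual OpenSearchXml code or the fix that was eventually committed:

```python
import re


def first_sentence(extract: str, min_chars: int = 20) -> str:
    """Cut `extract` at the first dot followed by whitespace (or end of
    text), ignoring any dot within the first `min_chars` characters.

    `min_chars` is a hypothetical tuning knob; 20 is an arbitrary choice.
    """
    # Lazily consume at least `min_chars` characters, then stop at the
    # first dot that is followed by whitespace or the end of the string.
    pattern = re.compile(r"(.{%d,}?\.)(?:\s|$)" % min_chars, re.DOTALL)
    match = pattern.match(extract)
    # No suitable dot found: fall back to the whole extract.
    return match.group(1) if match else extract


# Made-up sample text with initials at the start, like the titles above.
sample = "S. P. Example was a hypothetical composer. He wrote many songs."
print(first_sentence(sample))
```

With the threshold in place the dots in "S. P." are skipped, whereas `min_chars=0` would cut the description at "S.". A fixed character threshold is still only a heuristic: abbreviations later in the sentence can end the capture early.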


Version: unspecified
Severity: normal

Details

Reference
bz35083

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 12:09 AM
bzimport set Reference to bz35083.

Fixed (well, improved) in r113475.