
Inconsistent field separation makes Squid logs in Hadoop largely unusable
Closed, ResolvedPublic

Description

Screenshot of Beeswax showing parse failure

The field separator issue in the handling of Squid logs needs to be sorted out first.

To summarize:

  1. Kafka byte offset is delimited from hostname by a tab (\t).
  2. Other fields are delimited by a space (U+0020).
  3. The content-type field contains unescaped spaces.
  4. Beeswax only supports splitting on a single character.

As a result:

  1. Byte offset is not separable from the hostname ("316554683463cp1043.wikimedia.org")
  2. Unescaped spaces in the content type field cause it to span a variable number of columns.
  3. It is impossible to select the user agent field.
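
The effect can be reproduced with a minimal Python sketch. The two log lines below are hypothetical and use a simplified field layout (offset, hostname, status, content type, user agent); only the offset/hostname prefix of the first line is taken from this report. Splitting on a single space character, as Beeswax does, leaves the offset fused to the hostname and puts the user agent in a different column on each line.

```python
# Hypothetical, simplified log lines (real Squid log lines have more fields).
# The tab sits between the Kafka byte offset and the hostname;
# every other field is separated by a single space.
LINE_A = "316554683463\tcp1043.wikimedia.org 200 text/html; charset=UTF-8 Mozilla/5.0"
LINE_B = "316554683464\tcp1043.wikimedia.org 200 image/png Mozilla/5.0"


def split_single_char(line, delimiter=" "):
    """Split on exactly one delimiter character, like a single-character splitter."""
    return line.split(delimiter)


fields_a = split_single_char(LINE_A)
fields_b = split_single_char(LINE_B)

# The first column still contains the tab, so offset and hostname stay fused.
print(repr(fields_a[0]))

# The content type spans two columns in LINE_A but only one in LINE_B,
# so the user agent has no fixed column index.
print(len(fields_a), fields_a.index("Mozilla/5.0"))
print(len(fields_b), fields_b.index("Mozilla/5.0"))
```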

I'd like a solution to this that does not require that I provide a jar file for customized string processing.
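
One possible direction, sketched here in Python rather than as a jar, is to rewrite each line with a single consistent delimiter before it is queried. This is only an illustration under stated assumptions: the content type is assumed to be the only field containing unescaped spaces, the function name `normalize` and its parameters `fields_before` and `fields_after` are hypothetical, and the sample line uses a simplified layout rather than the real Squid log format.

```python
import re


def normalize(line, fields_before, fields_after):
    """Re-emit one log line as tab-separated fields.

    fields_before: number of fields before the content type, counting the
                   byte offset and the hostname as separate fields.
    fields_after:  number of fields after the content type.
    Assumes the content type is the only field with unescaped spaces.
    """
    # Treat both the tab and the space as delimiters.
    tokens = re.split(r"[\t ]", line.rstrip("\n"))
    cut = len(tokens) - fields_after
    head = tokens[:fields_before]
    content_type = " ".join(tokens[fields_before:cut])  # rejoin the spilled pieces
    tail = tokens[cut:]
    return "\t".join(head + [content_type] + tail)


# Hypothetical, simplified line: offset, hostname, status, content type, user agent.
sample = "316554683463\tcp1043.wikimedia.org 200 text/html; charset=UTF-8 Mozilla/5.0"
normalized = normalize(sample, fields_before=3, fields_after=1)
columns = normalized.split("\t")
print(columns)
```

After normalization every line has the same number of tab-separated columns, so a single-character splitter can address the user agent by a fixed index. The real values of `fields_before` and `fields_after` depend on the actual log format and are not given in this report.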


Version: unspecified
Severity: critical

Attached:

DCKJ9.png (404×744 px, 76 KB)

Details

Reference
bz44236

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:32 AM)
bzimport set Reference to bz44236.
bzimport added a subscriber: Unknown Object (MLST).