
Request to access redacted webproxy logfiles of (Tool) Labs
Closed, Invalid · Public

Description

Author: metatron

Description:
I want to integrate the page counts of Tool Labs and Labs into the tool https://tools.wmflabs.org/wikiviewstats/ . For this, it would be necessary to have access to redacted webproxy logs, covering both the old web (Apache) and the new web (lighttpd) setups.

It would be very helpful if these logs could be structured in the same way as the current pagecount dumps and released on a per-hour basis.

Further suggestions:

  • the identifier could be toollabs or labs.toollabs
  • the query-string part of the URL (?xyz=..) should be removed completely

Reference:
1.) IRC Petan Jan 2, 2014

2.) WIP: Tools: Add infrastructure for AWStats
https://gerrit.wikimedia.org/r/#/c/80332/

3.) IRC scfc_de Jan 2, 2014
scfc_de: hedonil: I hope to have finished puppetizing tools-webproxy by the end of the week (the AWStats stuff is done IIRC). As -webproxy is the heart of the web access, review & deployment will then be *very* careful :-), but in general, depending on Coren's schedule, it should be deployable by between the end of next week and the end of the month.

The current pagecount dumps are generated on a per-hour basis and share the following structure:

filename, e.g.:
pagecounts-20140101-020000.gz

Fields: 1. identifier, 2. pagetitle, 3. hits, 4. bytes

En.d perform 3 60088
En.d rainforest 3 33780
En.d servers 3 22471
En.d situation 1 107043
En.d upwards 1 32565
En.d variety 2 59495
En Allergy 3 324964
En Arthur_Rubinstein 1 0
En Article 1 0
En British_cuisine 1 191021

hierarchical structure of identifier

en - Wikipedia (en)
en.b - Wikibooks (en)
en.d - Wiktionary (en)
en.n - Wikinews (en)
etc.
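A minimal Python sketch of reading this format, just to make the field layout above concrete. The helper name parse_pagecount_line is hypothetical, not part of any existing tooling:

```python
# Parse one line of a pagecounts dump into its four whitespace-separated
# fields: identifier, pagetitle, hits, bytes (format described above).
def parse_pagecount_line(line):
    identifier, title, hits, size = line.split()
    return {
        "identifier": identifier,  # e.g. "en.d" = English Wiktionary
        "title": title,
        "hits": int(hits),
        "bytes": int(size),
    }

record = parse_pagecount_line("En.d perform 3 60088")
print(record["hits"], record["bytes"])  # → 3 60088
```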


Version: unspecified
Severity: normal

Details

Reference
bz59222

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:16 AM
bzimport set Reference to bz59222.
bzimport added a subscriber: Unknown Object (MLST).

That should be relatively simple to do. I do not, however, have the bandwidth to write this myself at this time.

The logs are currently in Apache common format; if someone provides a suitable script to generate them I'll add them to the tool chain.

metatron wrote:

(18:38:31) hedonil: YuviPanda: Coren: AFAIK apergos is the one who manages this log stuff in operations. maybe one could borrow some lines of his script, so that logfiles are summarized per hour and share the same structure. Seems to be powerful as it handles all syslogs and accesslogs from the varnishes.

So I added apergos to CC. Maybe they can provide some help.

Heh, I don't manage it, I just know where stuff that lands on dumps.wikimedia.org comes from. Just for the sake of clarification, we have logs written already that get saved someplace?

metatron wrote:

Now that the new YuviProxy is in place, I just need access to log dumps (IPs stripped). sed & awk will do the rest of the job.

metatron wrote:

Any progress on this thing?

As already mentioned, both nginx proxies (domainproxy & urlsproxy) went live.
Thus it should be trivial to run some sed to sanitize the logs and make them publicly available on /dumps or /shared.

Even if they can't be summarized in the requested manner, that would still be fine.

Any objections to that approach? If yes, which? Why? If not, when?

I can make redacted logs available in a familiar pattern, with the following stripped out:

  1. IP Address
  2. Referrer fields

The only problem is that currently the proxy's logs are rotated pretty frequently, so I'll have to find some method of archiving them.

metatron wrote:

Great! (Keeping UA & referer would be fine, though, as they are already present in the tools logs.) Concerning archiving: maybe one could borrow some ideas from the production varnishes ;-)

Hmm, I don't see any non-WMF referrers in the access.log (I looked at heritage's logs). Can someone verify/confirm?

After conversations with Coren:

Lighty's default format doesn't record referrers, so there is nothing to strip there. I'll just strip out IPs.

So, current plan would be to:

  1. Have logrotate set to rotate logs daily
  2. Set up a post-processing script that runs after the rotation has happened and strips IPs (more likely, just replaces them with 127.0.0.1)
  3. Move them somewhere appropriate

This would incur a one-day delay before logs are made available, which I guess is ok?
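The IP-scrubbing step of that plan could be sketched roughly like this in Python, assuming the client IP is the first whitespace-separated field of each line (as in the common log format nginx/Apache write by default). The function names and file handling here are illustrative, not the actual script:

```python
# Replace the leading client IP of each access-log line with 127.0.0.1,
# as suggested in the plan above. Assumes common log format, where the
# client address is the first field on the line.
import re

IP_RE = re.compile(r"^\S+")  # first whitespace-separated field

def scrub_line(line):
    return IP_RE.sub("127.0.0.1", line, count=1)

def scrub_file(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(scrub_line(line))
```

Running this as a logrotate postrotate hook on the rotated file would produce the redacted copy to publish.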

metatron wrote:

Would it be possible to logrotate/process them on an hourly basis?
Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/
Just to stay compatible and to allow more fine-grained analysis (load over the day). This would be really great, if at all possible.

metatron wrote:

(In reply to metatron from comment #11)

> Would it be possible to logrotate/process them on an hourly basis?
> Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/
> Just to stay compatible and to allow more fine-grained analysis (load over the day). This would be really great, if at all possible.

Well, this only applies if logs are not summarized. But with an hourly rotation, single files stay small and the delay for "near-real-time" analysis would be only 1 hour (instead of 1 day).

metatron wrote:

If you need some helping hands, provide me some 100k raw log lines and I'll write a bash script with awk to summarize & format the logs exactly like the pageview dumps.

@metatron: Help would be appreciated! I've copied a sample log scrubbed of IPs (with 1000 entries) to /shared/sample-nginx-log/cleaned-samplelog.log. If you can write a script (Python please? pretty please?) that summarizes them to be like the pageview dumps, I'd be happy to get that puppetized.

Hopefully the 1000 log entries are enough. I can provide a larger sample if needed.

And I guess the format would be:

  1. toolname
  2. url
  3. hits
  4. bytes

I wonder if we should actually augment this with other stats, such as:

  1. error responses (non-200)
  2. UAs.

Perhaps the solution is to run something like AWStats or similar on the nginx host itself. Investigating.

metatron wrote:

Working on the little routine right now. It will provide two formats; query strings are stripped from both request and referer.

1.) "Tools Aggregate Format"

Period: per hour & per day
Format: toolname - hits, size, 2xx, 3xx, 4xx, 5xx

tools.admin - 5 69588 5 0 0 0
tools.awb - 1 349 0 0 0 1
tools.betacommand-dev - 2 1639 1 0 1 0
tools.bibleversefinder - 2 3195 2 0 0 0
tools.blockcalc - 4 42144 4 0 0 0
tools.bookmanagerv2 - 4 5990 0 0 4 0
tools.catfood - 5 24568 4 0 0 1
tools.catscan2 - 11 392817 5 0 6 0
tools.checkwiki - 4 241 4 0 0 0
tools.cluebot - 5 7886 3 2 0 0
tools.connectivity - 3 11448 0 0 0 3
tools.croptool - 1 19553 1 0 0 0
tools.dewikinews-rss - 3 31143 3 0 0 0
tools.dupdet - 1 11433 1 0 0 0
tools.enwp10 - 7 25721 0 0 0 7
tools.geohack - 383 3705405 319 49 15 0
tools.glamtools - 30 881853 30 0 0 0

2.) "Std. pageview-dumps format" (to be compatible)

Period: per hour
Format: project request hits size

labs.tools / 3 51037
labs.tools /Tool_Labs_logo_thumb.png 2 20140
labs.tools /admin/img/desc_dark.png 1 1036
labs.tools /admin/libs/jquery.js 2 57324
labs.tools /admin/libs/jquery.tablesorter.min.js 2 11228
labs.tools /apple-touch-icon-precomposed.png 1 1382
labs.tools /apple-touch-icon.png 1 6451
labs.tools /awb/stats/ 1 349
labs.tools /betacommand-dev/UserCompare/TreCoolGuy.html 1 1491
labs.tools /betacommand-dev/cgi-bin/uc 1 148
labs.tools /bibleversefinder/ 2 3195
labs.tools /blockcalc/index.php 1 974
labs.tools /blockcalc/style/backdrop.png 1 37681
labs.tools /blockcalc/style/style.css 1 361
labs.tools /blockcalc/style/wikimedia-toolserver-button.png 1 3128
labs.tools /bookmanagerv2/w/index.php 2 714
labs.tools /catfood/catfood.php 5 24568
labs.tools /catscan2/catscan2.php/CategoryIntersect.php 1 16041
labs.tools /catscan2/catscan2.php/Gallery.php 2 7451
labs.tools /catscan2/cross_cats.php 1 357043
labs.tools /catscan2/pages_in_cats.php 2 3889
labs.tools /catscan2/quick_intersection.php 5 8393
labs.tools /checkwiki/cgi-bin/checkwiki_bots.cgi 4 241
labs.tools /cluebot/ 5 7886
labs.tools /connectivity/cgi-bin/go.sh 3 11448
labs.tools /croptool/ 1 19553
labs.tools /dewikinews-rss/ 1 16589
labs.tools /dewikinews-rss/kategorie 2 14554
labs.tools /dupdet/compare.php 1 11433
labs.tools /enwp10/cgi-bin/list2.fcgi 7 25721
labs.tools /favicon.ico 17 256462
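A hedged sketch of how the two summaries above might be produced from already-parsed log entries. The function names, the (path, status, size) tuple shape, and deriving the tool name from the first path segment are assumptions for illustration, not metatron's actual routine:

```python
# Aggregate parsed access-log entries into the two formats described
# above: per-URL (pageview-dumps style) and per-tool (with 2xx-5xx
# status buckets). Query strings are stripped, as the routine specifies.
from collections import defaultdict
from urllib.parse import urlsplit

def summarize(entries):
    """entries: iterable of (request_path, status, size) tuples."""
    per_url = defaultdict(lambda: [0, 0])               # path -> [hits, bytes]
    per_tool = defaultdict(lambda: [0, 0, 0, 0, 0, 0])  # tool -> [hits, bytes, 2xx, 3xx, 4xx, 5xx]
    for path, status, size in entries:
        path = urlsplit(path).path                      # drop the ?query part
        per_url[path][0] += 1
        per_url[path][1] += size
        parts = path.split("/")
        tool = "tools." + parts[1] if len(parts) > 1 and parts[1] else "tools"
        rec = per_tool[tool]
        rec[0] += 1
        rec[1] += size
        rec[status // 100] += 1                         # 2xx -> index 2, ... 5xx -> index 5
    return per_url, per_tool

def dump_lines(per_url, project="labs.tools"):
    """Emit 'Std. pageview-dumps format': project request hits size."""
    for path in sorted(per_url):
        hits, size = per_url[path]
        yield f"{project} {path} {hits} {size}"
```

Feeding this the scrubbed nginx log and splitting the output by hour would give files compatible with the pagecounts-raw dumps.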

scfc claimed this task.

With the task reporter away, I'm closing this for the moment.

This is one of those tasks that require a high degree of interaction between the different parties, as it makes no sense to produce a design-by-committee that doesn't fulfill the actual requirements in practice.

If someone feels passionate about this and needs data for a consuming application, please reopen.