/*
#   AWFFull - A Webalizer Fork, Full o' features
#
#   PERFORMANCE_TIPS.txt
#       Performance tips
#
#   Copyright (C) 2006, 2008 by Stephen McInerney
#       (spm@stedee.id.au)
#
#   This file is part of AWFFull.
#
#   AWFFull is free software: you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation, either version 3 of the License, or
#   (at your option) any later version.
#
#   AWFFull is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with AWFFull.  If not, see <http://www.gnu.org/licenses/>.
#
*/

                             AWFFull
                A Webalizer Fork, Full o' Features!

Performance Tips
****************

Weblog analysis can be very slow, particularly when you have a lot of data to
process.  The flip side is that such analysis can be scripted to run out of
band, e.g. once a day at 4am.

Sometimes, though, a more interactive process is required. In that situation
speed becomes highly desirable.

This document describes some of the ways and means of making awffull run that
little bit faster, for when it *does* matter.


To assist with the concepts below, a new feature is available from v3.4.1:
		--match_counts
It is a command line switch only.
It outputs all your GroupXXX options with a count of how often each was
matched. The list is in the same order as in the config file.
e.g.:
....
  List: SearchEngine
    Matched:    3039  yahoo.
    Matched:     885  msn.com
    Matched:   52633  google.
    Matched:     201  altavista.com
....
                               ---

        Ordering
        ----------

The order in which config items are placed makes a noticeable difference in
speed. In general testing, I've achieved gains of up to 5000 lines/sec simply
by reordering GroupXXX statements, with no loss in accuracy!

    How?

The key is understanding how awffull applies those configuration items to a
given log entry. In essence, they are tried one after the other until a match
is found, which implies that an entry with no match walks the entire list
fruitlessly. Obviously, then, the more popular items should be located at the
head of the list. Incidentally, if anyone has an idea for a better way of doing
this, I'd love to hear from you!
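
The first-match behaviour described above can be sketched in a few lines of
Python. This is an illustration only, not awffull's actual code: plain
substring tests stand in for the real matching, and the list reuses the
search engine patterns from the --match_counts example, with made-up labels.

```python
# Sketch only: a first-match-wins scan over an ordered list of patterns.
# Plain substring tests stand in for awffull's real matching.

def first_match(patterns, value):
    """Return (label, comparisons) for the first pattern found in value."""
    comparisons = 0
    for pattern, label in patterns:
        comparisons += 1
        if pattern in value:
            return label, comparisons
    return None, comparisons  # no match: the whole list was scanned


# Hypothetical list, kept in config-file order.
search_engines = [
    ("yahoo.", "Yahoo"),
    ("msn.com", "MSN"),
    ("google.", "Google"),
    ("altavista.com", "AltaVista"),
]

print(first_match(search_engines, "www.google.com"))   # ('Google', 3)
print(first_match(search_engines, "www.example.org"))  # (None, 4)
```

Every entry that matches the third pattern costs three comparisons, and every
entry that matches nothing costs four, which is why the position of a popular
pattern matters.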

Thus if most of your traffic comes from, e.g., *.aol.com, then it makes a
great deal of sense for your first GroupSite item to look like the one
below:
    GroupSite       *.aol.com               AOL

And so on down through your less popular Groupings. Unfortunately you need to run
awffull first to discover which sites should be grouped where anyway. Sorry.
But it can and does help with later runs.

This also implies that Group'ing even a solo item can add a performance
increase, provided it appears often enough.

NB!!! There is a danger here too. Be careful with the ordering of certain
items. Typically you need to be more explicit before you can be more general.
The sample.conf file has an excellent example of this for matching against
Browser Agents, specifically the Internet Explorer matches.
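
To see why explicit items must come before general ones, consider the sketch
below. The patterns, labels and agent string are hypothetical and are not
copied from sample.conf; substring tests again stand in for awffull's real
matching.

```python
# Sketch only: why a general pattern must come after the specific ones.
# Patterns and labels are hypothetical, not taken from sample.conf.

def first_label(patterns, agent):
    """Return the label of the first pattern contained in agent."""
    for pattern, label in patterns:
        if pattern in agent:
            return label
    return None


agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

general_first = [("MSIE", "Internet Explorer"), ("MSIE 6", "IE 6")]
specific_first = [("MSIE 6", "IE 6"), ("MSIE", "Internet Explorer")]

print(first_label(general_first, agent))   # Internet Explorer
print(first_label(specific_first, agent))  # IE 6
```

With the general pattern first, the more specific entry can never be reached,
no matter how often it would have matched.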

While I'm quite sure that ordering the HideXXX statements will make things
go faster, on modern hardware you'd be hard pressed to notice. i.e. don't
bother - order the HideXXX statements to make them easier to understand and
maintain.


In the above example usage of --match_counts, we can quite clearly see that
the match for the Google search engine has been poorly ordered. It should be
placed first instead of third.
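
As a rough illustration, the counts from that example can be turned into a
suggested ordering with a short script. This is a sketch under assumptions:
it parses only the "Matched:" lines exactly as shown in the sample above,
assumes each pattern is a single token without spaces, and both function
names are invented for this example.

```python
import re

# Sample text copied from the --match_counts example above.
sample = """\
  List: SearchEngine
    Matched:    3039  yahoo.
    Matched:     885  msn.com
    Matched:   52633  google.
    Matched:     201  altavista.com
"""


def busiest_first(text):
    """Return (count, pattern) pairs sorted by descending count."""
    pairs = re.findall(r"^[ \t]*Matched:[ \t]+(\d+)[ \t]+(\S+)", text,
                       re.MULTILINE)
    return sorted(((int(n), p) for n, p in pairs), reverse=True)


def total_comparisons(counts):
    """Comparisons spent on matched entries when patterns sit in this order."""
    return sum(pos * n for pos, n in enumerate(counts, start=1))


print(busiest_first(sample)[0])                      # (52633, 'google.')
print(total_comparisons([3039, 885, 52633, 201]))    # 163512
print(total_comparisons([52633, 3039, 885, 201]))    # 62170
```

Sorting purely by count ignores the warning above about explicit items before
general ones, so treat the result as a starting point rather than a finished
config. Entries that match nothing cost a full scan in either ordering and are
not counted here.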

                               ---

        Wild Cards
        ------------

Generally awffull uses a very simplistic wildcard method:

    GroupSite       *.aol.com               AOL

This means that a log entry matches only when its site portion ends with
".aol.com".

Or in shell terms:
    cut -f 1 -d ' ' logfile | grep -E '\.aol\.com$'

Such a simplistic wildcard does allow the code to have some nice optimisations.
These were present in webalizer and have been, more or less, carried on. Just
don't try to put a wildcard in the middle of an item. The result will not be
what you were expecting.
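
One way to picture these wildcard rules is the Python sketch below. It is an
assumed reading, not awffull's code: a leading '*' is compared against the end
of the value and a trailing '*' against the start. Treating a pattern with no
wildcard as a plain substring test is an assumption this document does not
state, and the "/images/*" pattern is hypothetical.

```python
# Sketch only: an assumed reading of the simple wildcard rules.

def simple_match(pattern, value):
    """Match value against a pattern with at most one edge wildcard."""
    if pattern.startswith("*"):
        # Leading wildcard: only the tail of the value is compared.
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        # Trailing wildcard: only the head of the value is compared.
        return value.startswith(pattern[:-1])
    # Assumption: no wildcard means "appears anywhere in the value".
    return pattern in value


print(simple_match("*.aol.com", "cache-ab01.proxy.aol.com"))  # True
print(simple_match("*.aol.com", "aol.com.example.net"))       # False
print(simple_match("/images/*", "/images/logo.png"))          # True
```

Because only one edge of the value is examined, the cost depends on the
pattern length rather than on the length of the value.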

This optimisation can be used to good effect:

    IgnoreSite      *localhost

This covers, for example, the webserver making calls against itself, which we
wish to filter out. By wildcarding the start of the pattern we avoid having
to match the entire site address, and they can get quite long! Instead we
only have to match against the last 9 characters.

It won't help a lot, but a few of these can save several hundred or even
thousands of lines per sec. In my case, where re-analysing a year's worth of
logs involves hundreds of millions of log lines to process, the savings in
time to my employer are worth a few $$!


Keep in mind that this optimisation works both at the beginning and end of a
pattern to match!

                               ---

        Patterns
        ----------

Perhaps somewhat counterintuitively, the longer the pattern you're trying to
match, the faster it will tend to do so. If you really want to know the gory
details of the why, read up on the Boyer-Moore-Horspool pattern matching
algorithm.

This was only implemented as of AWFFull v3.3.1, so don't try this one on
earlier versions.
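
For the curious, here is a minimal Python sketch of the textbook
Boyer-Moore-Horspool search. It is not awffull's implementation; the function
name and the window counter are additions for illustration. It shows why a
longer pattern helps: on a mismatch the search window can jump ahead by up to
the full pattern length.

```python
# Sketch only: textbook Boyer-Moore-Horspool, not awffull's own code.

def horspool_find(text, pattern):
    """Return (index of first match or -1, number of windows examined)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # For each character except the last, how far the window may slide
    # when that character sits under the window's final position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    windows = 0
    while pos + m <= n:
        windows += 1
        if text[pos:pos + m] == pattern:
            return pos, windows
        # Characters absent from the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1, windows


text = "a" * 40
print(horspool_find(text, "zz"))        # (-1, 20)
print(horspool_find(text, "zzzzzzzz"))  # (-1, 5)
print(horspool_find("www.proxy.aol.com", ".aol.com")[0])  # 9
```

In this contrived case the 8 character pattern examines a quarter as many
windows as the 2 character one before giving up. Real log data will not be as
clear cut.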