/*
#    AWFFull - A Webalizer Fork, Full o' features
#
#    PERFORMANCE_TIPS.txt
#        Performance tips
#
#    Copyright (C) 2006, 2008 by Stephen McInerney
#        (spm@stedee.id.au)
#
#    This file is part of AWFFull.
#
#    AWFFull is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    AWFFull is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with AWFFull. If not, see <http://www.gnu.org/licenses/>.
#
*/

AWFFull
A Webalizer Fork, Full o' Features!

Performance Tips
****************

Weblog analysis can be very slow, particularly when you have a lot of data
to process. The flip side is that such analysis can be scripted to run out
of band, e.g. once a day, at 4am.

Sometimes a more interactive process is required, and in that situation
speed can be highly desirable. This document describes some of the ways and
means of making awffull run that little bit faster, for when it *does*
matter.

To assist with the concepts below, a new feature has been added as of v3.4.1:

    --match_counts

It is available as a command line switch only. It outputs all your GroupXXX
options with a count of how often each was matched. The list is in the same
order as in the config file, e.g.:

    ....
    List: SearchEngine
    Matched:       3039   yahoo.
    Matched:        885   msn.com
    Matched:      52633   google.
    Matched:        201   altavista.com
    ....

--- Ordering ----------

The order in which config items are placed makes a quite noticeable
difference in speed. In general testing, I've achieved gains of up to
5000 lines/sec simply by reordering GroupXXX statements, with no loss in
accuracy! How?
The key is understanding how awffull applies those configuration items to a
given log entry. In essence, they are tried one after the other until a
match is found, which implies that a log entry matching nothing goes through
the entire list fruitlessly. Obviously, then, the more popular items should
be located at the head of the list. Incidentally, if anyone has an idea for
a better way of doing this, I'd love to hear from you!

Thus if most of your traffic comes from, e.g., *.aol.com, then it would make
a great deal of sense for your first GroupSite item to be similar to the
below:

    GroupSite *.aol.com AOL

And so on down through your less popular groupings. Unfortunately you need
to run awffull first to discover which sites should be grouped where anyway.
Sorry. But it can and does help with later runs.

This also implies that Group'ing solo items can add a performance increase,
provided they appear often enough.

NB!!! There is a danger here too. Be careful with the ordering of certain
items. Typically the more explicit patterns need to come before the more
general ones. The sample.conf file has an excellent example of this for
matching against Browser Agents, specifically the Internet Explorer matches.

While I'm quite sure that ordering the HideXXX statements will make things
go faster, on modern hardware you'd be hard pressed to notice. ie Don't
bother - order the HideXXX statements to make them easier to understand and
maintain.

In the above example usage of --match_counts, we can quite clearly see that
the match for the Google search engine has been poorly ordered. It should be
placed first instead of 3rd.

--- Wild Cards ------------

Generally awffull uses a very simplistic wildcard method:

    GroupSite *.aol.com AOL

This means that only a log entry whose site portion ends in ".aol.com" would
be a match. Or in shell terms:

    cut -f 1 -d ' ' <logfile> | egrep "\.aol\.com$"

Such a simplistic wildcard does allow the code to have some nice
optimisations.
These were present in webalizer and have been, more or less, carried on.
Just don't try to put a wildcard in the middle of an item. The result will
not be what you were expecting.

This optimisation can be used to good effect:

    IgnoreSite *localhost

This covers, for example, the case where the webserver is making calls
against itself, which we wish to filter out. By wildcarding the start of the
pattern we no longer have to match the entire site address. And they can get
quite long! Instead we only have to match against the last 9 characters.

It won't help a lot, but a few of these can save several hundred or even
thousands of lines per second. In my case, where re-analysing a year's worth
of logs involves hundreds of millions of log lines to process, the savings
in time to my employer are worth a few $$!

Keep in mind that this optimisation works with the wildcard at either the
beginning or the end of a pattern!

--- Patterns ----------

Perhaps somewhat counter intuitively, the longer the pattern you're trying
to match, the faster it will tend to do so. If you really want to know the
gory details of the why, read up on the Boyer-Moore-Horspool pattern
matching algorithm. This was only implemented as of AWFFull v3.3.1, so
don't try this one on earlier versions.
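
For the curious, here is a minimal Python sketch of the textbook
Boyer-Moore-Horspool search. It is NOT AWFFull's own code (AWFFull is written
in C), and the names build_shift_table and horspool_search are made up for
this illustration. It only aims to show why a longer pattern can mean fewer
comparisons: on a mismatch the search window may jump forward by up to the
full pattern length.

```python
def build_shift_table(pattern):
    """Map each character (except the last) to its distance from the pattern's end.

    Later occurrences overwrite earlier ones, so the smallest safe shift wins.
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return (index of first match or -1, number of window alignments tried)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    shift = build_shift_table(pattern)
    pos = 0
    alignments = 0
    while pos <= n - m:
        alignments += 1
        if text[pos:pos + m] == pattern:
            return pos, alignments
        # Jump ahead based on the text character under the pattern's last
        # position; characters not in the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1, alignments


if __name__ == "__main__":
    haystack = "x" * 1000  # hypothetical input containing no match at all
    for needle in ("ab", "abcdefghij"):
        _, tried = horspool_search(haystack, needle)
        print(f"pattern length {len(needle):2d}: {tried} alignments tried")
```

In this contrived worst-case-for-matching input, the 2 character pattern
needs 500 window alignments and the 10 character pattern only 100. Real log
data will be less extreme, but the trend is the same.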