$Id: README,v 1.1.2.18 2008-08-19 17:50:07 lacos Exp $


License
=======

lbunzip2 Copyright (C) 2008 Laszlo Ersek. Released under the GNU GPL v2 or
later. Contact: lacos@elte.hu


Design (since lbunzip2-0.02-shift)
==================================

  1 splitter --[splitq.used]--> n workers --[muxq]--> 1 multiplexer
      ^                                                     |
      |                                                     |
      +----------[splitq.free]------------------------------+

There is one splitter splitting the input into blocks. The blocks are
delimited in the bzip2 stream by block headers and (at the end of one stream
-- multiple streams can be concatenated) an end-of-stream marker. These are
special sequences of 48 bits each that can delimit the blocks at virtually
any bit position. They are also not guaranteed not to occur in the Huffman
bit stream (ie. inside one block), but that should be very unlikely.

The splitter puts the compressed blocks to [splitq.used]. In the beginning,
a fixed number of slots are allocated (FAC_SPLITQ for each worker thread,
see lbunzip2.h). The splitter takes one slot from [splitq.free], fills it,
then puts it to [splitq.used]. If [splitq.free] is empty, the splitter
blocks. A slot is called "splitb" in the source (for "splitter block").
Each splitb has a serial number, which denotes the compressed block's
position in the input stream. Both [splitq.used] and [splitq.free] are
regular FIFO's.

Each worker takes one filled slot from [splitq.used], and puts uncompressed
sub-blocks on [muxq]. An uncompressed sub-block is called "muxb" in the
source (for "muxer block"). Sub-blocks exist because one compressed block
(splitb) can grow very big when uncompressed, and we don't want to delay
sending the first bytes of the decompressed block to the muxer until the
last bytes are decompressed as well. Thus the decompression of one splitb
can result in multiple muxb's, each muxb carrying at most MX_BUNZIP2 bytes
of decompressed data.
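The delimiter scan described above can be sketched as follows, a minimal
illustration assuming the standard bzip2 magic values (block header
0x314159265359, end-of-stream marker 0x177245385090); the function and
variable names are made up for this sketch and are not the actual lbunzip2
code:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_HDR_MAGIC 0x314159265359ULL /* bzip2 block header ("pi") */
#define EOS_MAGIC       0x177245385090ULL /* end-of-stream ("sqrt(pi)") */

/* Count 48-bit delimiters in "len" bytes by shifting the input, one bit
   at a time, through a 48-bit window and comparing against both magics.
   This mirrors the idea of the splitter's scan, not its actual code. */
unsigned count_delimiters(const unsigned char *buf, size_t len)
{
    uint64_t window = 0;
    unsigned hits = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--) {
            window = ((window << 1) | ((buf[i] >> b) & 1u))
                     & 0xFFFFFFFFFFFFULL; /* keep the low 48 bits */
            if (window == BLOCK_HDR_MAGIC || window == EOS_MAGIC)
                hits++;
        }
    return hits;
}
```

Since the scan checks the window at every bit position, it finds delimiters
at arbitrary bit offsets, which is exactly why a false positive inside the
Huffman bit stream is possible in principle.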
Each muxb has one serial and one sub-serial number, the serial identifying
the splitb out of which it was decompressed, and the sub-serial identifying
the muxb's position within the sequence of all muxb's for this splitb. The
last muxb for the splitb carries the splitb itself along to the muxer.

[muxq] is a red-black tree (a heap would have been better, perhaps). It has
to support quick random insertion (because workers can decompress their
respective splitb's in any order), and quick minimum selection and deletion
(because the muxer can only write out the next decompressed sub-block in
stream order). [muxq] reorders and reassembles the muxb's so that muxb's
uncompressed from the same splitb appear in order, and so that all muxb's
for a splitb that was read earlier are written to stdout earlier than all
muxb's for a splitb that was read later. This is a lexicographical ordering
on (serial, sub-serial) pairs.

When the multiplexer reaches the last sub-block for one splitb, it takes
the splitb carried with that muxb, and puts it back to [splitq.free]. This
way, if the output stream is blocked, the muxer will hang on the write()
system call, and won't be able to return slots to the splitter. After a
while, the splitter will block. All the workers finish decompressing their
current compressed blocks, exhaust [splitq.used] (containing compressed
blocks) and block too. The muxer will still be able to resume its task
after write() returns, since the muxb's released until that point form a
complete sequence together.

This approach throttles the decompression when the multiplexer blocks on
write() without causing a deadlock: the blocking of the muxer doesn't
directly block any random workers (as it would with a fixed size queue
between the workers and the muxer), and the situation cannot arise where
the workers wait for the muxer to make room in [muxq], while the muxer
waits for the workers to emit the next dumpable muxb.
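The lexicographical ordering on (serial, sub-serial) pairs can be sketched
as a comparator suitable for the red-black tree; the struct and field names
below are illustrative, not the actual lbunzip2 identifiers:

```c
#include <stdint.h>

/* Illustrative muxb ordering: earlier splitb's always come first, and
   within one splitb, muxb's come in decompression order. */
struct muxb {
    uint64_t serial;     /* which splitb this muxb was decompressed from */
    uint64_t sub_serial; /* position within that splitb's muxb sequence */
    /* ... decompressed payload would follow here ... */
};

/* Return <0, 0, or >0 for a < b, a == b, a > b in stream order. */
int muxb_cmp(const struct muxb *a, const struct muxb *b)
{
    if (a->serial != b->serial)
        return a->serial < b->serial ? -1 : 1;
    if (a->sub_serial != b->sub_serial)
        return a->sub_serial < b->sub_serial ? -1 : 1;
    return 0;
}
```

With this ordering, the tree minimum is always the next muxb the muxer may
write, once all smaller (serial, sub-serial) pairs have been dumped.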
Instead, the blocking of the muxer makes the splitter drain [splitq.free]
and then block. The splitter works sequentially, thus after it blocks and
the workers drain [splitq.used], [muxq] will contain the muxb the muxer
will have to dump once it gets past write().

The memory usage is at its maximum when all slots initially allocated are
attached to muxb's in [muxq], and the muxer is blocked. If the number of
workers is num_workers, then the number of slots is num_workers *
FAC_SPLITQ. Each splitb holds at most MX_BZIP2 bytes, and all their
decompressed counterparts are also held in memory (in muxb's). Let's define
K to be the maximum decompressed:compressed ratio ever possible for bzip2.
(I don't know its value, but http://www.paul.sladen.org/projects/pyflate/
suggests you can make 45,899,235 bytes out of 40 via bunzip2, rendering K
more than 1,147,480.) Then the maximum memory usage is roughly

  num_workers * FAC_SPLITQ * (MX_BZIP2 + K * MX_BZIP2)
    = num_workers * FAC_SPLITQ * MX_BZIP2 * (1 + K)

which is huge, but fairly improbable, and still finite.


Bugs
====

Probably.

The splitter doesn't check the original combined CRC.

The splitter doesn't really parse the bzip2 blocks, just looks for the
block headers and end-of-stream markers. It never stops searching for the
first block header in the first bzip2 stream or after an end-of-stream
marker. (Try it with /dev/zero.)

Presently, I'm not certain whether lbunzip2 universally supports bzip2
streams created with "bzip2 -1" to "-8", although lbunzip2 successfully
decompressed all such streams I used for testing. The question arises
because in the recreated one-block bzip2 streams I pass to libbz2 I always
indicate an input (uncompressed) blocksize switch of "-9".

Please report bugs to <lacos@elte.hu>. Thank you.
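The memory bound above can be sketched numerically. The helper below takes
all quantities as parameters, since FAC_SPLITQ and MX_BZIP2 live in
lbunzip2.h and their actual values are not assumed here:

```c
/* Rough worst-case memory bound from the formula above:
   num_workers * FAC_SPLITQ * MX_BZIP2 * (1 + K).
   All inputs are caller-supplied; nothing here reflects the real
   lbunzip2.h constants. */
double worst_case_bytes(double num_workers, double fac_splitq,
                        double mx_bzip2, double k)
{
    return num_workers * fac_splitq * mx_bzip2 * (1.0 + k);
}
```

For example, with placeholder values of 2 workers, FAC_SPLITQ = 2,
MX_BZIP2 = 1 MB, and K taken from the pyflate example (45,899,235 / 40,
i.e. more than 1,147,480), the bound comes out around 4.8e12 bytes: huge,
as the text says, but finite.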
Performance
===========

Test configurations
-------------------

"alpha":
  CPU (psrinfo -v)   : 8 x The alpha EV7 (21364) processor operates at
                       1300 MHz, has a cache size of 1835008 bytes, and
                       has an alpha internal floating point processor.
  RAM (top)          : 6839M used, 31G total
  uname              : OSF1 V5.1 2650 alpha
  gcc                : gcc (GCC) 3.3
  bzip2              : Version 1.0.5, 10-Dec-2007.
  lbunzip2 workers   : 8

"athlon":
  CPU (/proc/cpuinfo): AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
  RAM (free)         : 3979528K free, 4150504K total
  distribution       : Debian GNU/Linux 4.0 (Etch)
  uname              : Linux 2.6.18-6-686-bigmem #1 SMP i686 GNU/Linux
  gcc                : gcc (GCC) 4.1.2 20061115 (prerelease)
                       (Debian 4.1.1-21)
  libc               : glibc-2.3.6.ds1-13etch7
  bzip2              : Version 1.0.3, 15-Feb-2005.
  lbunzip2 workers   : 2

"itanium":
  CPU (/proc/cpuinfo): Itanium 2 1300MHz (dual core)
  RAM (free)         : 8158496K free, 8302288K total
  distribution       : Red Hat Enterprise Linux AS release 4
                       (Nahant Update 4)
  uname              : Linux 2.6.9-42.EL #1 SMP ia64 ia64 ia64 GNU/Linux
  gcc                : gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
  libc               : /lib/libc-2.3.4.so
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 2

"sparc":
  CPU (psrinfo -p -v): 4 x UltraSPARC-III+ (impl 0x15 ver 0xb0 clock
                       1050 MHz)
  RAM (top)          : 30G free, 32G total
  uname              : SunOS 5.10 Generic_120011-14 sun4u sparc
                       SUNW,Sun-Fire-480R
  gcc                : gcc (GCC) 3.4.6
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 4

"xeon":
  CPU (/proc/cpuinfo): Intel(R) Xeon(TM) CPU 2.80GHz (4 cores in total)
  RAM (free)         : 5404124K free, 5907484K total
  distribution       : Red Hat Enterprise Linux AS release 3
                       (Taroon Update 5)
  uname              : Linux 2.4.21-32.ELsmp #1 SMP i686 i686 i386 GNU/Linux
  gcc                : gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-52)
  libc               : /lib/libc-2.3.2.so
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 4


Test files
----------

glibc-2.7.tar.bz2:
  URL : http://ftp.gnu.org/pub/gnu/glibc/glibc-2.7.tar.bz2
  size: 15,976,860 bytes
  SHA1: ccc70e95db826e4b1fd3b484154402fdc3df88f7

linux-2.6.26.2.tar.bz2:
  URL : http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2
  size: 49,442,109 bytes
  SHA1: 755c1dbf6ad58fc56e9557d2c14e91820a7a7085

plan9.iso.bz2:
  URL : http://www.kix.in/plan9/plan9.iso.bz2
  size: 83,638,869 bytes
  SHA1: 5820d938673fc02cfb0ca7600bb678e22354a849


An older, initial test on "athlon"
----------------------------------

$ time for ((I=0; I<20; ++I)); do cat linux-2.6.26.2.tar.bz2; done \
  | bunzip2 >/dev/null

real    5m22.099s
user    5m20.220s
sys     0m3.200s

$ time for ((I=0; I<20; ++I)); do cat linux-2.6.26.2.tar.bz2; done \
  | lbunzip2 2 >/dev/null

lbunzip2: tried to get next muxb or exit signal: 10765
lbunzip2: had to wait for next muxb or exit signal: 4764
lbunzip2: tried to get free splitb: 11267
lbunzip2: had to wait for free splitb: 5267
lbunzip2: tried to get used splitb or eof: 6052
lbunzip2: had to wait for used splitb or eof: 50

real    3m38.190s
user    6m49.794s
sys     0m21.301s

Number of CPU's multiplied by: 2.00
Speed multiplied by:           1.48


Block size tests for lbunzip2-0.01-shift, on "athlon"
-----------------------------------------------------

Version 0.01 didn't have decompression throttling, thus it didn't have to
synchronize the multiplexer with the splitter. Additionally, the
synchronization between the splitter and the set of workers was faster (due
to coarser (still just apt) lock granularity and hence lower locking
overhead).

In this test, I decompressed linux-2.6.26.2.tar.bz2 with bunzip2,
recompressed it with bzip2 with input block sizes 1 through 9, and the
resulting 9 files were (1) chopped up with bzip2recover to see the output
(compressed) block sizes, (2) decompressed with bunzip2 to /dev/null, (3)
decompressed with "lbunzip2 2" to /dev/null.
A  B     C       D       E      F        G        H
1  2699   21818   48324  11.34  0:11.44  0:08.60  1.33
2  1350   40669   86496  11.45  0:11.51  0:08.46  1.36
3   900   58961  113623  12.17  0:12.22  0:09.06  1.34
4   675   76889  136768  13.09  0:13.17  0:09.61  1.37
5   540   94780  161900  14.12  0:14.16  0:10.12  1.39
6   450  112297  188496  14.68  0:14.77  0:10.43  1.41
7   386  129759  221740  15.17  0:15.24  0:10.42  1.46
8   338  146976  244932  15.71  0:15.75  0:10.67  1.47
9   300  164821  273095  16.06  0:16.11  0:10.89  1.47

A: bzip2 input block size switch
B: number of compressed blocks
C: average size of full bzip2 stream containing one compressed block
D: maximum size of full bzip2 stream containing one compressed block
E: decompression user time (seconds) with bunzip2
F: decompression wall clock time (minutes and seconds) with bunzip2
   (100% usage of one core)
G: decompression wall clock time (minutes and seconds) with lbunzip2
   (191% - 198% usage of one core)
H: decompression speed ratio, lbunzip2 to bunzip2, based on wall clock (F/G)

I conclude: choosing smaller input block sizes leads to smaller compressed
block sizes. When decompressing bzip2 streams that produce the same
uncompressed data, smaller (compressed) block sizes seem to benefit the
speed of lbunzip2 and official bunzip2 alike. The relative difference is
bigger for bunzip2.


Wall clock speed test for lbunzip2-0.02-shift
---------------------------------------------

Version 0.02 added decompression throttling, which increased the locking
overhead.
A pseudo-code description of this test follows:

  for each COMMAND in { bunzip2, lbunzip2 NUM_CORES }
    for each FILE in { glibc.bz2, linux.bz2, plan9.bz2 }
      for each TRY in 1..3
        decompress FILE with COMMAND:
          save time into time.COMMAND.FILE.TRY
          save output into output.COMMAND.FILE, overwriting previous try
  for each FILE in { glibc.bz2, linux.bz2, plan9.bz2 }
    compare the lbunzip2 output against the bunzip2 output

A        B  C       D       E       F      G       H      I     J
alpha    8  254.20  308.10   87.50  85.49  421.60  *      3.52  0.44
athlon   2  123.44  129.29   90.56  98.22  181.11   1.28  1.42  0.71
itanium  2  189.05  195.42  140.33  99.33  191.78  43.60  1.39  0.69
sparc    4  236.37  244.06  105.31  99.93  348.34  55.20  2.31  0.57
xeon     4  288.44  308.31  143.21  99.22  378.78   8.33  2.15  0.53

A: test configuration
B: number of worker threads (cores)
C: cumulative user time of all nine tests for bunzip2 (seconds)
D: cumulative wall clock time of all nine tests for bunzip2 (seconds)
E: cumulative wall clock time of all nine tests for lbunzip2 (seconds)
F: average CPU load for bunzip2 across all nine tests (max. 100%)
G: average CPU load for lbunzip2 across all nine tests
   (max. NUM_CORES * 100%)
H: lbunzip2: workers had to wait on splitter (percent of "pop" operations
   on [split.used]), across all nine tests
I: cumulative decompression speed ratio, lbunzip2 to bunzip2, based on
   wall clock (D/E)
J: speed ratio divided by number of cores (I/B)

*: For some reason, on "alpha" the statistics of lbunzip2 didn't appear on
stderr in the end. I strongly suspect that that libc doesn't like my
fprintf() format string. Still, lbunzip2 didn't crash; the exit status
was 0.

Conclusion: more cores seem to mean more speed (B, I), although I jump over
a few architectures here. Relatively, "athlon" gets the most out of a core
(J -> 0.71), although even that is far from linear speedup (1.00). This
indicator (J) degrades as the number of processors (B) increases, which
should mean we can't parallelize ("scale") over a limit.
On "itanium" and "sparc", the splitter can't feed compressed blocks to the
workers fast enough (H), although on "itanium", we almost fully exploit
both cores (G), so probably the splitter eats too much CPU. On "alpha",
lbunzip2 used just a little more than half of the cores (G); maybe the
server was under load from other processes (since the machine wasn't
dedicated to this test), or the process became IO-bound, or the splitter
fully exhausted one CPU, or... :)


Robustness (lbunzip2-0.03-shift)
================================

I downloaded "c10-archive-r1.iso" from
http://www.ee.oulu.fi/research/ouspg/protos/testing/c10/archive/. I
extracted the bzip2 test cases (321818 files), and ran
"lbunzip2 2 <TESTCASE >/dev/null" on each of them. I did this on "athlon".
Before starting the test, I limited the virtual memory available to
300,000 K with "ulimit -S -v 300000". Here are the results:

  125946  lbunzip2: BZ2_bzDecompress(): BZ_DATA_ERROR
   42245  lbunzip2: premature EOF on stdin
    8176  lbunzip2: the splitter misrecognized a bit-sequence as a block
          delimiter
    4697  lbunzip2: input block too short

Furthermore, lbunzip2 crashed on file 1203ea663ea8545c9b66ad3ef46425d0.bz2
with a segmentation fault, see below.

I picked the following examples, and checked how official bunzip2 would
handle them:

000001d8c11b84d738de284a8ba76226.bz2
  lbunzip2: BZ2_bzDecompress(): BZ_DATA_ERROR
  bunzip2 : Data integrity error when decompressing.

00000184e108021bf3b3f860f1c7618c.bz2
  lbunzip2: premature EOF on stdin
  bunzip2 : Compressed file ends unexpectedly;

0004f9c82f4c00833aba77f106b83060.bz2
  lbunzip2: the splitter misrecognized a bit-sequence as a block delimiter
  bunzip2 : (stdin) is not a bzip2 file.

000c9eeb2bf688d42bf447d264b4b401.bz2
  lbunzip2: input block too short
  bunzip2 : Data integrity error when decompressing.

1203ea663ea8545c9b66ad3ef46425d0.bz2
  lbunzip2: Segmentation fault
  bunzip2 : Caught a SIGSEGV or SIGBUS whilst decompressing.
The last error, where both lbunzip2 and bunzip2 crash, should be due to a
bug in libbz2, which was fixed in version 1.0.5. As you can see above, on
"athlon" lbunzip2 and official bunzip2 were linked against version 1.0.3.


Why only decompression?
=======================

There were already multiple implementations of parallel bzip2 compression.
In no particular order:

- http://compression.ca/pbzip2/
- http://bzip2smp.sourceforge.net/
- http://home.student.utwente.nl/n.werensteijn/smpbzip2/
- http://www.mediawiki.org/wiki/Dbzip2


Acknowledgements
================

I'd like to thank

- Julian Seward for writing bzip2,
- Michael Thomas from Caltech HEP for allowing me to test lbunzip2 on
  their Itanium 2 machines,
- Paul Sladen for his Wikipedia contributions on bzip2, eg.
  http://en.wikipedia.org/wiki/Bzip2#File_format.