$Id: README,v 1.1.2.18 2008-08-19 17:50:07 lacos Exp $


License
=======

lbunzip2 Copyright (C) 2008 Laszlo Ersek. Released under the GNU GPL v2 or
later. Contact: lacos@elte.hu


Design (since lbunzip2-0.02-shift)
==================================

  1 splitter --[splitq.used]--> n workers --[muxq]--> 1 multiplexer
      ^                                                     |
      |                                                     |
      +----------[splitq.free]------------------------------+

There is one splitter splitting the input into blocks. The blocks are
delimited in the bzip2 stream by block headers and (at the end of one stream
-- multiple streams can be concatenated) an end-of-stream marker. These are
special sequences of 48 bits each that can delimit the blocks at virtually
any bit position. They are also not guaranteed not to occur in the Huffman
bit stream (ie. inside one block), but that should be very unlikely.

The splitter puts the compressed blocks to [splitq.used]. In the beginning,
a fixed number of slots are allocated (FAC_SPLITQ for each worker thread,
see lbunzip2.h). The splitter takes one slot from [splitq.free], fills it,
then puts it to [splitq.used]. If [splitq.free] is empty, the splitter
blocks. A slot is called "splitb" in the source (for "splitter block").
Each splitb has a serial number, which denotes the compressed block's
position in the input stream. Both [splitq.used] and [splitq.free] are
regular FIFO's.

Each worker takes one filled slot from [splitq.used], and puts uncompressed
sub-blocks on [muxq]. An uncompressed sub-block is called "muxb" in the
source (for "muxer block"). Sub-blocks exist because one compressed block
(splitb) can grow very big when uncompressed, and we don't want to delay
sending the first bytes of the decompressed block to the muxer until the
last bytes are decompressed as well. Thus the decompression of one splitb
can result in multiple muxb's, each muxb carrying at most MX_BUNZIP2 bytes
of decompressed data.
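The delimiter scan described above can be sketched as follows, a minimal
illustration assuming the standard bzip2 magic values (block header
0x314159265359, end-of-stream marker 0x177245385090); the function and
variable names are made up for this sketch and are not the actual lbunzip2
code:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_HDR_MAGIC 0x314159265359ULL /* bzip2 block header ("pi") */
#define EOS_MAGIC       0x177245385090ULL /* end-of-stream ("sqrt(pi)") */

/* Count 48-bit delimiters in "len" bytes by shifting the input, one bit
   at a time, through a 48-bit window and comparing against both magics.
   This mirrors the idea of the splitter's scan, not its actual code. */
unsigned count_delimiters(const unsigned char *buf, size_t len)
{
    uint64_t window = 0;
    unsigned hits = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--) {
            window = ((window << 1) | ((buf[i] >> b) & 1u))
                     & 0xFFFFFFFFFFFFULL; /* keep the low 48 bits */
            if (window == BLOCK_HDR_MAGIC || window == EOS_MAGIC)
                hits++;
        }
    return hits;
}
```

Since the scan checks the window at every bit position, it finds delimiters
at arbitrary bit offsets, which is exactly why a false positive inside the
Huffman bit stream is possible in principle.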
Each muxb has one serial and one sub-serial number, the serial identifying
the splitb out of which it was decompressed, and the sub-serial identifying
the muxb's position within the sequence of all muxb's for this splitb. The
last muxb for the splitb carries the splitb itself along to the muxer.

[muxq] is a red-black tree (a heap would have been better, perhaps). It has
to support quick random insertion (because workers can decompress their
respective splitb's in any order), and quick minimum selection and deletion
(because the muxer can only write out the next decompressed sub-block in
stream order). [muxq] reorders and reassembles the muxb's so that muxb's
uncompressed from the same splitb appear in order, and so that all muxb's
for a splitb that was read earlier are written to stdout earlier than all
muxb's for a splitb that was read later. This is a lexicographical ordering
on (serial, sub-serial) pairs.

When the multiplexer reaches the last sub-block for one splitb, it takes
the splitb carried with that muxb, and puts it back to [splitq.free]. This
way, if the output stream is blocked, the muxer will hang on the write()
system call, and won't be able to return slots to the splitter. After a
while, the splitter will block. All the workers finish decompressing their
current compressed blocks, exhaust [splitq.used] (containing compressed
blocks) and block too. The muxer will still be able to resume its task
after write() returns, since the muxb's released until that point form a
complete sequence together.

This approach throttles the decompression when the multiplexer blocks on
write() without causing a deadlock: the blocking of the muxer doesn't
directly block any random workers (as it would with a fixed size queue
between the workers and the muxer), and the situation cannot arise where
the workers wait for the muxer to make room in [muxq], while the muxer
waits for the workers to emit the next dumpable muxb.
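The lexicographical ordering on (serial, sub-serial) pairs can be sketched
as a comparator suitable for the red-black tree; the struct and field names
below are illustrative, not the actual lbunzip2 identifiers:

```c
#include <stdint.h>

/* Illustrative muxb ordering: earlier splitb's always come first, and
   within one splitb, muxb's come in decompression order. */
struct muxb {
    uint64_t serial;     /* which splitb this muxb was decompressed from */
    uint64_t sub_serial; /* position within that splitb's muxb sequence */
    /* ... decompressed payload would follow here ... */
};

/* Return <0, 0, or >0 for a < b, a == b, a > b in stream order. */
int muxb_cmp(const struct muxb *a, const struct muxb *b)
{
    if (a->serial != b->serial)
        return a->serial < b->serial ? -1 : 1;
    if (a->sub_serial != b->sub_serial)
        return a->sub_serial < b->sub_serial ? -1 : 1;
    return 0;
}
```

With this ordering, the tree minimum is always the next muxb the muxer may
write, once all smaller (serial, sub-serial) pairs have been dumped.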
Instead, the blocking of the muxer makes the splitter drain [splitq.free]
and then block. The splitter works sequentially, thus after it blocks and
the workers drain [splitq.used], [muxq] will contain the muxb the muxer
will have to dump once it gets past write().

The memory usage is at its maximum when all slots initially allocated are
attached to muxb's in [muxq], and the muxer is blocked. If the number of
workers is num_workers, then the number of slots is num_workers *
FAC_SPLITQ. Each splitb holds at most MX_BZIP2 bytes, and all their
decompressed counterparts are also held in memory (in muxb's). Let's define
K to be the maximum decompressed:compressed ratio ever possible for bzip2.
(I don't know its value, but http://www.paul.sladen.org/projects/pyflate/
suggests you can make 45,899,235 bytes out of 40 via bunzip2, rendering K
more than 1,147,480.) Then the maximum memory usage is roughly

  num_workers * FAC_SPLITQ * (MX_BZIP2 + K * MX_BZIP2)
    = num_workers * FAC_SPLITQ * MX_BZIP2 * (1 + K)

which is huge, but fairly improbable, and still finite.


Bugs
====

Probably.

The splitter doesn't check the original combined CRC.

The splitter doesn't really parse the bzip2 blocks, just looks for the
block headers and end-of-stream markers. It never stops searching for the
first block header in the first bzip2 stream or after an end-of-stream
marker. (Try it with /dev/zero.)

Presently, I'm not certain whether lbunzip2 universally supports bzip2
streams created with "bzip2 -1" to "-8", although lbunzip2 successfully
decompressed all such streams I used for testing. The question arises
because in the recreated one-block bzip2 streams I pass to libbz2 I always
indicate an input (uncompressed) blocksize switch of "-9".

Please report bugs to <lacos@elte.hu>. Thank you.
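The memory bound above can be sketched numerically. The helper below takes
all quantities as parameters, since FAC_SPLITQ and MX_BZIP2 live in
lbunzip2.h and their actual values are not assumed here:

```c
/* Rough worst-case memory bound from the formula above:
   num_workers * FAC_SPLITQ * MX_BZIP2 * (1 + K).
   All inputs are caller-supplied; nothing here reflects the real
   lbunzip2.h constants. */
double worst_case_bytes(double num_workers, double fac_splitq,
                        double mx_bzip2, double k)
{
    return num_workers * fac_splitq * mx_bzip2 * (1.0 + k);
}
```

For example, with placeholder values of 2 workers, FAC_SPLITQ = 2,
MX_BZIP2 = 1 MB, and K taken from the pyflate example (45,899,235 / 40,
i.e. more than 1,147,480), the bound comes out around 4.8e12 bytes: huge,
as the text says, but finite.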
Performance
===========

Test configurations
-------------------

"alpha":
  CPU (psrinfo -v)   : 8 x The alpha EV7 (21364) processor operates at
                       1300 MHz, has a cache size of 1835008 bytes, and
                       has an alpha internal floating point processor.
  RAM (top)          : 6839M used, 31G total
  uname              : OSF1 V5.1 2650 alpha
  gcc                : gcc (GCC) 3.3
  bzip2              : Version 1.0.5, 10-Dec-2007.
  lbunzip2 workers   : 8

"athlon":
  CPU (/proc/cpuinfo): AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
  RAM (free)         : 3979528K free, 4150504K total
  distribution       : Debian GNU/Linux 4.0 (Etch)
  uname              : Linux 2.6.18-6-686-bigmem #1 SMP i686 GNU/Linux
  gcc                : gcc (GCC) 4.1.2 20061115 (prerelease)
                       (Debian 4.1.1-21)
  libc               : glibc-2.3.6.ds1-13etch7
  bzip2              : Version 1.0.3, 15-Feb-2005.
  lbunzip2 workers   : 2

"itanium":
  CPU (/proc/cpuinfo): Itanium 2 1300MHz (dual core)
  RAM (free)         : 8158496K free, 8302288K total
  distribution       : Red Hat Enterprise Linux AS release 4
                       (Nahant Update 4)
  uname              : Linux 2.6.9-42.EL #1 SMP ia64 ia64 ia64 GNU/Linux
  gcc                : gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
  libc               : /lib/libc-2.3.4.so
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 2

"sparc":
  CPU (psrinfo -p -v): 4 x UltraSPARC-III+ (impl 0x15 ver 0xb0 clock
                       1050 MHz)
  RAM (top)          : 30G free, 32G total
  uname              : SunOS 5.10 Generic_120011-14 sun4u sparc
                       SUNW,Sun-Fire-480R
  gcc                : gcc (GCC) 3.4.6
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 4

"xeon":
  CPU (/proc/cpuinfo): Intel(R) Xeon(TM) CPU 2.80GHz (4 cores in total)
  RAM (free)         : 5404124K free, 5907484K total
  distribution       : Red Hat Enterprise Linux AS release 3
                       (Taroon Update 5)
  uname              : Linux 2.4.21-32.ELsmp #1 SMP i686 i686 i386 GNU/Linux
  gcc                : gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-52)
  libc               : /lib/libc-2.3.2.so
  bzip2              : Version 1.0.2, 30-Dec-2001.
  lbunzip2 workers   : 4


Test files
----------

glibc-2.7.tar.bz2:
  URL : http://ftp.gnu.org/pub/gnu/glibc/glibc-2.7.tar.bz2
  size: 15,976,860 bytes
  SHA1: ccc70e95db826e4b1fd3b484154402fdc3df88f7

linux-2.6.26.2.tar.bz2:
  URL : http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2
  size: 49,442,109 bytes
  SHA1: 755c1dbf6ad58fc56e9557d2c14e91820a7a7085

plan9.iso.bz2:
  URL : http://www.kix.in/plan9/plan9.iso.bz2
  size: 83,638,869 bytes
  SHA1: 5820d938673fc02cfb0ca7600bb678e22354a849


An older, initial test on "athlon"
----------------------------------

$ time for ((I=0; I<20; ++I)); do cat linux-2.6.26.2.tar.bz2; done \
  | bunzip2 >/dev/null

real    5m22.099s
user    5m20.220s
sys     0m3.200s

$ time for ((I=0; I<20; ++I)); do cat linux-2.6.26.2.tar.bz2; done \
  | lbunzip2 2 >/dev/null

lbunzip2: tried to get next muxb or exit signal: 10765
lbunzip2: had to wait for next muxb or exit signal: 4764
lbunzip2: tried to get free splitb: 11267
lbunzip2: had to wait for free splitb: 5267
lbunzip2: tried to get used splitb or eof: 6052
lbunzip2: had to wait for used splitb or eof: 50

real    3m38.190s
user    6m49.794s
sys     0m21.301s

Number of CPU's multiplied by: 2.00
Speed multiplied by:           1.48


Block size tests for lbunzip2-0.01-shift, on "athlon"
-----------------------------------------------------

Version 0.01 didn't have decompression throttling, thus it didn't have to
synchronize the multiplexer with the splitter. Additionally, the
synchronization between the splitter and the set of workers was faster (due
to coarser (still just apt) lock granularity and hence lower locking
overhead).

In this test, I decompressed linux-2.6.26.2.tar.bz2 with bunzip2,
recompressed it with bzip2 with input block sizes 1 through 9, and the
resulting 9 files were (1) chopped up with bzip2recover to see the output
(compressed) block sizes, (2) decompressed with bunzip2 to /dev/null, (3)
decompressed with "lbunzip2 2" to /dev/null.
A  B     C       D       E      F        G        H
1  2699   21818   48324  11.34  0:11.44  0:08.60  1.33
2  1350   40669   86496  11.45  0:11.51  0:08.46  1.36
3   900   58961  113623  12.17  0:12.22  0:09.06  1.34
4   675   76889  136768  13.09  0:13.17  0:09.61  1.37
5   540   94780  161900  14.12  0:14.16  0:10.12  1.39
6   450  112297  188496  14.68  0:14.77  0:10.43  1.41
7   386  129759  221740  15.17  0:15.24  0:10.42  1.46
8   338  146976  244932  15.71  0:15.75  0:10.67  1.47
9   300  164821  273095  16.06  0:16.11  0:10.89  1.47

A: bzip2 input block size switch
B: number of compressed blocks
C: average size of full bzip2 stream containing one compressed block
D: maximum size of full bzip2 stream containing one compressed block
E: decompression user time (seconds) with bunzip2
F: decompression wall clock time (minutes and seconds) with bunzip2
   (100% usage of one core)
G: decompression wall clock time (minutes and seconds) with lbunzip2
   (191% - 198% usage of one core)
H: decompression speed ratio, lbunzip2 to bunzip2, based on wall clock (F/G)

I conclude: choosing smaller input block sizes leads to smaller compressed
block sizes. When decompressing bzip2 streams that produce the same
uncompressed data, smaller (compressed) block sizes seem to benefit the
speed of lbunzip2 and official bunzip2 alike. The relative difference is
bigger for bunzip2.


Wall clock speed test for lbunzip2-0.02-shift
---------------------------------------------

Version 0.02 added decompression throttling, which increased the locking
overhead.
A pseudo-code description of this test follows:

  for each COMMAND in { bunzip2, lbunzip2 NUM_CORES }
    for each FILE in { glibc.bz2, linux.bz2, plan9.bz2 }
      for each TRY in 1..3
        decompress FILE with COMMAND:
          save time into time.COMMAND.FILE.TRY
          save output into output.COMMAND.FILE, overwriting previous try
  for each FILE in { glibc.bz2, linux.bz2, plan9.bz2 }
    compare the lbunzip2 output against the bunzip2 output

A        B  C       D       E       F      G       H      I     J
alpha    8  254.20  308.10   87.50  85.49  421.60  *      3.52  0.44
athlon   2  123.44  129.29   90.56  98.22  181.11   1.28  1.42  0.71
itanium  2  189.05  195.42  140.33  99.33  191.78  43.60  1.39  0.69
sparc    4  236.37  244.06  105.31  99.93  348.34  55.20  2.31  0.57
xeon     4  288.44  308.31  143.21  99.22  378.78   8.33  2.15  0.53

A: test configuration
B: number of worker threads (cores)
C: cumulative user time of all nine tests for bunzip2 (seconds)
D: cumulative wall clock time of all nine tests for bunzip2 (seconds)
E: cumulative wall clock time of all nine tests for lbunzip2 (seconds)
F: average CPU load for bunzip2 across all nine tests (max. 100%)
G: average CPU load for lbunzip2 across all nine tests
   (max. NUM_CORES * 100%)
H: lbunzip2: workers had to wait on splitter (percent of "pop" operations
   on [split.used]), across all nine tests
I: cumulative decompression speed ratio, lbunzip2 to bunzip2, based on
   wall clock (D/E)
J: speed ratio divided by number of cores (I/B)

*: For some reason, on "alpha" the statistics of lbunzip2 didn't appear on
stderr in the end. I strongly suspect that that libc doesn't like my
fprintf() format string. Still, lbunzip2 didn't crash; the exit status
was 0.

Conclusion: more cores seem to mean more speed (B, I), although I jump over
a few architectures here. Relatively, "athlon" gets the most out of a core
(J -> 0.71), although even that is far from linear speedup (1.00). This
indicator (J) degrades as the number of processors (B) increases, which
should mean we can't parallelize ("scale") over a limit.
On "itanium" and "sparc", the splitter can't feed compressed blocks to the
workers fast enough (H), although on "itanium", we almost fully exploit
both cores (G), so probably the splitter eats too much CPU. On "alpha",
lbunzip2 used just a little more than half of the cores (G); maybe the
server was under load from other processes (since the machine wasn't
dedicated to this test), or the process became IO-bound, or the splitter
fully exhausted one CPU, or... :)


Robustness (lbunzip2-0.03-shift)
================================

I downloaded "c10-archive-r1.iso" from
http://www.ee.oulu.fi/research/ouspg/protos/testing/c10/archive/. I
extracted the bzip2 test cases (321818 files), and ran
"lbunzip2 2 <TESTCASE >/dev/null" on each of them. I did this on "athlon".
Before starting the test, I limited the virtual memory available to
300,000 K with "ulimit -S -v 300000". Here are the results:

  125946  lbunzip2: BZ2_bzDecompress(): BZ_DATA_ERROR
   42245  lbunzip2: premature EOF on stdin
    8176  lbunzip2: the splitter misrecognized a bit-sequence as a block
          delimiter
    4697  lbunzip2: input block too short

Furthermore, lbunzip2 crashed on file 1203ea663ea8545c9b66ad3ef46425d0.bz2
with a segmentation fault, see below.

I picked the following examples, and checked how official bunzip2 would
handle them:

000001d8c11b84d738de284a8ba76226.bz2
  lbunzip2: BZ2_bzDecompress(): BZ_DATA_ERROR
  bunzip2 : Data integrity error when decompressing.

00000184e108021bf3b3f860f1c7618c.bz2
  lbunzip2: premature EOF on stdin
  bunzip2 : Compressed file ends unexpectedly;

0004f9c82f4c00833aba77f106b83060.bz2
  lbunzip2: the splitter misrecognized a bit-sequence as a block delimiter
  bunzip2 : (stdin) is not a bzip2 file.

000c9eeb2bf688d42bf447d264b4b401.bz2
  lbunzip2: input block too short
  bunzip2 : Data integrity error when decompressing.

1203ea663ea8545c9b66ad3ef46425d0.bz2
  lbunzip2: Segmentation fault
  bunzip2 : Caught a SIGSEGV or SIGBUS whilst decompressing.
The last error, where both lbunzip2 and bunzip2 crash, should be due to a
bug in libbz2, which was fixed in version 1.0.5. As you can see above, on
"athlon" lbunzip2 and official bunzip2 were linked against version 1.0.3.


Why only decompression?
=======================

There were already multiple implementations of parallel bzip2 compression.
In no particular order:

- http://compression.ca/pbzip2/
- http://bzip2smp.sourceforge.net/
- http://home.student.utwente.nl/n.werensteijn/smpbzip2/
- http://www.mediawiki.org/wiki/Dbzip2


Acknowledgements
================

I'd like to thank

- Julian Seward for writing bzip2,
- Michael Thomas from Caltech HEP for allowing me to test lbunzip2 on
  their Itanium 2 machines,
- Paul Sladen for his Wikipedia contributions on bzip2, eg.
  http://en.wikipedia.org/wiki/Bzip2#File_format.