Copyright © Philip M. Parker, INSEAD. Terms of Use.

(From Wikipedia, the free Encyclopedia)
The Burrows-Wheeler transform (BWT, also called block-sorting compression) is an algorithm used in data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler.
The BWT does not change any of the characters in a string; it only rearranges their order. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since a string that has runs of repeated characters tends to be easy to compress. For example, the string:
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
transforms into this string, which is easier to compress because it has many repeated characters:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
The transform is done by sorting all rotations of the text, then taking the last column. For example, the text ".BANANA." is transformed into "BNN.AA.A" through these steps:
| Input | All Rotations | Sort the Lines | Output |
| .BANANA. | .BANANA. | ANANA..B | BNN.AA.A |
| | ..BANANA | ANA..BAN | |
| | A..BANAN | A..BANAN | |
| | NA..BANA | BANANA.. | |
| | ANA..BAN | NANA..BA | |
| | NANA..BA | NA..BANA | |
| | ANANA..B | .BANANA. | |
| | BANANA.. | ..BANANA | |
In these two examples, the output would actually be more than just the string shown. It would be the string plus a pointer telling which character in the output string was the last character in the input string. However, there is an equivalent way to describe the algorithm that doesn't use pointers.
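The ".BANANA." table can be reproduced with a short Python sketch. It assumes the ordering that the sorted column implies, namely that "." compares greater than every letter; the helper names `dot_last` and `bwt_table` are illustrative and not part of the original article.

```python
def dot_last(rotation):
    # Assumed ordering, inferred from the table: "." sorts after any letter.
    # Each character becomes a (is_dot, char) pair, so dots compare last.
    return [(ch == ".", ch) for ch in rotation]


def bwt_table(text):
    """Return (sorted rotations, last column) for text."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rows = sorted(rotations, key=dot_last)
    return rows, "".join(row[-1] for row in rows)


rows, output = bwt_table(".BANANA.")
for row in rows:
    print(row)
print(output)
```

With plain ASCII ordering, where "." sorts before the letters, the rows would come out in a different order and the last column would differ from the one in the table.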
The following pseudocode gives a simple, but inefficient, way to calculate the BWT and its inverse. It assumes that there is a special character 'EOF' which is the last character of the text, occurs nowhere else in the text, and is ignored during sorting. Given this assumption, the transformed text will be the same length as the original text, and there is no need to worry about pointers.
function BWT (string s)
   create a list of all possible rotations of s
   let each rotation be one row in a large, square table
   sort the table alphabetically (treat each row as a string)
   return the last (rightmost) column of the table

function inverse_BWT (string s)
   create an empty table with no rows or columns
   repeat length(s) times:
      insert s as a new column down the left side of the table
      sort the table alphabetically (treat each row as a string)
   return the row that ends with the 'EOF' character

The remarkable thing about the BWT is not that it generates a more easily coded output (any number of trivial operations would do that) but that it is reversible, allowing the original document to be regenerated from the last column data.
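The pseudocode translates almost line for line into Python. The sketch below is a minimal version under one assumption that differs slightly from the text: rather than being ignored during sorting, the 'EOF' marker is represented by the NUL character `"\0"`, which simply sorts before every other character. Any consistent ordering works, as long as the forward and inverse transforms use the same one.

```python
EOF = "\0"  # assumed stand-in for the 'EOF' character; must not occur in the input


def bwt(s):
    """Naive BWT: sort all rotations of s + EOF and return the last column."""
    if EOF in s:
        raise ValueError("input must not contain the EOF marker")
    s += EOF
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in table)


def inverse_bwt(r):
    """Naive inverse: rebuild the sorted table one column at a time."""
    table = [""] * len(r)
    for _ in range(len(r)):
        # Insert r as a new leftmost column, then sort the rows again.
        table = sorted(ch + row for ch, row in zip(r, table))
    # The original text is the row that ends with the EOF marker.
    original = next(row for row in table if row.endswith(EOF))
    return original[:-1]


transformed = bwt("BANANA")
restored = inverse_bwt(transformed)
```

Both functions sort whole strings repeatedly, so they are only suitable for short inputs.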
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of characters in the document. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text.
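To watch this column-by-column reconstruction happen, the following sketch records the partial table after every round. It uses "$" as a visible stand-in for the end-of-file character (an assumption made for readability; "$" sorts before the letters in ASCII), and `inverse_steps` is an illustrative name.

```python
def inverse_steps(last_column):
    """Return the partially rebuilt table after each round of the inverse."""
    table = [""] * len(last_column)
    steps = []
    for _ in range(len(last_column)):
        # Prepend the last column to the current rows, then sort.
        table = sorted(ch + row for ch, row in zip(last_column, table))
        steps.append(table)
    return steps


# "ANNB$AA" is the last column for the text "BANANA$".
steps = inverse_steps("ANNB$AA")
for width, table in enumerate(steps, start=1):
    print(width, table)
```

The first recorded table is just the sorted characters (the first column), the second holds the sorted pairs, and the last one contains every rotation, including the row that ends with "$".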
A number of optimizations can make these algorithms run more efficiently without changing the output. In BWT, there is no need to actually store the table. Each row of the table can be represented by a single pointer into the string s. In inverse_BWT there is no need to store the table or to do the multiple sorts. It is sufficient to sort s once with a stable sort, and remember where each character moved. This gives a single-cycle permutation, whose cycle is the output. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
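One possible reading of these optimizations is sketched below. The forward transform sorts integer offsets into the string instead of building the table, and the inverse performs a single stable sort and then follows the resulting permutation. The function names and the `"\0"` end marker are assumptions for illustration. The comparison function is still slow in the worst case; production implementations use much faster suffix-sorting methods.

```python
from functools import cmp_to_key

EOF = "\0"  # assumed end marker; must not occur in the input


def bwt_by_offsets(s):
    """BWT where each table row is represented only by its start offset."""
    text = s + EOF
    n = len(text)

    def compare(i, j):
        # Compare the rotations starting at i and j character by character.
        for k in range(n):
            a, b = text[(i + k) % n], text[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    offsets = sorted(range(n), key=cmp_to_key(compare))
    # The last character of the rotation starting at i is text[i - 1].
    return "".join(text[i - 1] for i in offsets)


def inverse_by_permutation(last_column):
    """Inverse BWT using one stable sort and a walk along the permutation."""
    # Python's sort is stable, so equal characters keep their relative order.
    order = sorted(range(len(last_column)), key=last_column.__getitem__)
    j = last_column.index(EOF)  # row holding the original text
    out = []
    for _ in range(len(last_column) - 1):  # stop before emitting EOF
        j = order[j]
        out.append(last_column[j])
    return "".join(out)


encoded = bwt_by_offsets("BANANA")
decoded = inverse_by_permutation(encoded)
```

Here `order[j]` records where the j-th character of the sorted column came from in the last column, which is the "remember where each character moved" step described above.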
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
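A minimal sketch of the pointer variant follows. It assumes the pointer is the index of the sorted row that equals the original string, which is also the position in the output of the input's last character. Other conventions are possible, and the function names are illustrative.

```python
def bwt_with_pointer(s):
    """Return (last column, pointer) without using an EOF character."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    # Pointer: index of the row that is the original string (0 for empty input).
    pointer = rows.index(s) if s else 0
    return "".join(row[-1] for row in rows), pointer


def inverse_with_pointer(last_column, pointer):
    """Rebuild the sorted table and pick the row the pointer refers to."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    return table[pointer] if table else ""


column, pointer = bwt_with_pointer("BANANA")
original = inverse_with_pointer(column, pointer)
```

The output pair is slightly larger than the input, as the text notes, and the inverse returns a plain string of the original length.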
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.
A sample implementation (in C, with English comments) is available in the article on the Polish Wikipedia.
Reference:
- M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
External links:
- ResearchIndex page for BWT paper
- BWT paper hosted at DEC
- Article by Mark Nelson on the BWT
Source: adapted by the editor from Wikipedia, the free encyclopedia under a copyleft GNU Free Documentation License (GFDL) from the article "Burrows-Wheeler transform."
| Country | Name |
| Austria | BWT AG |
Source: compiled by the editor from Icon Group International, Inc.
Scrabble® Enable2K-Verified Anagrams
Words containing the letters "b-t-w"
+2 letters: bawty.
+3 letters: abwatt, bawtie, bestow, bewept, bowpot, twibil, wombat.
+4 letters: abwatts, batfowl, batwing, bawsunt, bawties, beltway, bestows, bestrew, bestrow, between, betwixt, bewitch, bewrapt, blowout, bowknot, bowpots, bowshot, brawest, catawba, howbeit, mistbow, outbawl, ribwort, rowboat, stewbum, teabowl, towboat, twibill, twibils, washtub, webfeet, webfoot, website, webster, wetback, wombats.
+5 letters: batfowls, bawdiest, bedstraw, bellwort, beltways, bentwood, bestowal, bestowed, bestrewn, bestrews, bestrown, bestrows, bitewing, blowiest, blowouts, blowtube, bobwhite, boomtown, bowfront, bowknots, bowshots, bowsprit, browbeat, brownest, brownout, catawbas, downbeat, drawtube, mistbows, nutbrown, outbawls, outbrawl, ribworts, rowboats, showboat, snowbelt, stewbums, stowable, sweatbox, teabowls, towboats, tubework, twibills, twinborn, washtubs, wastable, waterbed, webbiest, websites, websters, wetbacks, wettable, wombiest, workboat, writable.
Source: compiled by the editor from various references; see credits. SCRABBLE® is a registered trademark. All intellectual property rights in and to the game are owned in the U.S.A and Canada by Hasbro Inc., and throughout the rest of the world by J.W. Spear & Sons Limited of Maidenhead, Berkshire, England, a subsidiary of Mattel Inc. Mattel and Spear are not affiliated with Hasbro.
| Hexadecimal (or equivalents, 770 AD-1900s) | 42 57 54 |
| Leonardo da Vinci (1452-1519; backwards) | (image not reproduced) |
| American Sign Language (origins from 1620-1817 in Italy and, especially, France) | (image not reproduced) |
| Semaphore (1791, in France) | (image not reproduced) |
| Braille (1829, in France) | (image not reproduced) |
| Morse Code (1836) | -... .-- - |
| Dancing Men (Sir Arthur Conan Doyle, 1903) | (image not reproduced) |
| Binary Code (1918-1938, probably earlier) | 01000010 01010111 01010100 |
| HTML Code (1990) | B W T |
| ISO 10646 (1991-1993) | 0042 0057 0054 |
| British Sign Language (Fingerspelling, BSL; 1992, British Deaf Association Dictionary of British Sign Language) | (image not reproduced) |
| Encryption (beginner's substitution cypher) | 365754 |