Computerphile
Күн бұрын
47,865
1

How Bzip2 Works (Burrows Wheeler Transform) - Computerphile

At the heart of Bzip2 is the Burrows Wheeler Transform. Dr Steve Bagley (and a live studio audience) explains how & why it works.
#compression #ComputerScience
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Пікірлер: 128

@aksela69128 ай бұрын
I did not follow how you can regenerate the original text from the BWT. A step-by-step explanation of the decompression procedure might be helpful
@0LoneTech
8 ай бұрын
Same here. The later part did demonstrate how the last and first columns each form a digram from the original text, but I don't see how that helps us distinguish AB from BA, or EEGEEHEEBEE from EEBEEHEEGEE. I suppose a stable sort would recover some of that; then common suffix bigrams are listed in order of occurrence.
@jbucata
8 ай бұрын
I think I sort of get it now... The last column's character is followed by the character in the first column because of the rotation pattern. So you've got a list of "foo is followed by bar". With some shenanigans you can follow that chain of steps around to recreate the whole original.
@billli11
8 ай бұрын
I think you can keep adding sort character back to regenerate the matrix. for example abcd. you have abcd bcda cdab dabc the last column d a b c sort it and you get a b c d and beacuse the orig is sorted. now you know a??d b??a c??b d??c last character is before first character, so you get da ab bc cd sort it and now you know ab?d bc?a cd?b da?c and beacuse last character is before first character again. you can keep doing it until you get all that back.
@sebastianelytron8450
8 ай бұрын
@@billli11 I feel like your example only works because it's literally "abcd" that was probably not the best choice
@billli11
8 ай бұрын
@@sebastianelytron8450 That was a simplify version base on the video example. checking wikipedia seem like BWT use 2 marker to mark the original string starting and ending point before doing the transform. or mark the orig with a index number.
@rudiklein8 ай бұрын
I worked for DEC when Altavista came out. At one time, they put a note on the screen that they were now running on 12Gb of memory. We were flabbergasted. This amount of memory was mind boggling at that time.
@kayakMike1000
8 ай бұрын
How? The PDP-11 was a 16 but machine...
@kayakMike1000
8 ай бұрын
They also had vax with 32bits... but the address space of 32 bits is 4GB
@jacobdegeling
7 ай бұрын
That must have been a whole floor of the office building devoted to memory 😅
@RobBCactive
7 ай бұрын
@@kayakMike1000the Virtual Address Xtension had 32bit addresses but you can have physical address registers of say 1K pages which use 10 bits each. So a physical memory address can be longer than a virtual memory one. That was already done on PDP in a less general way where a program could be linked to have seperate executable and data segments, with many 16bit address spaces stored in say a RAM approaching 1MB, the program would be swapped in/out as a single unit rather than virtual pages when RAM was full.
@trevinbeattie48888 ай бұрын
I feel like I only got half of the story - the compression part. I don’t see how we transform the scrambled characters back to the original text in decompression.
@misterhat5823
8 ай бұрын
Agreed. I didn't follow it at all.
@fly1ngsh33p7
8 ай бұрын
@@misterhat5823 @trevinbeattie4888 I found the Explanation chapter on wikipedia quite alright to try yourself. If you got the principle, it's quite simple, really. I'll write my understanding of it here: You write your transformed text as a column. Now write a second column next to it: the sorted first column. Now write a whole new set of columns containing the sorted _pairs_ of your first set of columns. Do you notice something? -> the first column of your new set matches the second column of the old set! Mash them together so that you omit the duplicates (and effectively only add the new letter). Repeat what you did earlier: sort your new 3-columns-set and add the last characters.
@pr0ntab
8 ай бұрын
Okay, so imagine the original block with all the rotated copies of the string to compress. The top line is the original text. Sort this bad boy. Now the first column is the original text, but in sorted order. Some line in the middle of the block is now the original text (remember which line number!). The last column is therefore the letters that come before each letter in the sorted version of the text. Now if we erased all of the columns in the sorted table except for the last one, what can we do? Let's begin by taking that last column and sorting it. Clearly this must be the first column again. Now, imagine you have the first and last columns. Let's rotate RIGHT the entire table once. If we sort it again, we will have the original sorted first column again, and the second column will be the ORIGINAL second column of the grid we started with that we erased. Why? Well when you first sorted the square, there was only one way to sort the square for that first and last column to be shown in the order you find them. Also consider if you rotate all the rows left or right by some amount of columns, all you're really doing is rearranging the rows in the square (think back to the very first square before sorting, rotating left or right is like rotating up and down pacman style) and sorting will put them back the way they should be. So with two columns, fill in the last column again, rotate right, sort, and now we have three columns. Keep doing this over and over and eventually all the columns will be filled and some row contains the original text (you have to remember which one and save it in the compression the first time you sort the block) This only works if the sorting algorithm you use is stable otherwise the rearranging of the rows based on the first few columns could move them around in an unpredictable way. Try it by hand with a small grid, but with no letters repeated. If there are repeated letters then you discover you need a sorting algorithm that makes decisions exactly the same way with repeated letters each time or it will become hopeless.
@pr0ntab
8 ай бұрын
Hand written example, the word "STORK" make rotation table >> STORK ** KSTOR RKSTO ORKST TORKS Sort the rotation table >> KSTOR ORKST RKSTO STORK ** TORKS Remember which row after sorting is the original text (starred row is now #4) Only keep the number and the last column >> #4, RTOKS To undo ????R ????T ????O ????K ????S rotate right once R???? T???? O???? K???? S???? Sort K???? O???? R???? S???? T???? add back in last column K???R O???T R???O S???K T???S Rotate right again RK??? TO??? OR??? KS??? ST??? Sort again KS??? OR??? RK??? ST??? TO??? Add last column KS??R OR??T RK??O ST??K TO??S Rotate RKS?? TOR?? ORK?? KST?? STO?? sort, add last column again KST?R ORK?T RKS?O STO?K TOR?S Rotate RKST? TORK? ORKS? KSTO? STOR? sort, add last column KSTOR ORKST RKSTO STORK TORKS pick the row # we remembered (4) STORK
@blackmagicprod7039
8 ай бұрын
@@pr0ntab This made me understand, thanks
@adamcetinkent8 ай бұрын
I have a MSc in Computer Science, and I could not follow this. How do we get back to the original text from the compressed form? That feels like it should be explained more than handwaved?
@asapshazzy8 ай бұрын
Always a joy to see a Dr Bagley video
@andrewdunbar8288 ай бұрын
Bzip2 has been my favourite since it first came out. But I'd be interested to learn more about some of the newer compression methods that have come out since that get higher compression ratios too.
@HebaruSan8 ай бұрын
Trying to fill in the missing parts of the video... You store only the last column, using run-length encoding, which will be shorter than the original string because of patterns of which characters tend to follow which in natural language (wonder how well it works for images or other non-text data). To decompress, you first run-length-decode the full contents of the last column, then sort it to regenerate the contents of the first column. Together, the two columns give you a mapping from ... I'm not sure, because the ordering of both columns is kind of scrambled by the sorting. I guess I agree with the consensus that there isn't sufficient info given here to reconstruct what the speaker is talking about (ironically!). More specific constructive criticisms: Pick one plain text and stick with it all the way through rather than changing from Shakespeare to the channel name, fewer context switches to make while viewing. Pick one format for illustrations, hand written or terminal commands, again the multiplicity is just distracting and wastes precious milliseconds on trying to remember what that screen was about. Finally, the whole "column" based approach may be the ultimate distraction, since the entire middle block of columns is irrelevant, and it appears that what you're trying to store is a mapping from one character to the next in an efficient way. If that's the case, then focus on that, from the point of view of the decompression algorithm.
@tcommbee2
7 ай бұрын
Combining the last and first column gives you the set of all two-letter substrings of the data, each of which will appear at the beginning of one line. I think what was missing in the video at that point is that lines being sorted implies that their beginnings are sorted as well. So sorting the two-letter substrings produces the first two columns of the table. Prepending the last column to that sort result produces all three-letter substrings, which can again be sorted, and so on. You are right though insofar that actual implementations do not reconstruct the block like this due to all the sorting involved.
@3vi1J8 ай бұрын
BZip2 fans: "There are dozens of us! DOZENS!". I know I'm one. Great video, thank you for sharing this insight.
@DarkJMKnight8 ай бұрын
I didn't find this entirely clear... So ultimately you're storing the entire first column and the entire last column, which is 2x the original data length, but because the first column has been sorted, run-length encoding is more efficient on the first copy, and the second copy is similarly more likely to have run length efficiencies... but... enough to encode 2x the original data? And how does one reconstruct the original text from those? If you re-expand the first and last column you can see what followed what, I guess? but then how do you know what comes next? It's been sorted... Just because one T preceded an H, it might also precede an I or an E, how do you know which is actually next?
@michaeldamolsen
8 ай бұрын
"So ultimately you're storing the entire first column and the entire last column" - No, the first column is sorted, so it can be recreated from the last column. Thus there is no need to store it.
@michaeldamolsen
8 ай бұрын
"And how does one reconstruct the original text from those?" The reply by @pr0ntab to the comment from @trevinbeattie4888 (starting with "I feel like I only got half of the story") answered it in a way that finally made it click for me. The comment is about 2 hours before yours, if you sort by newest first (just saying this for convenience in finding it.)
@shadaxgaming
8 ай бұрын
If you don't mind a long but *very* well done explanation of BWT, search for "billbird bwt." It's a few hours and reiterates points made in this video from Computerphile, but you can probably click through them if you understood the transformation part.
@DynoosHD8 ай бұрын
Could not follow half of the steps. Rotating and sorting was fine. But then?
@misterhat5823
8 ай бұрын
Step 1. Collect underpants. Step 2. Step 3. Profit.
@Modvivek
8 ай бұрын
just Google it
@Omnifarious08 ай бұрын
I still don't understand exactly how the Burrows Wheeler Transform works. But, I now think that if I read something more deeply technical about it that it would make sense.
@bgtubber4 ай бұрын
More of these videos, please! 🙂 I'm fascinated the topic of data compression.
@BlackHoleForge8 ай бұрын
This man is the only person I've ever heard say if you hear people laughing, they're laughing at me, and he's completely fine with it.😂
@jd_278 ай бұрын
Great video. I'm completely lost.
@londonbobby8 ай бұрын
Nope, didn't follow that at all. Normally he does a much better job of explaining this stuff. Maybe this needs to be redone with the normal graphics approach and make sure you demonstrate the complicated bits.
@PlasmaHH8 ай бұрын
Even knowing the algorithm I must say that this explanation is rushed and incomplete at best, you are missing at least the importance of the start/end markers!
@colly60228 ай бұрын
it seems like a very low chance that doubling the raw amount of data would result in something smaller than the original data, even after RLE. how can you assume that the last column will be just as "patternly" (not sure what the word is) as the first column?
@negativeseven
8 ай бұрын
The first column is obtained by sorting the last column - you don't store it.
@jbucata
8 ай бұрын
The last column's contents will cluster together based on frequent pairings of what follows them. When the 1st column is sorted you'll have a bunch of "t"'s at the last column in the "h" section (because of "the", "they", "this" etc. etc.), and another bunch of "t"'s sorted together in the "o" section ("to", "toy", "tomorrow" etc.). It's based on how frequently a letter follows after another letter in this particular text.
@jbucata
8 ай бұрын
In contrast, a random-looking string of characters like a KZread video ID, e.g. "dQw4w9WgXcQ", encodes to "wwXQWcdQ49g" which doesn't RLE encode much better at all than the original input. Here it just so happens that both lowercase "w" occurrences are right before a digit, so they both sorted together, but after RLE encoding with lengths that still would likely be longer than the original input itself.
@jamesdurham90278 ай бұрын
It is my favorite compression algorithm, also. Do not feel alone.
@mokovec7 ай бұрын
Why the 2 - was there ever bzip1? The version is still in 1.0 territory and the Wikipedia entry doesn't mention any forks or the like to warrant it.
@Athenas_Realm_System8 ай бұрын
On the topic of compression, I'd like to see a video on Prediction by Partial Matching
@raedev8 ай бұрын
"yes I have a favourite compression algorithm" hey I'm not judging I have a favourite RNG algorithm lmao (it's the Park-Miller/Lehmer LCG, beauty in simplicity)
@davidgillies6208 ай бұрын
This sort of byte-munging is a good candidate for the SIMD instructions in modern processors and, of course the vectorising capabilities of GPUs.
@ivandrofly8 ай бұрын
Very intersting - thanks
@elfboi5238 ай бұрын
question = 2b | ~2b
@elfboi523
8 ай бұрын
The result is of course always true.
@thetastefultoastie6077
8 ай бұрын
@@elfboi523 Since bitwise OR is used, the result is a primitive type with all bits set. So for any signed integer type: question = -1
@axelanderson2030
8 ай бұрын
brilliant
@MetPhotoDotNet8 ай бұрын
I thought I knew the command line reasonably well, but could someone explain what the | $BWT did? His code is bwt, but why prefix that with the $BWT?
@jbucata
8 ай бұрын
It's a shell environment variable. I guess the full command was way too long to type or show in the terminal so he saved it to a variable and referenced that instead.
@MetPhotoDotNet
8 ай бұрын
@@jbucata ah, I knew it was a environment variable, but I see what you mean. It could well something like the path the version of Python (guess) he used with command line options. Thanks
@jacobbush6581
8 ай бұрын
I am just guessing based on what I can see: - $BWT is the script he is running, set by running a command like `export BWT="path/to/your/script"`. - The script seems to take a parameter: I am assuming that the script can do multiple things. In this case he passes it `bwt` which is probably the matrix generating code. - The `|` is a pipe: it takes the standard output from the previous command and passes it as standard input of the next command. - So `echo COMPUTERPHILE | $BWT bwt` sends the string "COMPUTERPHILE" on STDIN to `$BWT bwt` which does its thing: computes the matrix and echos it to STDOUT (which gets printed to the terminal) Hope that helps :)
@DrSteveBagley
8 ай бұрын
As jbucata said, BWT is an environment variable that contains the full path to the C program I wrote to demo the transform. The bwt is a parameter that program that tells it what mode to run in…
@rachel_rexxx8 ай бұрын
Fascinating
@Modvivek8 ай бұрын
Great video!! I recall doing an assignment on my college days where we are given a dna sequence ( bw transformed )of some billion ATCG characters. Our goal was to find match of the given pattern (with few mismatches if not exact) on the original sequence.
@inanefool87818 ай бұрын
Oh damn, I love burrows wheeler
@Patterner8 ай бұрын
i'd be worried if you didn't have a favorite algorithm or a favorite mathematical formula. for me: Fast Fourier Transform and Euler's Identity
@sveu3pm8 ай бұрын
but didnt you doubled your start set? so now then you compress it more-less back to starting size?
@SoulSukkur
8 ай бұрын
just for clarification, is your question: "if you need to store two copies of the data, i.e. the first and last column, how are you saving space?" if that's your question, the answer is: you don't store the first column. remember, the first column is simply all the data in alphabetical order. so, if you take the last column and sort it, you now have a brand new copy of the first column again, and it doesn't cost you any space in the file. (it just costs the time needed to sort all that data at decompression time.)
@digmsymii3218 ай бұрын
Nice try! Part two, please.
@pafnutiytheartist7 ай бұрын
This video is so compressed - it's impossible to follow. How do we unsort the data when decompressing?
@Lion_McLionhead8 ай бұрын
How do these guys come up with these algorithms? Leetcode?
@YouPlague4 ай бұрын
the video is missing vital parts to actually understand the transform. Also an information about bzip2 being only good for texts really, which it is hinted at.
@Teagles2168 ай бұрын
Is Dr. Bagley a Cleveland Guardians baseball fan? I'm surprised to see that hat in Nottingham.
@copzi
8 ай бұрын
I noticed too 😮
@supercompooper8 ай бұрын
Love me some Lyndon words 💗
@masterdjon8 ай бұрын
How sad? That's the question!
@ipadista8 ай бұрын
Why strictly single character-based approaches? Isn't the obvious approach when it comes to natural language text to find recurring patterns? In the short Hamlet quote "to be" occurs twice, so just needs two pointers for where they start. As the text gets longer, more and more repeating patterns will be found.
@Meta7
8 ай бұрын
This is exactly how some encodings work (e.g. LZW), but the one being discussed is not one of those.
@janAkaliKilo
8 ай бұрын
There are encodings that are optimized for human-written text. But most popular (gzip, bzip, etc) are all generic for all kinds of data. And you won't find that many patterns in the computer-generated file formats that a lot more frequently need to be compressed.
@rb14718 ай бұрын
BZ2: Let me uncompress your string of N characters N times for a total of N*N*log(N) time complexity
@toby9999
8 ай бұрын
What's your point?
@rb1471
8 ай бұрын
@@toby9999 If you need it spelled out for you it's slow to decompress.
@darshanputtaswamy31998 ай бұрын
I feel proud to be CS grad 😅
@jon91038 ай бұрын
4:50 aka a shift register
@MorrisTart8 ай бұрын
I still don't get it, and I've written compression algorithms. But then I did it from scratch and don't have a maths degree.
@stensoft8 ай бұрын
Seeing the laptop name, is Steve a browncoat?
@AndrewGillard
7 ай бұрын
I was looking for this comment :) I'm surprised you're still the only one to have mentioned that, 10 days after publishing!
@PaulaBean8 ай бұрын
Am I the only one who can't follow the explanation? And can bzip work on binary data?
@jimmyporter89418 ай бұрын
I didn't understand this one at all. It must have been quite off-putting for Steve Bagley to have a room full of his peers watching whilst he filmed the video. I bet that put him off.
@billli118 ай бұрын
I think this a how BWT work for example abcd. you have abcd bcda cdab dabc the last column d a b c sort it and you get a b c d and beacuse the orig is sorted. now you know a??d b??a c??b d??c last character is before first character, so you get da ab bc cd sort it and now you know ab?d bc?a cd?b da?c and beacuse last character is before first character again. you can keep doing it until you get all that back.
@shadaxgaming
8 ай бұрын
This made it much easier to understand, thank you. How would you introduce a stop character ($) into this explanation?
@LeeSmith-cf1vo8 ай бұрын
I my experience bzip2 is only marginally better than gzip. Rzip is considerably better, at the cost of time taken to compress. But I believe all Rzip does differently is to increase the search space?
@azrobbins01
8 ай бұрын
But is it easy to explain how it works to a crowd of people without leaving everyone scratching their heads? Just because it works better, does not mean it will make for an interesting video with easy to grasp concepts.
@LeeSmith-cf1vo
8 ай бұрын
@@azrobbins01 indeed. My comment is an observation (and a question) , not a criticism
@squid11160
8 ай бұрын
In my experience with compressing large .dds files and textures, bzip2 is much much better than gzip, when comparing their highest (reasonable) compression settings. I picked bzip2 because even though the time to compress is much much longer, it pays in the end because of the amount of users my project is used by.
@LeeSmith-cf1vo
8 ай бұрын
@@squid11160 I guess like most things - it depends.
@jansenart08 ай бұрын
Altavista is mentioned (satirically) multiple times in Parks & Recreation.
@Sonny_McMacsson8 ай бұрын
2B∨¬2B
@BTWAIDS8 ай бұрын
dope!
@Heinz-bx8sd8 ай бұрын
Cool
@carlosmspk8 ай бұрын
"over the age of 25"? I have bad news for you...
@AndrewGillard
7 ай бұрын
Yeah he's off by about a decade there :) I'm 36 and think I only used AltaVista for a year or two around 1998~2000 when I was 11-12 and first had Internet access in my secondary school's library. I doubt there were many toddlers online back then :) I could easily be mistaken about some of the details of that time, so don't complain too much if I'm wrong about something 😅
@vadrif-draco8 ай бұрын
Bbeenoooorttt
@Ivo--8 ай бұрын
bzip2 is so slooowwww
@unvergebeneid8 ай бұрын
How sad is it to have a favourite compression algorithm that is outperformed on every level by LZMA? 😘
@vylbird8014
8 ай бұрын
I wonder if combining them would be worthwhile. Hmm.
@Axman6
8 ай бұрын
LZMA is very slow though; bzip2 isn’t fast either for it’s a better choice for many applications. These dats LZ4 and Zstd are the way to go though.
@unvergebeneid
8 ай бұрын
@@Axman6 I've read some very thorough tests of xz which seem to show that xz is faster than bzip2 at a similar compression ratio, with the option of compressing much better at higher settings.
@Scriabinfan5938 ай бұрын
I'm lost. cool video though.
@jsherborne928 ай бұрын
The Dr has lost weight!
@fnerf08 ай бұрын
Quite bad video. Not explaned at all
@mechwarrior838 ай бұрын
A rare miss for this channel. I did not follow this at all, sorry!
@HoD999x8 ай бұрын
didn't the same guy badly explain something else before? i forgot what it was. or maybe the video was badly cut?
@mrvincefox8 ай бұрын
What a confusion explanation.
@izaradesigncom8 ай бұрын
as always, super interesting for a non computer scientist. and love the shirt!