Low Level Learning
Жыл бұрын
1,523,654
1

computers suck at division (a painful discovery)

Ғылым және технология

I tried to take on a simple task. I TRIED to do a simple assembly problem. But, the flaws of the ARM architecture ultimately almost defeated me. Computers suck at division, and there's a few reasons for that. Division is so hard for computers, that the ARM processor core didn't have a divide instruction until 2004. Even now, certain ARM Cortex M series processers don't have the divide instruction.
So then, how do the processors do division? Watch the video and find out ;)
🏫 COURSES 🏫 Learn to code in C at lowlevel.academy
📰 NEWSLETTER 📰 Sign up for our newsletter at mailchi.mp/lowlevel/the-low-down
🛒 GREAT BOOKS FOR THE LOWEST LEVEL🛒
Blue Fox: Arm Assembly Internals and Reverse Engineering: amzn.to/4394t87
Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation : amzn.to/3C1z4sk
Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software : amzn.to/3C1daFy
The Ghidra Book: The Definitive Guide: amzn.to/3WC2Vkg
🔥🔥🔥 SOCIALS 🔥🔥🔥
Low Level Merch!: lowlevel.store/
Follow me on Twitter: / lowleveltweets
Follow me on Twitch: / lowlevellearning
Join me on Discord!: / discord

Пікірлер: 2 400

@LowLevelLearning9 ай бұрын
Wanna learn MORE awesome stuff while helping out the channel? Go get a FREE month of Skillshare using my link: join.skillshare.com/learn-1/?coupon=AFF30D23
@LithiumDeuteride-6
5 ай бұрын
The "Degrees" class unambiguously requires floating-point numbers.
@user-qh5br9jl9g
4 ай бұрын
What about the neon DSP code? That handles div. Wide accumulated Also.
@sprytnychomik Жыл бұрын
Other solution: don't use Fahrenheits.
@LowLevelLearning
Жыл бұрын
Imperial Unit Gang
@PRIMEVAL543
Жыл бұрын
The good old freedom units that you got forced on by the uk cause nothing stands mor for freedom than the British empire. Did you learn about picas and points btw? And can you tell me how many miles 1 pica + 3 points are?
@renakunisaki
Жыл бұрын
@@PRIMEVAL543 about 8 football fields give or take an eagle or two
@NinjaRunningWild
Жыл бұрын
F that!
@devinnall2284
Жыл бұрын
For Non-Americans an easy way to understand Fahrenheit is to think of them as a percentage anything below 40% is cold 40% to around 75% are nice medium temperatures and anything above that is hot as hell!
@Finkelfunk Жыл бұрын
What you discovered here is something called "Division by Invariant Multiplication". That topic is so insanely complicated that a good number of people have written dissertations in computer sciences about this stuff. Long story short: It makes division go vrooooom on your processor.
@null6482
Жыл бұрын
Where can I find more info about this
@solomontan1524
Жыл бұрын
This comment should be pinned so everyone can learn about the name for this concept. Very interesting stuff
@adamrak7560
Жыл бұрын
if you want something even more crazy, look up fast inverse square root algorithm
@christopherbroms2508
Жыл бұрын
@@null6482 en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant
@okkoheinio5139
Жыл бұрын
@@MustacheMerlin wow, that sounds cool
@snoupix3332 Жыл бұрын
I love how us, programmers, are spending hours and nights to not use an existing external lib just to say "I've made it myself" 🤣
@themalaysiandude3903
11 ай бұрын
nothing ever beat than making your own function all day instead of just copy pasting
@OhhCrapGuy
11 ай бұрын
I think the real reason we do this isn't because we don't like libraries, it's because we want to understand. There's so many times that I've written something myself, fully understood the entire problem, gotten the bugs worked out, and as soon I was entirely satisfied with my understanding, I simply immediately replaced my solution with a library, because I don't want to maintain any of that nonsense!
@JonJenkins1982
11 ай бұрын
If you do that, you acquire skills that easily translate to other realms of computational problem solving.
@OhhCrapGuy
11 ай бұрын
@@JonJenkins1982 not only that, but even for higher level programmers, every time you deal with bit twiddling and such, you learn to think that way even better. Case in point, I had mostly figured out the solution for this only a few moments after he pointed out that ARM M0 cores don't have division capabilities, or at least close to it. I was thinking the solution was overflowing the irrelevant bits out of the multiplication register, but this is effectively the same thing, just putting them into a word that we simply ignore. It's not that I, or anyone else, have some sort of unique insight into how it works, I never would have thought of that earlier in my career. It's just that being exposed to these sorts of cool hacks over time gives you the tools you need to come up with new bit hacks when you encounter a problem.
@ricardocasimiro6424
10 ай бұрын
agree
@JustAnotherAlchemist Жыл бұрын
The generalization of this concept is called "division by multiplicative Inverse" or less commonly "multiplication by the reciprocal" and is a relatively common practice when you have a division where the divisor is a constant. It's talked about a lot in low-level books, like "Hacker's Delight." Basically, you precompute the division by encoding it as a reciprocal. It's fatal flaw is you have to know what you are dividing with from the start. To go farther, you have to use something like the Newton-Raphson method for division. This is the same as the above, in that it finds the reciprocation quotient. The first major difference is that it is a run time operation. That is, it FINDS the needed reciprocal during execution, no precomputation. The second major difference is that it uses an approximation to division _as a function_ to get close to the correct result, then uses Newton-Raphson method to refine that guess to as many digits as you desire. For 32bits, I find that 3 iterations is plenty. Incidentally, all of this is possible because floating point/fixed point number systems encode multiplication and division _into_ the number itself, in the same sense that having a sign bit encodes addition and subtraction into a number. Floating point, specifically, also encodes exponentiation into the number, which is why it is the usual goto number system. ... The comment claiming this to be called "Division by Invariant Multiplication" is the first time I have ever heard it called specifically by that name, and has just one research paper with that exact title... Wikipedia calls it "multiplication by [the] reciprocal" which someone responding to said comment linked to as well. multiplication by the reciprocal is a basic math operation, and gets you a lot more hits than "Division by Invariant Multiplication." It's also certainly not complicated to understand, as this ~5 min video has clearly demonstrated.
@idonnow2
11 ай бұрын
Based informative comment
@rickroller1566
5 ай бұрын
Doesn't Newton's method use division? What am I missing?
@JustAnotherAlchemist
5 ай бұрын
@@rickroller1566 en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division ...specifically, you want "Xi =(2-DXi) ... which can be calculated from Xi using only multiplication and subtraction, or using two fused multiply-adds. "
@pulakgautam3536
4 ай бұрын
you rlly are an alchemist
@Fujui
2 ай бұрын
Why dont you divide by 1.8
@chsovi7164 Жыл бұрын
"I didn't want to use an incredibly low level library, because it felt like cheating" Been there before, I know where this road leads. I wish you luck on your future breadboard CPU project!
@MayorMcC666
Жыл бұрын
Either breadboard or TempleOS, one of two roads
@mikicerise6250
Жыл бұрын
RNJesus speaks to me.
@EmptyZoo393
Жыл бұрын
Seriously, there is a reason high level languages exist. If you can let the compiler and the language take care of your branch tables for you, you can save a lot of development time. These days, the only reasons to work with assembly are for SIMD optimization, extremely optimized library functions (often involving SIMD), and occasionally interrupt routines. There are occasional times with small micro-controllers like this where you need to use assembly, but code the compiler creates is typically roughly equivalent to what you'd create, and often better. Don't underestimate the power of auto-loop unrolling and function inlining. The linker can take all sorts of shortcuts that you wouldn't think to.
@NoX-512
Жыл бұрын
@@EmptyZoo393 What's the fun in that?
@TheDavidlloydjones
Жыл бұрын
@@mikicerise6250 Do you speak Aramaic? What has He been telling you? Lemme guess. "Donald Trump has been sent to save you all" maybe?
@filipesaz Жыл бұрын
For those of you who, like me, had some difficulty understanding his explanation. His final calculation is something like: C = (F - 32) * 5 * Const / (2^32); where Const = ceil((2^32) / 9); He divides by 2^32 by grabbing the highest 32 bits of a little-endian architecture number, because it would be equivalent to bit shifting the 64-bit positive number 32 times to the right.
@danielschneider9358
5 ай бұрын
Thx! That actually cleared things up nicely :D
@ashwiniabhishek1504
5 ай бұрын
Thanks, it cleared up my doubts
@User-je7gf
Ай бұрын
You are a wizard
@kensmith5694 Жыл бұрын
As someone who does microcontroller stuff, I often do that sort of thing. There is even a method that allows you to effectively divide by a variable amount. It is based on two observations: 1) A/B = (N*A)/(N*B) so you can pick an N to make the denominator something easier to deal with 2) For X very small 1/(1+X) = 1-X this allows you to deal with any number near a power of two without a divide.
@jensknudsen4222 Жыл бұрын
This brings back painful memories of 8-bit computing. Consider yourself lucky to have a multiply instruction.
@svampebob007
Жыл бұрын
I remember building a 8 bit cpu, the division circuit was by far the longest to figure out. I started with obviously +-, *... then tried for months to figure out how to do division (without obviously looking for the answer online) once I realized how you could do division with substractions and additions it took me just a couple of weeks to deal with IO and W/R to ram.
@linuxization4205
Жыл бұрын
just add the number multiple times if you need multiplication.
@stinchjack
Жыл бұрын
I dabble with Z80 assembly. Multiplication is a charm. e.g. multiply HL register by 10 push de ld d, h ld e, l add hl, hl add hl, hl add hl, de add hl, hl pop de I've written something like that so often by now its just muscle memory :D
@daishi5571
Жыл бұрын
Motorola 6809 (1978) has a multiply command, takes register A and multiplies that by register B then combines the A & B registers to make a register D (which is 8+8 bit so 16 bit) and outputs the multiplication result.
@robertherndon4351
Жыл бұрын
Ah, tables of squares and difference-of-squares multiplication methods. And then there are Newton-Raphson approximations for computing reciprocals so you can pipeline divisions using multiplication. All good fun...
@Soupie62 Жыл бұрын
5 / 9 is roughly 569 / 1024. Dividing by multiples of 2 is an Arithmetic Shift Right. Multiply by 569, shift the result Right by ten bits. Multiply by 36409, shift 16 bits... and 2330169 with a shift of 32 bits merges nicely with your code.
@VictorCampos87
Жыл бұрын
Sir, this is what I call an elegant solution. Easy to understand. Easy to code. Easy to process. And give the most precise results available for this architeture (I think).
@Kyle-ke5fx
Жыл бұрын
@@VictorCampos87 This is exactly why they don't have these operations on the processor. Because skilled embedded developers don't need it and it would make it too expensive or large for the applications it was built for. People that do this every day generally don't need it.
@joshuahudson2170
Жыл бұрын
@@Kyle-ke5fx What do you do when a variable divide comes up?
@TrackedHiker
Жыл бұрын
@@joshuahudson2170 do it the same way we do long division by hand in decimal
@joshuahudson2170
Жыл бұрын
@@TrackedHiker That's a guess and check pattern.
@fireball2275 Жыл бұрын
I like how it changed to night with the easy bit at the start
@LowLevelLearning
Жыл бұрын
FINALLY someone noticed that XD
@fireball2275
Жыл бұрын
@@LowLevelLearning legend
@Xevos701
Жыл бұрын
The Firestorm core in Apple M1 has a dedicated Divider ALU.
@foobar1500
Жыл бұрын
@@Xevos701 The Firestorm implementation of integer division is surprisingly efficient, but it's nonetheless 2.5-4 times worse in performance than the corresponding multiplication.
@DavidGarcia-se3ko
Жыл бұрын
Multiply by 555, then bit shift
@Olav_Hansen Жыл бұрын
As soon as you aren't working with whole numbers, most humans start to break at division as well. 90/9 is something most people can do, 90/8 becomes a bit tricky to some, but 90/7.6 will require most people to pull out a writing sheet to 'cache partial results', and even makes us struggle past this point. And this is while we are the creatures that learn how to divide things evenly with others even before we learn math.
@farfa2937
10 ай бұрын
I'd say that most people (including myself) would actually no even try to do 90/7.6 by themselves at all, sheet or not.
@nerobernardino88
5 ай бұрын
@@farfa2937 My solution would be: "90/7.6=Fuckyou"
@llllNEOllllchannel
4 ай бұрын
90/7.6=900/76
@ss3nm0dn4r8
3 ай бұрын
the big difference here is that 8 and 9 have 1 digit while 7.6 has two not the fact its not a whole number if you say divide by .9 now its a cakewalk for most people again
@Olav_Hansen
3 ай бұрын
@@ss3nm0dn4r8 I wouldn't say most would find it easy again (althought definitely easier then 90/7.6), but that's also because the solution is a whole, natural number again.
@johnnyvsx Жыл бұрын
I used this same technique years ago. Another way of thinking about it is that you’re converting your fraction to a close fraction which has a power of 2 as the denominator since that will just be a bit shift operation. It was an old intel 8088 microprocessor. It had a divide opcode, but execution took over 160 clock cycles. The real-time code couldn’t finish processing one block of data before the next arrived. Converting all the multiply & divides into multiply & shifts saved the day.
@richardericlope3341 Жыл бұрын
Even as late as the the Nintendo DS ARM 9, we needed to use fixed point mathematics. The hardware registers themselves accepted only fixed point values. Same with trancendental trig functions. We had to use lookup tables in BRADIAN(Binary Radian) units interpolated on the fly. Fun times.
@circuit10
Жыл бұрын
Did you program DS games?
@raymundhofmann7661
Жыл бұрын
Fixed point or fractional integers can in fact do adequate calculations for most practical things and in signal processing, it just gets increasingly burdening on the programmer to decide where the decimal point is put best for each number represented in the program to not lose precision or keep the S/N. Consequently, it is easier for the programmer to have this done at runtime with floating points, but also more wasteful and also possibly numerically more unstable/indeterministic when dealing with feed back or iteration over data sets. Scientific calculations or a FFT (and other transformations) are examples where floating point calculations may be needed. In a FFT for example all data points in the time domain may contribute to only one data point in the frequency domain and vice versa, floating point may be the better choice to deal with the bigger dynamic of the transformed compared to the source data, compared to just increasing integer bit precision overall.
@tomcombe4813
Жыл бұрын
Yep one of the hardest problems in computer design has always been finding a good way to implement division. There have been some creative solutions, like in the cray-1 supercomputer which would instead compute a reciprocal that you could then multiply to do your division. But often, designers just omit division all together from their CPUs.
@grimonce
Жыл бұрын
Well this was just division... By constant....
@declanmoore
Жыл бұрын
The DS also has a divider unit, which makes up somewhat for the lack of support by the CPU :p
@razorstone3088 Жыл бұрын
bro's just casually programming some ARM assembly. love it man stuff like this makes me motivated to learn low level stuff even though it looks kinda scary to me
@LowLevelLearning
Жыл бұрын
Do it!
@Yas-gs8cm
Жыл бұрын
I think it's a waste of time beyond knowing "what is assembly and how it works + basics" (Note: I, myself, know the "basics" I mentioned above. It's logical that "knowing" is always superior comparing to "not knowing"...)
@andrewdunbar828
Жыл бұрын
Learning it will improve your high-level coding. Worth it!
@LuskyMJ
Жыл бұрын
@@Yas-gs8cm Once you learn assembly your thought process never becomes the same again. Even if you work with higher level languages you still constantly think about: "What happens once this gets converted to assembly". It helps you develop new ways to solve problems that you wouldn't have if you never coded in assembly.
@ImperatorZed
Жыл бұрын
@@Yas-gs8cm this isn't even the basics though, this video is super simple stuff
@JaseewaJasee7 күн бұрын
the depth you go into each topic is remarkable, truly enlightening!
@NothingButFish Жыл бұрын
Taking me back to applying newton's method for calculating square root in MIPS assembly as a programming assignment in an introductory computer architecture/CS class. I think more than anything else it taught me to respect all of the crazy shit that's happening at the lowest levels (although of course it goes even lower).
@jasonenns5076
2 ай бұрын
MIPS is dead! The company responsible for MIPS now is RISC-V.
@jameshogge Жыл бұрын
A couple of extra notes: - This division method (multiply+shift) is attempted by compilers whenever the divisor is known at compile time. It isn't used to divide by an unknown number as the constants can't be predetermined. - Another alternative to implementing these things in hardware is to translate the instruction into micro ops on the cpu. (I can't say for certain which instructions do this but I believe all exponential/trig functions usually do.) I believe the reciprocal is usually calculated in hardware so a division might be converted to this plus a multiplication And yes, regular divisions can take 100s of cycles
@weetabixharry
Жыл бұрын
There are a lot of iterative algorithms for computing divisions. My favourite is successive approximation: x = a/b, we rearrange to x.b = a. We then *guess* the value of x 1 bit at a time, by setting each bit to '1' from MSB to LSB and checking if x_guess.b is larger than a. If it is smaller (or equal), then we leave that bit set. If it is larger, then we reset that bit to '0'. This algorithm works nicely for a lot of monotonic functions with simple inverses (e.g. division --> multiplication, square root --> square, and there is even a clever trick for calculating logarithms this way).
@weetabixharry
Жыл бұрын
Of course, that algorithm is slow, but I think it is elegant. In digital hardware, it needs N clock cycles to calculate an N-bit result (but uses an incredibly small amount of silicon). In a CPU, much longer.
@absalomdraconis
Жыл бұрын
@@weetabixharry : I wonder if anyone's tried to shorten it with a "binary search" or priority encoder?
@weetabixharry
Жыл бұрын
@@absalomdraconis There are plenty of ways of doing it faster in hardware, it just costs more transistors. Depending on the clock frequency, it may even be possible in 1 clock cycle. The best solution depends on the use case.
@weetabixharry
Жыл бұрын
@@absalomdraconis And it already is a binary search (because we're setting 1 bit at a time, not incrementing the value +1 for each guess).
@evan7721 Жыл бұрын
Once you described how that weird arbitrarily large number was found/represented, it reminded me of the Fast Inverse Square Root function in Quakes engine that tells the computer to cast a float as an int with no conversion
@enriqueamaya3883
Жыл бұрын
Jesus Ioves youc[\zx]c[zx\]c
@endergamer7444
Жыл бұрын
//Evil floating point bit level hacking.
@dylon4906
Жыл бұрын
@@endergamer7444 // what the fuck?
@paulkirjonen1226
Жыл бұрын
@@endergamer7444 // what the duck
@fire_man3173
Жыл бұрын
//Newton Iteration
@MrEnyecz Жыл бұрын
It is nice to see young kids discovering the wonders of the world! :) What if I told you that the famous 6502 CPU which powered such machines like the C64 (and many-many others) had not only no division but no multiplication too. And it was still able to rule the 80s.
@kc9scott
10 ай бұрын
When I was using Apple II, I wrote a macro library for 16- and 32-bit int operations like add/subtract/multiply/divide. But when I got to wanting to do floating point from assembler, I cheated and just bought a floating-point processor board.
@whannabi
5 ай бұрын
You simply have to implement that in software as far as I know. It will just be slower but that's It.
@karmatical5837
5 ай бұрын
@@whannabiwell, by deciding to work with floating points, you kinda accepted the fact that you will sacrifice memory and performance. After some time studying hardwares, specially old ones, i got why its that so fundamental and common.
@TheNvipy
5 ай бұрын
Traditionally, we used log/exp approach on 8bit/16bit. Hitachi and Mips RISC cpu work around the problem. On mips, multiplication and division are carried out in parallel and store the result in specific registers. Hitachi "sh" cpus provides a division by part, we can set the required precision: one instruction to initialize, another to perform a step.
@MrEnyecz
5 ай бұрын
@@whannabiYes, of course, it was Turing complete. I just said that it was even more primitive, but still a CPU. Getting surprised that some CPU has no division shows that you're way too young. :)
@RokeJulianLockhart.s13ouq Жыл бұрын
This is why, whenever you're programming, and realize a solution that seems like cheating, do it. Cheat.
@watsonwrote Жыл бұрын
I love how this highlights how division is fundamentally different from the other basic operations. It opens up a whole new domain of numbers and we have to contend with that.
@goldenhate6649
Жыл бұрын
The mathematical proofs for multiplication and division is what made me hate math. I am an chem engineer by education and environmental by trade. I love equations, but screw any sort of proof. Hated every second of every math class I took in college for that reason. I would have loved to complete matrix theory, (still have nightmares there), but proofs made me quit. I think I made it less than a month. Kicker is, the class even said it WASN'T going to have proofs.
@Your_choise
Жыл бұрын
In a sense, subtraction also opens up a new domain as it gives negative numbers.
@coopergates9680
Жыл бұрын
@@goldenhate6649 Pretty much, I do recreational math all day, but forget deep dives into multivar calculus or other long routes by hand just to demonstrate a simple conjecture
@weirdlyspecific302
10 ай бұрын
It simply isn't. Subtraction does the same thing.
@antoinem3659
6 ай бұрын
@@weirdlyspecific302 but division open up more
@gapplegames1604 Жыл бұрын
My favorite trick is to take some huge power of 2, let’s say 1024. Turn 5/9 into the closest fraction with 1024 as the denominator. In this case you would get 570/1024. Now all you have to do is multiply your binary number by 570, and then move the decimal 10 places to the left. I’m sure you guys know way better ways to do this, but this was my trick when I had to write code in assembly for my computer science class.
@gapplegames1604
Жыл бұрын
Just realized this is almost exactly what you did in your video!
@user-dh8oi2mk4f
Жыл бұрын
@@gapplegames1604 This is exactly what he did in the video
@meneldal
Жыл бұрын
@@gapplegames1604 Depending on the architecture and the number of bits you have there are tradeoffs with precision/max value you can accept. Since 32 bit multiplies tend to be faster (on cheap hardware), something with a lower bit shift like your example (but most likely 16 to use tricks with using just high/low word) has its uses in some situations. If you're stuck with an 8-bit AVR, you're definitely not doing 64 bits multiplies (and probably not 32 either).
@gregorymorse8423
Жыл бұрын
Except multiplying by 590 reduces the domain and causes overflow in cases the better method doesn't. You usually like to write crappy buggy code, without consideration of any problems or limitations or domain restrictions. Yes we know. Hopefully nobody is paying you for such garbage tho
@rileyalden723
Жыл бұрын
@@gregorymorse8423 jesus christ bro chill, he's explaining how he did it "for my computer science class", he's just working with the closest possible precision he can given the limitations of practical time constraints and diminishing returns. For the purposes of understanding how assembly works, the goal of most courses that are going to require a basic understanding of this kind of math, this is a perfectly sufficient method.
@EvilidelRio Жыл бұрын
In the not-so-good old days, on an IBM mainframe and before IEEE math, we've got a lot of fun adding 100 events each one with a weight of 1/100... just to discover after many hours reviewing the code that 1/100 is a periodic fraction in binary! So we changed and weighted 128 events.
@chrisnatale5901 Жыл бұрын
Nice solution! I remember reading a book back in the 90's called "Game Programmer's Black Book" that detailed workarounds for faster division performance on the 486's and Pentiums of the day.
@Ma1ne2 Жыл бұрын
Loving this type of video format where you just tell a short story about something you have been working on yourself and learning something new! Would love to see more of this format!
@LowLevelLearning
Жыл бұрын
Thank you! I really had fun with this one
@enriqueamaya3883
Жыл бұрын
Jesus Ioves youx[]cpzx[c]
@MrZcar350
Жыл бұрын
And has the magic number (at least in the 32-bit version) 0x5F3759DF.
@lonegecko Жыл бұрын
I've got some bad news... looks like the Cortex M0, M0+ & M1 don't support the SMULL instruction (32-bit multiplication with 64-bit result). You only get the lower 32 bits. If you set the compiler flag `-march=armv6-m` in Godbolt, you'll see it calls a library function for division (`__aeabi_idiv`) instead.
@nahco3994
Жыл бұрын
Yup, the same thing tripped me up when I was fooling around with some playful attempts to do baremetal stuff on a Raspberry Pi 3B, which has a Cortex A53 I believe. I didn't bother to include or link any libraries, because I was only doing very basic arithmetic and pointer shit anyway, outputting over UART. I was quite surprised that I was getting an error referencing this function call, in a place where I fully expected a plain integer division to be.
@foobar1500
Жыл бұрын
I strongly doubt if he actually needs to convert millions of Fahrenheit to Celsius. For a reduced range a 32-bit multiplication with a 16-bit constant and a 16-bit shift is enough for correctness.
@Carewolf
Жыл бұрын
Dont code for armv6... That stuff is ancient.
@ABaumstumpf
Жыл бұрын
@@Carewolf "Dont code for armv6... That stuff is ancient." ah yes - the "ancient" 9 years ago... cause nobody would ever use something that was first released that long ago............................................................ yeah no.
@renakunisaki
Жыл бұрын
@@foobar1500 hopefully anyone working with temperatures over 32767 degrees (in any scale) is already using Celsius or Kelvin 🤪
@mervmartin21125 ай бұрын
At the machine level, computers don't subtract, multiply, or divide. They add, compliment and shift. So, in binary, shift the dividend left past the radix as many times as you want significant digits. Two's compliment the divisor and add it to the dividend until the quotient sign bit is negative. Shift the quotient right the same amount you shifted the dividend left. Viola! The decimal is not called decimal in any other number base but 10. It's the radix in any number base. The video was great! Good job digging into the software that deep!!!!
@__christopher__
Күн бұрын
At the machine level, they also don't add, either. Addition is build out of purely logical operations.
@mervmartin2112
Күн бұрын
@@__christopher__ Yep. And even the logic can be expressed in discrete components. And those in gathered natural materials. Sand, dirt, burnt things and wizardry ;-D
@jeffcauhape6880 Жыл бұрын
Awesome! I have recently decided to go back and learn some assembler, and chose the ARM chip. This is gold! Thank you.
@KillerSpud Жыл бұрын
I've been developing on a Motorola/freescale/NXP DSC family chip for a few years now and I'm always having to do multiply shift approximations. A simple divide operation takes a few hundred microseconds, and when I'm running a one millisecond loop that can blow things out quick. This is just one of those cases where close enough really is close enough.
@LowLevelLearning
Жыл бұрын
Yeah I’ll probably do another video for this but it’s crazy that even on modern CISC chips DIV instructions take sometimes up to 12-20 clock cycles
@johntrevy1
Жыл бұрын
@@LowLevelLearning But then the human brain typically struggles with division the most too. I still can't even get my head around long division lol.
@slicer95
Жыл бұрын
@@LowLevelLearning That is because Division is one of the hardest problems in Computer Arithmetic.
@UnblockMind
Жыл бұрын
Maybe we should teach them short division method.
@andrewdunbar828
Жыл бұрын
@@LowLevelLearning If you want to realize you're still coding at a high level even in asm, then try to implement a divide yourself in FPGA (-:
@rparl Жыл бұрын
The first computer which I programmed in Assembly was the IBM 704, in 1960. There were two floating point division instructions, Divide AND Stop & Divide OR Stop. This had to do with the requirement that the result had to be less than one. So the inputs had to be scaled to make sure.
@HansStrijker Жыл бұрын
If you need to do many chained calculations containing different operators, it can be faster to do them as rationals, and only do the final division at the end. Summation and division will end up to be three operations minimally, but it can pay off quite a bit for lower complexity with other operations. Though it can quickly run into rollover issues, so might need to keep an eye out for that, and apply a simplify step (gcd algorithm) where necessary.
@owensmith7530 Жыл бұрын
The ARM1 processor didn't have a multiply instruction either. Multiply was added for the ARM2 not really for maths but because of how often the C compiler had to calculate addresses (eg array members) and multiply is very useful for that.
@henriksundt7148 Жыл бұрын
If you had divided by a variable instead of 9 in the C source, the true face of division would show up. The code would then have to call a runtime division algorithm instead of multiplying by a constant. The ARM assembler code for efficient division is worth a look.
@enriqueamaya3883
Жыл бұрын
Jesus Ioves youx]c[zx\c[x]z[c
@hayfahvytsen Жыл бұрын
Before the prevalence of div instructions and FPUs this was standard practice in embedded systems and decades of products used fixed point math to do incredible things, including keeping your car on the road in the winter!
@rulee9543 Жыл бұрын
This is a very great video. Thanks for making it! I am an silicon engineer developed processors and I am surprised with ARM's brilliant solution! That make die smaller and power efficient!
@arnabbiswasalsodeep Жыл бұрын
Booth's multiplier is so great, it can be even used for division for infinite precision given infinite time as it operates in clock cycles & can be bit banged as well.
@dr.strangelove5622 Жыл бұрын
Really cool video!! It feels kind of great when I get to see someone else passionate about assembly language. This video reminded me about the time I was figuring out on how I should multiply and divide a number in the assembly language of ATMega8A microcontroller (and later with an 8085 processor): I had to use repeated addition and subtraction to get the job done. Later on, when I worked with 8086 microprocessor I got to know that division is an expensive operation (just like LOOP and REPNE instruction in x86 processor. Even in x86-64 family of processors, they are just artificates which no one has ever optimized, so everyone favours JMP instead of LOOP). Again, this was a great video! 👍👍
@LowLevelLearning
Жыл бұрын
Thanks for watching! :D
@menotworking Жыл бұрын
Those of us who work with smaller micros do this on an almost daily basis. Create a multiplier/divisor pair with the divisor a power of 2. 2^8, 2^16 and 2^32 are especially nice because they just involve discarding low bytes. The only caution is to make sure your intermediate multiplication cannot overflow. For "normal" temperatures, a 2^8 divisor is all you need. 5/9 = 142.22, so the multiplier can be 142 or 143. Now you work back to determine the maximum input temperature that will avoid overflow, so 65535/142 = 461 and change. The conversion will work with inputs up to 461 degrees, with an error of about 0.16%. You can also add a rounding factor of 256/2 to the multiply before the 2^8 "division".
@kenhaley4 Жыл бұрын
I'm old enough to remember the days when all calculators were mechanical (sometimes electrically powered, but not always). Unless they were the more expensive variety, they did not support division. I specifically remember my first hand-held 4-function electronic calculator (circa 1960) and being truly amazed that it could divide! There was actually a short pause (maybe 1/8 second) before it displayed the answer to a division problem.
@CosmiaNebula Жыл бұрын
old Babylon style division: use Newton's method to find 1/y, then multiply x * (1/y) This is actually often used on the microcode level, so that the ALU doesn't have to implement one-step division. It trades computation-time for chip-space.
@SansNeural Жыл бұрын
Old timer here... visiting you from the days before MUL and DIV instructions in microprocessors. I should be well versed in all manner of integer math operations, but thankfully I've been able to forget (or at least not recall and use) all that for years now. Heck, for some of the industrial machines (Z80 based) I programmed in the '80s we splurged and added a math co-processor card! Probably the main takeaway in your journey shown here is that ARM is older than we tend to think. Born in the '80s itself, ARM was radically new (RISC) but still influenced by its times. Also, adding hardware DIV might have been considered antithetical to the new RISC concept.
@HappyBeezerStudios
Жыл бұрын
And nowadays RISC and CISC aren't really meaningful anymore. The most widely used RISC designs have tons of instructions and extensions and the most widely used CISC designs translate CISC instructions into RISC-like µOps
@stevenpersoon Жыл бұрын
I encountered such a thing on a school project with FPGA's. With an FPGA you allocate logic units (AND, OR etc.) to build your own logic circuit. Using a single division operation used like a thousand logic units, which is a lot. Instead we used an division algorithm which got it down to under 200. It probably used multiplication and bit shifting, but I'm not sure anymore.
@svenbtb9 ай бұрын
I was just reading about this in one of the Write Great Code books the other day, it's super interesting how difficult fractions actually are for computers.
@Paul-sj5db Жыл бұрын
I used to program assembler on the Acorn Archimedes which was the first commercially available computer with ARM4 chips. There was a really good algorithm for dividing two 32 bit numbers. It involved about 96 instructions, no loops, 3 instructions to set it up, 30 repeated identical sets of 3 instructions and then a final 3 instructions. Those chips didn't have a multiply instruction either so you'd have to multiply by 5 by shifting left 2 and adding to the original.
@RoseHayes-321 Жыл бұрын
I strongly recommend the book "Hacker's Delight" by Henry S. Warren, Jr. It's particularly valuable for anyone doing assembly language programming. I wish it had been around in the 80s when I started out.
@ShimonDanilov
Жыл бұрын
I also thought of that immediately, because I remember that in Hacker’s Delight there is this exact algorithm. Such a good book.
@nandomax3 Жыл бұрын
Next time I get lost in a lucid dream, I'll try to divide with an old chipset. If division is supported I'm dreaming
@amigalemming8 ай бұрын
Some architectures like the Motorola 56k DSP have an instruction for doing one step of a division. If you call it often enough it will yield the correct quotient, still sufficiently fast. This can be a good compromise between speed and saving chip space.
@quiteindeed68092 ай бұрын
For smaller division, I found good ol timesing the numerator by E^(whatever accuracy in decimal places you wanted), doing the division, then moving the decimal place to the left by E^(same as earlier number). I know, not as fancy, but it got the job done.
@maksymiliank5135 Жыл бұрын
That's actually a pretty cool solution. I thought you were going to talk about floating point division and how inaccurate it is, but this was also interesting to watch
@edgarbonet1 Жыл бұрын
Nice video, thanks! It is worth noting that, even when division is supported by the hardware, it can be quite slow. I once timed the basic floating point operations on a Cortex-A5, which has an FPU implementing the VFPv4-D16 instruction set. Multiplications turned out to be fast (one/two cycles in single/double precision respectively), but divisions took more than seven times longer than multiplications.
@zlcoolboy
Жыл бұрын
Shows that ARM was right to forego a division module.
@watchm4ker
Жыл бұрын
@@zlcoolboy That's still faster than doing it in software. This method only works because it's a known constant that can be approximated. Once you divide by a variable, you've got to use much more sophisticated approximations, which will likely take multiple times as long as a hardware divide unit.
@fk319fk Жыл бұрын
I think the interesting part of this was when I was coding floating point multiplication and division, I found the algorithm was identical! Where you had + in multiplication, you replace it with -. (not exactly, but surprisingly close, well, at the time)
@paradiselost9946 Жыл бұрын
This is the bit i love. Go deep, go to circuit level... half bit and full adders, shift registers... We have paper, pencils, and extra ways of accessing memory. A puter is stick with whatever figures are in its amu. It "sees" one digit at a time. It cant see the whole number, it cant go back and "carry the one"... Subtraction by addition and complements has always fascinated me...
@_dekinci Жыл бұрын
When we manually tried different division approaches on a computer engineering courses I was fascinated by how hard division acually is. It makes sense though - division is also harder to do on paper. P.S. there is also a square root
@akallio9000
Жыл бұрын
That's why they call it a TRIAL divisor, if it's too big you get to start over again.
@Mueller3D
Жыл бұрын
Division in binary is actually not that difficult, since the (appropriately shifted) divisor either does or doesn't go into the dividend (determined using subtraction). There are even shortcuts you can do to make this fairly efficient. However, it still takes a number of cycles equal to the quotient length.
@_dekinci
Жыл бұрын
@@Mueller3D yeah, but it's much less intuitive than multiplication. It's also longer algorithms with more steps and branching. In multiplication you just go with a naive approach. In division I don't think any approach can be considered naive
@dickheadrecs Жыл бұрын
this is the real reason america measures temperature in furlongs. before 2004 you needed to write out and mail your division operations in a postage paid envelope to the centre for general compumentrics and applied calculations. in most cases you could expect to receive a response in 7-10 business days
@daylen577Ай бұрын
The thing people often don't understand is that every operation eventually leads to an addition or a subtraction; multiplying is just adding the same number multiple times, and dividing is just looping through a bunch of potential numbers to see which one is closest.
@ambassadorkees Жыл бұрын
1985 I did this all the time: floating point in assembler (or even direct hex machine code 💪) on a MC6800. Graphics card, automatic adaptive lighting with integration calculus, programmable school bell clock, all applications on the same self designed and soldered Eurocard sized controller. Basic components a 6800 + PIA I(O for e.g. relays, numeric multiplexed keyboard and lux sensor.
@adailtonjn Жыл бұрын
Amazing content. It would also be interesting videos about division of non-constant numbers, the implementation of a floating point unit (showing how to add and multiply floats) and also the implementation of algorithms like sine and square root. All of that comparing C and assembly.
@SolomonUcko
Жыл бұрын
It looks like the library function __aeabi_idiv/__aeabi_uidiv is generally used on ARM to handle division by a non-constant denominator.
@alexc4924
Жыл бұрын
Binary long division isn't as complicated as you might think, though it's fiddly
@Ghoetic Жыл бұрын
I usually pre-scale a fraction a with an acceptable binary power, eg 2^8 giving (5/9)*256 = 142, and multiply with that then right shifting the result by 8 to fast divide.
@marcoronzani71975 ай бұрын
Yeah, fixed point is a very effective technique to get around divisions, unfortunately you can't always use it when high precision is required. Luckly the problem does not apply when you have an FPU. A very interesting alternative I suggest you to look into is CORDIC, it computes only 1 bit per iteration, but if your architecture is superscalar that is not a big issue. Furthermore I also read something about CORDIC and loop pipelining at a certain point, thus looking that up could help as well!
@wardyorgason Жыл бұрын
You can also calculate it using subtraction. IE you first multiply by five. Then count how many times you can subtract 9 before it crosses 0. I made the exact same thing in college using the LC3 which also doesn’t have a divide instruction.
@svenmorgenstern9506 Жыл бұрын
Welcome to 6502 assembler with more bits! 😱 Seriously, at the assembly level (especially with RISC type instruction sets) you end up really appreciating how much heavy lifting your compiler does for you. I've never written anything in ARM assembler, but this video takes me back. Thanks for sharing! 👍🍺
@markday3145
Жыл бұрын
One of my favorite optimizations had to do with converting timestamps from 32-bit number of seconds to separate year/month/day/weekday/hour/minute/seconds fields on a 6502. The typical approach involves multiple 32-bit divisions. I realized that most of those quotient bits were always zero. So I essentially unrolled those divisions and eliminated all of the trial subtractions that were guaranteed to result in a zero. It ended up needing a total of something like 34 subtractions, resulting in a speedup of 7X.
@foobar1500
Жыл бұрын
On systems with fixed instruction length even stuff like loading immediates (that is, constants) to registers is quite funky, even on symbolic assembler. On x86 with variable instruction length but no widely available barrel shifter logic available to instructions these are quite literal "move immediate" instructions, but on ARM, complicated constants either take several instructions to load (if not using constant islands), or generate constants in interesting ways...
@ArneChristianRosenfeldt
Жыл бұрын
JRISC in 1993 ( Atari Jaguar ) and SH2 in 1994 on Sega 32x and Saturn offered 32-bit division. It has some latency, but did not block the processor. Similar how quake did division on CISC Pentium in 1996 .
@capability-snob Жыл бұрын
This was a triumph! FWIW most x86_64 CPUs have floating point division hardware but integer division is implemented using micro ops for newton-raphson method. This bugs me to no end because you normally want high throughput for floating division, so you use n/r for that, but for ints it usually is a single division on the critical path. The difference is huge - unpipelined trial subtraction hardware has an 8 cycle typical latency, but n/r is 26 cycles.
@soonts
Жыл бұрын
On modern AMD CPUs, integer division only takes 2 micro-ops. Still slow though, for 64-bit integers the latency is up to 19 cycles.
@capability-snob
Жыл бұрын
@@soonts ooh, interesting! DIV and IDIV timings going back to Bulldozer appear to be for a slim hardware division unit with up to 18 cycles of latency. Intel are still in the high 20s though. Thanks, I've been team red for over a decade but somehow never noticed this.
@Carewolf
Жыл бұрын
@@soonts It takes a varied amount of time depending on the value you divide with. Anywhere from 2 to 30 cycles.
@torbjorngranlund6299
Жыл бұрын
Not true. Hardware division makes use of SRT, not Newton's method.
@lolerie
Жыл бұрын
@@capability-snob divq on ice lake is fast as hell. That is 128 bit division.
@thomquiri98609 ай бұрын
ok I'm pretty new to programmation (like I started this year) and this seemed like an interesting exercise. What I would do to divide by a variable without using the division operation would be just as we teach manual divisions in french elementary schools: -1: a loop checking if the number to divide (a) is greater than the number by which we divide (b). If it is then substract the number by which we divise. If it isn't then we're left with the int just bellow the actual answer. -2: if we need more precision, we do the same loop, but if a divideBy) { currentDigit++; toDivide = toDivide - divideBy; } cout >> currentDigit; if (i == 0) cout >> "."; currentDigit = currentDigit *10; } } Keep in mind I've never coded in C in my life, much less in assembly, so low level logic like that isn't really my thing, I'm sure this is inefficient af, but it was interesting to exercise my brain during these school holidays.
@LowLevelLearning
9 ай бұрын
this is a solid solution! and this is actually how the processor implements the divide instruction under the hood (with some optimizations here and there obviously) the problem is that this takes a TON of clock cycles, so sometimes fixed point division is just faster. good job!
@thomquiri9860
9 ай бұрын
@@LowLevelLearning thanks, is that actually how it work? Wow that's something I need to take into account in my future codes when I carelessly put divisions inside loops that are themselves inside loops O_o. That's what I find interesting with learning what the processor actually does when you add a single operation that seems so innocent, you actually come to understand WHY everyone teach do to do/avoid this or that. Because damn that looks like a very inefficient way to divide.
@LaurenGlenn Жыл бұрын
Very cool. Reminds me of the video I saw where they used some exponent or something in Doom / Quake to speed up calculation of 3D back in the day.... Gotta love the ingenuity of programmers stuck with resource constraints.
@elcaciquedev Жыл бұрын
The first 2 minutes of this video are the reason why they usually make you take higher level math classes in most computer science programs. The lower you go, the more theory you'll need to know. Of which I, myself, don't recall any, so here I am. Great video!
@bradhaines3142
Жыл бұрын
i mean all computer programming is math and translation. so some language and math classes would be all you need wouldnt it? something about how to learn new languages before you start actually learning to speak machine
@phenanrithe Жыл бұрын
The RISC philosophy, love it or hate it 😉 Yes, I saw the interesting multiply-to-divide trick in printf routines to display numbers, where it has to divide by 10 (or powers of 10). I was impressed, it was packed with such little techniques to save cycles (on x86_64).
@linuxization4205
Жыл бұрын
Literally the only CISC cpu that is halfway alive and not x86 is the Motorola 680x0 architecture.
@phenanrithe
Жыл бұрын
@@linuxization4205 It depends which categories you are willing to consider. There are microcontrollers and CPUs derived for ex. from the 6809, 8051 or 6502 and so on which are borderline or plain CISC. Though many designs, including the x86, only have CISC instructions but a RISC architecture, so it's not as straightforward as when the term was initially coined. 😀 I believe Intel started translating the instructions into micro-instructions in their Pentium series, otherwise they wouldn't have achieved the same level of performance. It was bold and brilliant. Nexgen, which was bought by AMD to save their failing K5 architecture, did the same back then.
@HappyBeezerStudios
Жыл бұрын
@@phenanrithe It was the Pentium Pro, (P6 or i686), which all modern x86 chips still identify as. So yeah, the big x86 CISC architecture uses RISC-like tricks under the hood and only displays itself as CISC on the outside for compatibility's sake. And that whole thing was also supposed to be replaced. After IA-16 and IA-32 should be IA-64, but that went to nowhere just like i960, i860 or i432. And obviously Itanium which, as P7 was somewhat supposed to be the successor in 98
@boptillyouflop
Жыл бұрын
@@HappyBeezerStudios Yeah... what happened was that IA-64/Itanium was a disaster, it tried to optimize all the wrong things and ended up being somehow SLOWER than x86 if you can believe such a thing.
@austinrosenbaum2848 Жыл бұрын
Nice video! I run into this when writing code for FPGAs. Usually there is lots of multiplier hardware, but rarely is there division hardware. Solution? Everything is a (fixed-point) multiply!
@Kruimeldief Жыл бұрын
Thank you for not making this video unnecessarily long like so many KZreadrs do (8+ minute videos with boring history and whatnot).
@m.hosseinmahmoodi Жыл бұрын
I remember when I first started learning to use assembly, I started coding for Game Boy. I decided I'll write a few simple functions to get a feel of assembly, so I wrote a function for Fibonacci sequence then I attempted to write a function for “n!” !!! Imagine my surprise when I found out that GBZ80(CPU of Game Boy) doesn't have any instruction for multiplication. I had to write a multiplication function using bit shifting. I have used a lot of different CPU's assembly and most of the very old ones (6502, 65C816, …) doesn't have a mul\div instruction; and old ones (8086, 8186) only have integer mul\div; and newer ones (from pentium until now) are very hard to program (if you want to make them fast, that is) because of SIMDs.
@Optimus6128 Жыл бұрын
Multiplication by the fixed point reciprocal is very common trick in retro coding. It might be even a faster way even if the hardware did have division. It's just surprising the ARM didn't have it even much later, when even the old 8088 had an integer division instruction.
@foobar1500
Жыл бұрын
Modern compilers convert integer division by a constant invariably to a multiplication and a (possibly implied) shift on all modern architectures. It is not so much about integer division instruction being slow although it's available, but rather that multiplication units have gotten quite fast in comparison to division implementations, well, actually a decade or two ago. Also, how often one really needs to perform an integer division by an arbitrary, non-predetermined value? I'd say that in large scheme of things, not really that often. I still believe that the fact that original 8086/8088 even had a "hardware" division instruction was mostly a question of code density on the time of real shortage of memory back in the seventies and easier entry to market for compilers of high level languages - which were not that fancy back then for microprocessors.
@Autotrope26 күн бұрын
I started this on the wrong foot, I had "just use a high level language" yelling in my head until I realized that looking at these from the low level is the point of the video and indeed the channel
@philmccavity5 ай бұрын
I learned about this in a pure math course but this gives another perspective. Well done!
@macmcleod1188 Жыл бұрын
5/9 is a constant. As an assembly language programmer, I'd do the math by hand once and then use that resulting value as a constant. Heck, I still do it in java today. But you have a good point where you are dividing variable values.
@FilipArlet
Жыл бұрын
Problem is 5/9 is floating point constant, you can multiply by that, but you need FPU (or something that handles floating points)
@abdox86 Жыл бұрын
I remember when I first hit by this wall, but mine was bigger, just imagine ( 8-bit processor, and no multiply instructions) It was a PIC mcu, and it almost boiled me to write a floating point 64-bit based division and multiplication macros, it was enjoyable and hard challenge to tackle.
@cpK054L
Жыл бұрын
PIC is trash, move to ARM master race.
@abdox86
Жыл бұрын
@@cpK054L design one and then call it trash, for studying fundamentals pic mcus are the best , ARM is more advanced and complex.
@cpK054L
Жыл бұрын
@@abdox86 I used PICs for undergraduate programs and for the price to performance ARM gave me a better result as a general purpose MCU
@LuealEythernddare5 ай бұрын
I don’t think I’ll go into learning assembly at any point. But I started programming in my 3rd year of college. I first switched over from music education when I learned I didn’t want to be a teacher. I first learned some basics in game maker, then the teacher taught us some simple Visual Basic (it was outdated frankly even back in 2019, but it was a good intro language imo, statically typed yet simple to grasp the syntax). From there the next classes taught C#, sql server, then some basic web dev with html, css, and php. I didn’t finish there, but went on to teach myself python, JavaScript, Ruby, and I’m even trying to pick up some C++ and Go. I transferred my credits to WGU and am trying to finish up my degree and get out there as a developer, and I love that you, primeagen, melkey, and others put great content out there and it helps me to keep myself motivated to finish.
@thepoliticalstartrek Жыл бұрын
Number 1 thing I learned in CPU design is early on CPUs at the logic level can do one form of math. Addition . It does everything with addition. It uses flags or registers to determine negative numbers. It also uses counters for multiplication and division.
@soberbard Жыл бұрын
i remember making batch scripts to automate certain things and needing floating point except batch doesnt support it... i ended up doing fpm after trial and error without knowing that it had an actual name cus i suck at maths. interesting to see that it has real uses outside of my crappy scripts 😭
@hoej Жыл бұрын
This brought me back to my time at uni when we had some somewhat beefy FPGA units. FPGA is basically programmable breadboard, only you can reprogram them on the fly (which NASA does all the time). You just got to remember that even though the functions are written sequentially they are executed simultaneously. Anyway, those beefy units could hold a whopping THREE division circuits. And then only a few gates were left for everything else. Our first unit did not have space for a single division circuit. It could calibrate and control a six axis gimble, but division was too much.
@Cooe.
Жыл бұрын
FPGA's have come a long damn way since you were in school lol. You can recreate entire mid-late 90's machines on modern FPGA's, and not even crazy high end FPGA's either. And not simple machines either, but stuff like the Nintendo N64 w/ its beefy 64-bit MIPS CPU + SGI GPU.
@hoej
Жыл бұрын
@@Cooe. this unit was a Cortex-M with FPGA around it. We used it for pipelining computer vision to identify edges via a modified Hough transform supporting splitting the image into 9 parts after converting input from VGA, then sending the results to the Cortex-M. Has it come further than that in 6 years?
@minhazulislam4682 Жыл бұрын
0:37 "I felt I was on my way to a quick LOW LEVEL victory" nice one.
@jasonknight1085 Жыл бұрын
As an old assembly hound, it's fun to see something like ARM. I don't do a lot of RISC stuff in ASM because of the old adage "RISC is for people who write compilers. CISC is for people who write programs" On the 8086 and 8088 we too couldn't rely on there even being a FPU, and our MUL and DIV statements were painfully slow -- in excess of 150 clocks EACH. An old trick for doing percentages in that case was to do a multiply very like this as a fraction of 65536. What made that "fun" was you ended up with the integer result in DX, and the numerator (65536 as the denominator) in AX. We were further hobbled by there being no MUL reg, immed ; assumes temperature in AX sub ax, 0x20 mov bx, 0x8E38 ; roughly 5/9ths 0x10000 mul bx ; DX = integer result, AX = numerator over 0x10000 Becuase it uses only the one multiply, it takes half the time that MUL 5 and DIV 9 would. Which is good when we're talking turning 300+ clocks into 150.
@customsongmaker
Жыл бұрын
So true, typical MUL reg
@ArneChristianRosenfeldt
Жыл бұрын
What are 8086 and 68000 even doing in all those cycles. The factors are 16 bit as is the ALU. This only makes sense to me if MUL and DIV were just an afterthought which had to be cheap. Indeed I have seen a large stretches of code without any of these instructions. This typical IBM-PC business code does not MUL or DIV. Maybe once to calculate taxes.
@jasonknight1085
Жыл бұрын
@@ArneChristianRosenfeldt A hardware multiply requires a lot... no, A LOT ... of silicon. We're talking as much as 20 times the silicon as an addition. To that end a lot of early processors left it out entirely, or implemented it in "microcode" In fact that's exactly what the 8086 / 8088 does, is use microcode. A small bit of code on the processor that pretends to be an opcode, that runs on the other instructions of the processor. Even with a hardware multiply the operation is intensive enough to require a lot of clock cycles as multiply and divide -- even just as integer -- are expensive operations. A mul reg16 taking anywhere from 25 to 40 clocks depending on CPU. (laugh is the 486 is slower than the 286 or 386), and thus it wasn't uncommon for stuff like vga video routines to still use shifts instead. ; ACCEPTS ; AX = Y ; CORRUPTS ; AX ; RETURNS ; DI = Y * 320 xchg ah, al ; == y * 256 mov di, ax shl ax, 2 ; == y * 64 add di, ax ; di == y * 320 Runs in about half the clock cycles of a hardware multiply right up through the first pentium. Though still not as fast as a XLAT
@ArneChristianRosenfeldt
Жыл бұрын
@@jasonknight1085 still does not explain why anybody would need more than 20 cycles in microcode. With a barrel shifter and some look ahead for zeros all those shift special cases would go away. Zero flag running from msb to lsb for early out (like in 386). Silicon usage goes up with the square of bits. Hence it makes sense to implement 8x8 as MUL ( costs like a 64 bit add ), and then run the microcode over this. 4 cycles for 68k like 16x16 -> 32. ARM7 in the GBA has hardware 32x8 and needs 4 cycles for 32x32.
@jasonknight1085
Жыл бұрын
@@ArneChristianRosenfeldt Do you realize just how much silicon an ARMv4 has that an 8086 does not? We're talking about a ~29,000 transistor and ~5000 gate processor here. That ARM7TDMI (the 7 is misleading, it's a v4) is what, 280,000 transistors and 90k or so gates? What you describe in silicon alone would be a fifth the die space the 8086 used. JUST for the multiple logic alone. Remember, this is an age where a DRAM memory access was 4 clocks per bus width (so a word was 8 clocks on the 8088), and even a simple byte shift by fixed one place took two clocks. Memory shift was 15 + and effective address calculation, and shift reg16 by CL was 8+4n. But yeah, once transistor counts grew these problems went away. By the time of the 80186 you're looking at 26 to 33 clocks, and the 286 it's 13 to 26 clocks. The 186 was 55,000 transistors, the 286 was 134,000. Also worthy of mention is the NEC V20 and V30, drop-in replacements for the 8088 and 8086 respectively, with timings closer to and in some places even faster than the 286. See its flat 21 clocks for all 8 bit multiplies and 26 clocks for 16 bit. It achieved this by having over 60,000 transistors, more than double the stock intel part. The type of infrastructure needed for the microcode you describe by the time you do both 8 bit and 16 bit mul and div is more silicon than many processors even had when the 8086 started being designed in 1976. That's no joke either, look at the 8080 with its whopping 4500 transistors-- and it was considered a chonky boy on release. The Z80 was around 8500, 6502 was 4500. And IF contemporary processors had a multiple opcode (many did not) they were inherently 8 bit with a 16 bit result (in two registers). There's a reason the Motorola 68000 with its amazing fast hardware multiply via its 16 bit ALU needed 68000 transistors (hence its name), an absurd number for the time that took five years to even come close to looking viable outside of big iron use... because manufacturing processes for that many transistors reduced yields resulting in elevated prices.
@_--_--_ Жыл бұрын
Also important to know is that both integer divide and FPU divide are *very* slow even on architectures that supports it in hardware. An integer divide is up to 12 cycles on Cortex M4 and 10-50 cycles on modern X86-64 for example, Skylake needed like 40-100 cycles for integer div just a couple generations ago. FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4. Also noteworthy is that even architectures such as M0/M0+ can implement external division hardware, such as the case for the RP2040 using M0+ architecture which integer division hardware needs only 8 cycles and therefore is actually faster than standard Cortex M4 for integer division.
@ewenchan1239
Жыл бұрын
If integer ADD/SUB/MUL is so fast, then why isn't it used everywhere?
@_--_--_
Жыл бұрын
@@ewenchan1239 Can you restate your question, as it doesnt really make any sense. Are you asking why a div instruction isnt substituted with a combination of add/sub/mul instructions? They are, the compiler will try to replace a divide either by a shift, a combination of muls and shifts or adds and shifts, depending on if your target supports mul instructions. However this only works if the divisor is a compile time constant. For other cases the compiler most likely will use div instruction unless you tell it otherwise. For example you could use libdivide instead, which can be faster than hardware divide on a lot of targets. Or you could tell it to use its default inbuild algorithm, which most likely will be quite a bit slower than a hardware divider *if* the compiler cant optimize it down, which in a lot of cases it can.
@ewenchan1239
Жыл бұрын
@@_--_--_ "Can you restate your question, as it doesnt really make any sense." You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4." If your statement above is true, then why isn't it used in lieu of using the ACTUAL divide instruction, ALL the time? (You've answered it for the most part in the second half of your reply above.) But for said second half of your reply to be true, you would have to know: a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work) and b) you explicitly tell the compiler what to do/how to handle the DIV instructions, which typically means that you would have to know the specific purpose or application that you are programming for, AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient. (i.e. you're programming for yourself rather than for someone else to use where the importance of the epsilon value is not known nor what someone else's tolerance limit for epsilon is)
@_--_--_
Жыл бұрын
@@ewenchan1239 "You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4."" I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point. As far as I am aware we are talking about integer arithmetics specifically. "a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)" Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works". Also its often a good idea to rethink your code, often a lot of divisions are unnecessary (especially with floating point, but we are talking integer here), the compiler cant do this for you most of the time. This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance. "which typically means that you would have to know the specific purpose or application that you are programming for" As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion. "AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient." I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion. All compiler optimizations or replacement software algorithms will give the exact same output as the hardware divider.
@ewenchan1239
Жыл бұрын
@@_--_--_ "I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point." But you also the "cost" (in terms of the number of operations) it takes for FP ADD/SUB and FP MUL on said Cortex M4 as well. "FPU divide on Cortex M4 is 14 cycles." "As far as I am aware we are talking about integer arithmetics specifically." My understanding is that we can be talking about both (given that you can represent floating point numbers as a sum-product of pairs of two integers, can you not? (i.e. the approximation for 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1*2^-6 ~= 0.109375) And isn't the mul/shift - it's doing floating point math with integers, isn't it? (unless I am missing something here and/or misunderstanding what @Low Level Learning is talking about here, in this video) "Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works"." The "won't work" here refers to the substitution or the replacement of the slower div operation with the faster combination of other instructions. That's the "work" that you were talking about it doing. Therefore; pursuant to your statement, if you aren't dividing by a compile time constant, then said replacement/substitution won't take place - i.e. the "work" that I am referring to, based on what you wrote, won't take place. Ergo; "won't 'work'" (div operation won't get replaced by "combination of other instructions" (which should help to perform the division faster without using the div operation/instruction).) "often a lot of divisions are unnecessary" Varies, I would think. "This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance." Again, I think that it varies. (depends on what you are doing/why you are using division in the first place) re: mistakes I wouldn't necessarily call using divisions a mistake. I agree that this is where a skilled programmer comes in, but if you are learning or you are an unskilled programmer, then that's something that they will have to learn. Take for example, the inversion of 1/(1/9). If you use exact arithmetic, your E(X) is 9. If you approximate (1/9) ~= 1*2^-4 + 1*2^-5 + 1*2^-6 ~= 0.109375, then 1/0.109375 = 9.142857142857142.... Therefore; your epsilon would be 9.142857142857142... - 9 = 0.142857142857142.... or about 1.6% error. And therefore; the question that you would have to ask yourself is "is that 1.6% error significant or not?" In some cases, it might not be. In other cases, it can be VERY significant. It really depends on what you are doing. (Note that often times, if you are replacing a division operation, I have yet to see someone who advocates for the replacement or substitution of the div operations with a combination of other operations - they don't automatically talk about (nor calculate) their epsilon and let the user know what that epsilon is that arises from said replacement/substitution. I think that if you are going to implement such a kind of replacement (of the div operation) with a combination of other operations, where and whenever possible, you SHOULD be REQUIRED to tell the user what your epsilon is, when appropriate.) "As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion." That's an assumption on your part though. There is nothing that explicitly states this. As I said, it is important for people NOT to have it as a key takeaway from watching this video that they are going to implement this everywhere they see because they see it as a speed/performance improvement whilst they fail to recognise that this speed/performance improvement comes at the expense of precision and accuracy. Even for assembly code (cf. GROMACS/Folding@Home), maintaining the precision and the accuracy can still be vital for the program to perform the calculations that is asked of it properly and correctly. "I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion." The replacement of a division operation with a combination of other operations is almost always an approximation of the actual division operation. (cf. the example that @Low Level Learning shows above with respect to the approximation of 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1 * 2^-6 ~= 0.109375) That IS an approximation. That IS what he did and that IS what is shown.
@embeddedroom Жыл бұрын
True story. Encounter this frequently when working with code generators! Good job!
@LowLevelLearning
Жыл бұрын
:D
@nekomakhea94405 ай бұрын
If you want to multiply/divide arbitrary numbers provided at runtime (rather than hard-coded constants) without hardware multiply/divide you can use the CORDIC algorithm, which only requires add, subtract, and shift instructions and a small lookup table. As a bonus, it can also do trig functions, hyperbolic functions, square roots, exponentiation, and logarithms too.
@janniklasbertram943610 ай бұрын
Hey Nice video. One small note, when I had to do a larger assembly project, I would use precompiler instructions to name as much as I could in my assembly code, especially registers. Made for much more readable code. Maybe this could add a bit to your code in the videos, since viewers spent very little time reading the code, they (I) may grasp the meaning of the code way quicker if they know what the registers mean
@QuantenMagier Жыл бұрын
It's called RISC (Reduced Instruction Set Computing) for a reason, if you want hardware division operations use a CISC (Complex Inst..) processor or a mathematical co-processor. ;)
@GeorgeTsiros
Жыл бұрын
RISC does not mean that it won't have certain instructions, or that it lacks functionality, or that it is a _limited_ instruction set computer. It just means load/store arch, orthogonal reg file and instr set.
@QuantenMagier
Жыл бұрын
@@GeorgeTsiros Well it's not limited it's just reduced, but it still is general purpose. ;) The idea behind RISC was to improve the efficiency per transistor count and a division unit would mean more transistors with almost no benefit regarding the most programs.
@user-fc8xw4fi5v Жыл бұрын
ah yes, fixed point multiplication with bit shifting! i had to do this once to use trigonometric functions in a Gameboy Advance game i was writing; those processors had no builtin floating point instructions either! cool stuff
@turboballer2 ай бұрын
I've always referenced the C to F calculation as (C * 9/5) + 32. In which case, you can simply turn the 9/5 division operation into a multiplication of 1.8 instead. Seems to work fine.
@MaddHatter Жыл бұрын
A long while back I had to do the same thing... But the complier I had was Garbage and implements successive subtraction.... So I had to speed it up with my Jr in college electrical engineering knowledge... My method was to divide by 9 and do mod in 1 function then multiply the remainder by 5 then re-divide the remainder by 9. This reduced loop iterations by a substantial amount. My point is... Some times turning 2 simple but long operations into more steps can really speed things up..
@jesseparrish1993 Жыл бұрын
That's why I stick to multiplication by inverses
@LowLevelLearning
Жыл бұрын
So much easier
@jesseparrish1993
Жыл бұрын
@@LowLevelLearning Obviously I posted a witless comment as quickly as I could to get attention. Thanks for the content!
@LowLevelLearning
Жыл бұрын
Anytime man
@jlewwis1995 Жыл бұрын
For the multiplication part you could optimize it even further by removing multiplication entirely and just storing f-32 in a register then bit shifting left by 2 followed by adding the f-32 which would be the same as multiplying by 5. The the divide by 9 is way tougher to optimize though because it seems like you can't to the same with right bit shifts unless I'm missing something :/ Also I guess this would only work if you wanted the whole integer part of the result, if you wanted some decimal places too I guess you would need to use fixed point
@jhgvvetyjj6589
Жыл бұрын
Why not combine multiplication by 5 and multiplication by 1÷9 (477218589÷4294967296) to do multiplication by 5÷9 (2386092943÷4294967296)
@computationaltrinitarianis1451
Жыл бұрын
@@jhgvvetyjj6589 because division in this case is floor division which is not associative (x*y)/z != x*(y/z). For example (2*3)/4 = 6/4 = 1 != 0 = 2*0 = 2*(3/4). EDIT: Ah, I see that is not what you are asking now.
@vmarcelo49
Жыл бұрын
Damn you guys are so nerd
@alexc4924
Жыл бұрын
this is how binary multiplication already works!
@cpK054L
Жыл бұрын
Wouldn't your register just end up trashing your bits if you go beyond scope? xxxxooooxxxxoooo >> 6 bits for example, would you not end up with 000000111100? then if you were shift left it'd come back as zeros?
@firiasu Жыл бұрын
Behind this is really interesting integer arithmetic, in which you need to solve an inequality with integer and fractional parts!
@davidgrisez Жыл бұрын
I remember reading that a historic computer called the IBM 1620 that came out around 1960 did math by using lookup tables. On this machine a divide instruction was an option at extra cost. It does not surprise me that there are processors that do not have divide instructions. Another way of handling this was a routine that performed division by using multiple subtraction.
@thisisreallyme3130 Жыл бұрын
YES - more of this please! I’m learning 6502 (cc65 but no stdlib lib emulation), and this kind of trickery totally relates. 😊
@54egg
Жыл бұрын
Lookup tables are your friend on 6502 where practical. Check this video which has a link to page (as I recall) that shows how to do 8x8 bit multiply on 6502 with 4 table lookups, 2 8 bit add/subtracts, and a 16 but subtract in under 40 clock cycles, kzread.info/dash/bejne/iXmgy6d6k5bAqaQ.html
@jhgvvetyjj6589 Жыл бұрын
Meanwhile x86 has native sine cosine and tangent and can convert between extended precision binary floating point and 18 digit decimal integers
@andrewdunbar828
Жыл бұрын
680x0 had those too in the FPU for '020 and '030, but they removed them again for the '040 because they're slow even in hardware and add so much complexity and use up so much silicon.
@johncampbell4389 Жыл бұрын
I ran across this issue when working with the UNIVAC 1100/80 and the DataWest Array Processor; When i spoke to an engineer i learned that division has always been a problem and the successive approximation used has a tendency to make the timing non-deterministic ... you know, it depends. The other simple database operation with a non-deterministic timing in INSERT...
@svenprigann2631 Жыл бұрын
I am following your videos quite a while now. Me, holding an masters degree in eletrical engineering and quite a dubious amount programming M0,M0+,M3,M7 ...still amazed by the simplicity of your videos about complicated topics. Keep up the great work =)
@a_soulspark Жыл бұрын
very interesting video! I'm curious, would it be more efficient to use the libc floating point library or the fixed point method you used?
@LowLevelLearning
Жыл бұрын
For just this example the fixed point will definitely be faster. Converting the number to a float, sending it to the FPU, and then bringing it back will take way more cycles than what we’ve done here. The trade off is accuracy though.
@andrewdunbar828
Жыл бұрын
The fixed point method will be fastest when diving by a constant. The FP library will be best in the general case.
@SolomonUcko
Жыл бұрын
Specialized integer division functions (__aeabi_idiv/__aeabi_uidiv) are probably faster than floating-point division, I would assume
@richardericlope3341
Жыл бұрын
Some toolchains emulate those calls. Not great if you want ot fast.
@CallousCoder Жыл бұрын
Been there done that on 6502, z80 a whole lot. And fixed point is the easiest and fastest approach. Also many CPUs don’t have multiply, they are both mathematical monsters 🤪
@gdclemo
Жыл бұрын
One method I've seen on 6502 is to use a lookup table of squares; (x+y)^2 - (x-y)^2 == (x*y*4). Useful for 8-bit * 8-bit -> 16-bit multiplies. You can do the divide by four with shifts or just build it into the table.
@jasonyesmarc309 Жыл бұрын
I think you've just solved my engineering problem. Thank you!