Making Golang 13x faster with Assembly code

One of the coolest parts of Go (golang) is that there are many ways to speed up your program. One of them is the ability to write .s and .asm assembly files that are compiled directly into your program. In this video I go over what I did in my Golang Vulkan game engine to improve the performance of the linear algebra math. By taking advantage of SIMD (AVX) instructions, we can speed up some functions by nearly 13x. SIMD stands for "single instruction, multiple data" and is a key capability missing from the standard Go compiler. Of course, Go's built-in assembly support can improve performance and reach otherwise inaccessible CPU instructions for much more than vectorized operations, but vectorization is probably the most common reason people drop into assembly.
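For context, here is a minimal sketch of the kind of scalar routine that SIMD assembly typically replaces. It is not taken from the engine: the `Mat4` type, its row-major layout, and the `MulScalar` name are assumptions made for illustration only.

```go
package main

import "fmt"

// Mat4 is a hypothetical row-major 4x4 float32 matrix (illustrative only,
// not the engine's actual type).
type Mat4 [16]float32

// MulScalar computes a*b one element at a time: 64 multiplies and 48 adds,
// each issued as a separate scalar operation.
func MulScalar(a, b Mat4) Mat4 {
	var out Mat4
	for r := 0; r < 4; r++ {
		for c := 0; c < 4; c++ {
			var sum float32
			for k := 0; k < 4; k++ {
				sum += a[r*4+k] * b[k*4+c]
			}
			out[r*4+c] = sum
		}
	}
	return out
}

func main() {
	identity := Mat4{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1}
	m := Mat4{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
	// Multiplying by the identity should leave m unchanged.
	fmt.Println(MulScalar(m, identity) == m)
}
```

A SIMD version processes four of these float32 lanes per instruction, which is where a large speedup can come from.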
Go assembly file ► github.com/KaijuEngine/kaiju/...
Twitter ► brentfarris.com/twitter
Website ► brentfarris.com
GitHub ► brentfarris.com/github

Comments: 89

  • @user-tw2kr6hg4r
    @user-tw2kr6hg4r 20 days ago

    you know it's serious computer engineering when the source code is printed on a sheet of paper

  • @lozyodella4178

    @lozyodella4178

    18 days ago

    😂😂😂

  • @w花b

    @w花b

    18 days ago

    First one I've seen like that was Ben Eater, but this one even has colors, that's next level

  • @araz911

    @araz911

    18 days ago

    @w花b For syntax highlighting, it's paper from 2024, ok?...!!

  • @gregandark8571

    @gregandark8571

    16 days ago

    Always has been.

  • @AK-vx4dy
    @AK-vx4dy 19 days ago

    In assembly you have only one layer of abstraction... paper 😅

  • @tubbystubby
    @tubbystubby 12 days ago

    I started go half a year ago and have been enjoying it a lot. This was awesome, learned a lot. Thanks for great content. You know you are getting the juiciest stuff if it's on paper.

  • @mrrolandlawrence
    @mrrolandlawrence 14 days ago

    wow love this. i used to be an ARM programmer many many years ago. back in those days you really had to optimise code for the number of cpu cycles needed. sophie wilson really made ARM instructions a doddle to use.

  • @user-tw2kr6hg4r
    @user-tw2kr6hg4r 20 days ago

    matrix multiplication in primary school?

  • @BrentFarris

    @BrentFarris

    20 days ago

    One year, you're learning to read books without pictures. The next, you're calculating the cross product on a 4 dimensional matrix. Then you go to middle school, learn about girls, forget it all, and have to relearn it in pre-calc.

  • @Paul-zh2jp

    @Paul-zh2jp

    19 days ago

    this is what i came to comment lol

  • @w花b

    @w花b

    18 days ago

    @BrentFarris Don't have to forget it if no girls approach you. That's a win in my book.

  • @MEMUNDOLOL
    @MEMUNDOLOL 17 days ago

    A link to this video could possibly be the best answer to the question "Should I build my own engine or use a ready-made one?"

  • @grimquokka9843
    @grimquokka9843 16 days ago

    This is a good idea you came up with, and I also appreciate the explanation on paper. Please keep up with these videos, sir.

  • @mr.daniish
    @mr.daniish 13 days ago

    This is some serious knowledge! More of these please

  • @MaxPicAxe
    @MaxPicAxe 14 days ago

    That's nice how, for four floats, the 2 bits for src, 2 bits for dst and 4 bits for bitmask conveniently fit into exactly a byte. The next convenient number of floats with this pattern appears to be a very large number, where convenient means when the amount of space src,dst,bitmask take up in bits is a power of 2.
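The byte layout described above can be sketched in Go. This is an emulation for illustration, not code from the video: `InsertPSImm` and `InsertPS` are hypothetical helper names. The bit positions follow the INSERTPS immediate encoding (bits 7:6 select the source lane, bits 5:4 the destination lane, bits 3:0 the zero mask).

```go
package main

import "fmt"

// InsertPSImm packs the INSERTPS immediate byte: bits 7:6 pick the source
// lane, bits 5:4 pick the destination lane, bits 3:0 are the zero mask.
// (Hypothetical helper name, for illustration.)
func InsertPSImm(srcLane, dstLane, zeroMask uint8) uint8 {
	return (srcLane&3)<<6 | (dstLane&3)<<4 | zeroMask&0xF
}

// InsertPS emulates the register-to-register form in plain Go: copy one
// source lane into one destination lane, then zero the masked lanes.
func InsertPS(dst, src [4]float32, imm uint8) [4]float32 {
	dst[(imm>>4)&3] = src[(imm>>6)&3]
	for lane := uint(0); lane < 4; lane++ {
		if imm&(uint8(1)<<lane) != 0 {
			dst[lane] = 0
		}
	}
	return dst
}

func main() {
	imm := InsertPSImm(2, 1, 0) // copy src lane 2 into dst lane 1, zero nothing
	dst := [4]float32{1, 2, 3, 4}
	src := [4]float32{10, 20, 30, 40}
	fmt.Printf("imm8 = %#02x, result = %v\n", imm, InsertPS(dst, src, imm))
}
```

Two bits are enough for each lane index because a 128-bit register holds exactly four float32 lanes, and the four-bit mask has one bit per lane.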

  • @BrentFarris

    @BrentFarris

    13 days ago

    You can also operate on doubles, but half as many due to using the same space.

  • @crowlsyong
    @crowlsyong 20 days ago

    What a supernatural gift to the world

  • @lufsss_
    @lufsss_ 20 days ago

    What a supernatural explanation

  • @QW3RTYUU
    @QW3RTYUU 6 days ago

    Ben Eater vibes this gives me. Thanks for the video!

  • 17 days ago

    Love that you did it with go. It's just such a clean language.

  • @BrentFarris

    @BrentFarris

    17 days ago

    I picked up Go after learning that Ken Thompson helped design it. Slices and goroutine/channels are awesome

  • @shurizzle

    @shurizzle

    11 days ago

    @BrentFarris Goroutines/channels come from Plan 9, as does that ASM syntax. After all, Rob Pike is behind both Golang and Plan 9.

  • @joeybasile1572
    @joeybasile1572 14 days ago

    Thanks dude. Informative. Good presentation.

  • @baxiry.
    @baxiry. 21 days ago

    What a supernatural topic

  • @sanderbos4243
    @sanderbos4243 11 days ago

    Extremely good explanation

  • @treelibrarian7618
    @treelibrarian7618 18 days ago

    Just thought you might be interested that the pack operation done with 16 insertps's (16p23, 16p5 ops) may instead be done as an in-register matrix transpose using unpckhps/unpcklps (4p23, 8p5 ops) in half the time. I'm not familiar with golang's inline asm so I'll use Intel asm instead, I'm sure you'll be able to translate:

        vmovups   xmm1, [rbp + start]        ; a3a2a1a0
        vmovups   xmm2, [rbp + start + 16]   ; b3b2b1b0
        vmovups   xmm3, [rbp + start + 32]   ; c3c2c1c0
        vmovups   xmm4, [rbp + start + 48]   ; d3d2d1d0
        vunpckhps xmm5, xmm2, xmm4           ; d3b3d2b2
        vunpcklps xmm4, xmm2, xmm4           ; d1b1d0b0
        vunpcklps xmm2, xmm1, xmm3           ; c1a1c0a0
        vunpckhps xmm3, xmm1, xmm3           ; c3a3c2a2
        vunpcklps xmm1, xmm2, xmm4           ; d0c0b0a0
        vunpckhps xmm2, xmm2, xmm4           ; d1c1b1a1
        vunpckhps xmm4, xmm3, xmm5           ; d3c3b3a3
        vunpcklps xmm3, xmm3, xmm5           ; d2c2b2a2

    This has the advantage that whether you use xmm, ymm, or zmm registers, it's still 8 unpack ops to do 1, 2, or 4 4x4 matrix transposes. This formula uses one extra register and produces the result in the same order in the registers as your insertps-based version.

    Edit: realized a couple of days later that I used the AVX 3-operand versions of the instructions, not the SSE1 2-operand versions, so I've added the V's. It's not so pretty if you can't use them, since every output has to be copied first and the operand ordering is inconvenient too, so it doesn't fit in 5 registers any more...
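To make the unpack-based transpose easier to follow, here is a sketch that emulates the same two-stage interleave in plain Go. The `Vec4` type and the `unpackLo`, `unpackHi`, and `Transpose4x4` names are assumptions for illustration; lane 0 is the lowest element, which is the rightmost one in the asm comments above.

```go
package main

import "fmt"

// Vec4 models one 128-bit register holding four float32 lanes (lane 0 lowest).
type Vec4 [4]float32

// unpackLo mirrors unpcklps: interleave the low halves of x and y.
func unpackLo(x, y Vec4) Vec4 { return Vec4{x[0], y[0], x[1], y[1]} }

// unpackHi mirrors unpckhps: interleave the high halves of x and y.
func unpackHi(x, y Vec4) Vec4 { return Vec4{x[2], y[2], x[3], y[3]} }

// Transpose4x4 transposes rows a, b, c, d using 8 unpack steps.
func Transpose4x4(a, b, c, d Vec4) (Vec4, Vec4, Vec4, Vec4) {
	t0 := unpackLo(a, c) // a0 c0 a1 c1
	t1 := unpackHi(a, c) // a2 c2 a3 c3
	t2 := unpackLo(b, d) // b0 d0 b1 d1
	t3 := unpackHi(b, d) // b2 d2 b3 d3
	return unpackLo(t0, t2), // a0 b0 c0 d0
		unpackHi(t0, t2), // a1 b1 c1 d1
		unpackLo(t1, t3), // a2 b2 c2 d2
		unpackHi(t1, t3) // a3 b3 c3 d3
}

func main() {
	r0, r1, r2, r3 := Transpose4x4(
		Vec4{1, 2, 3, 4},
		Vec4{5, 6, 7, 8},
		Vec4{9, 10, 11, 12},
		Vec4{13, 14, 15, 16},
	)
	fmt.Println(r0, r1, r2, r3)
}
```

The first stage pairs rows a with c and b with d; the second stage interleaves those pairs, so each output register ends up holding one column of the input.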

  • @lozyodella4178

    @lozyodella4178

    18 days ago

    Is this the language of Gods?

  • @stercorarius

    @stercorarius

    17 days ago

    @lozyodella4178 nah that's lisp

  • @treelibrarian7618

    @treelibrarian7618

    15 days ago

    A further thought for you: perhaps the transpose is entirely unneeded anyway. This code here does the 4x4 matrix multiply without it. See the inline comments for functional details. I adapted this from a 16x16 AVX-512 version where the macro was just 16 FMA instructions with inline broadcast loading the elements of A directly. Here pshufd is used as the SSE2 broadcast equivalent (two-operand shufps would take its low two lanes from the destination register), and the multiplies and adds are separated.

        ;; 4x4 matrix multiply
        ;; A is the matrix that is scanned horizontally,
        ;; B is the matrix to be scanned vertically.
        ;; output to O
        %macro domatrixrowSSE 0
            pshufd xmm0, xmm3, 0      ; broadcast first element of the A row in xmm3
            mulps  xmm0, xmm4         ; multiply whole first row of B
            pshufd xmm1, xmm3, 0x55   ; bcast second element of the A row
            mulps  xmm1, xmm5         ; multiply by second row of B
            addps  xmm0, xmm1         ; add to first result
            pshufd xmm1, xmm3, 0xaa   ; e3 of the A row
            mulps  xmm1, xmm6         ; mult B row 3
            addps  xmm0, xmm1         ; add
            pshufd xmm1, xmm3, 0xff   ; e4 of the A row
            mulps  xmm1, xmm7         ; mult B row 4
            addps  xmm0, xmm1         ; last add
        %endmacro

        multiply4x4function:
            ; this is not complete: replace the tokens of a, b and o
            ; with whatever you have those pointers in.
            ; Can be used as a base for larger matrix multiplies
            ; if you load the prior output content before adding all 4 lines
            ; and change 16/32/48 to 1/2/3x row length in bytes,
            ; and a/b/o point to the relevant parts of the input/output matrices.
            movups xmm4, [b + 0]      ; load whole b matrix
            movups xmm5, [b + 16]
            movups xmm6, [b + 32]
            movups xmm7, [b + 48]
            movups xmm3, [a + 0]      ; load first row of A matrix
            domatrixrowSSE            ; the macro multiplies one row of A by 4 columns of B
            movups [o + 0], xmm0      ; store results to first row of output matrix
            ; e1r1O =        : e2r1O =        : e3r1O =        : e4r1O =
            ;   e1r1A*e1r1B  :   e1r1A*e2r1B  :   e1r1A*e3r1B  :   e1r1A*e4r1B
            ; + e2r1A*e1r2B  : + e2r1A*e2r2B  : + e2r1A*e3r2B  : + e2r1A*e4r2B
            ; + e3r1A*e1r3B  : + e3r1A*e2r3B  : + e3r1A*e3r3B  : + e3r1A*e4r3B
            ; + e4r1A*e1r4B  : + e4r1A*e2r4B  : + e4r1A*e3r4B  : + e4r1A*e4r4B
            movups xmm3, [a + 16]     ; load second row of A
            domatrixrowSSE
            movups [o + 16], xmm0     ; store to second row of O
            movups xmm3, [a + 32]     ; third row of A
            domatrixrowSSE
            movups [o + 32], xmm0     ; to third row of O
            movups xmm3, [a + 48]     ; 4th row of A
            domatrixrowSSE
            movups [o + 48], xmm0     ; to 4th row of O
        ;; 28p01, 16p5, 8r4w. 16 cycles/matrix on Ice Lake, 28c/matrix on older CPUs with only 1 vfp port (e.g. Sandy Bridge)
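The shuffle immediates used for broadcasting above (0x00, 0x55, 0xaa, 0xff) each repeat one 2-bit lane index four times. The following Go sketch emulates the lane selection so the pattern can be checked by hand. The `Vec4`, `Shufps`, and `Pshufd` names are illustrative stand-ins, not real Go APIs. Note that the two-operand SSE `shufps` takes its two low result lanes from the destination register, so it only acts as a broadcast when destination and source hold the same value, whereas `pshufd` reads every lane from the source.

```go
package main

import "fmt"

// Vec4 models a 128-bit register of four float32 lanes (lane 0 lowest).
type Vec4 [4]float32

// Shufps emulates two-operand SSE "shufps dst, src, imm":
// result lanes 0-1 are picked from dst, lanes 2-3 from src.
func Shufps(dst, src Vec4, imm uint8) Vec4 {
	return Vec4{dst[imm&3], dst[(imm>>2)&3], src[(imm>>4)&3], src[(imm>>6)&3]}
}

// Pshufd emulates SSE2 "pshufd dst, src, imm": every lane is picked from src.
func Pshufd(src Vec4, imm uint8) Vec4 {
	return Vec4{src[imm&3], src[(imm>>2)&3], src[(imm>>4)&3], src[(imm>>6)&3]}
}

func main() {
	row := Vec4{1, 2, 3, 4}
	for _, imm := range []uint8{0x00, 0x55, 0xaa, 0xff} {
		// Each immediate broadcasts one lane of row across the register.
		fmt.Printf("%#02x -> %v\n", imm, Pshufd(row, imm))
	}
	// With a different destination, shufps mixes in stale destination lanes.
	stale := Vec4{9, 9, 9, 9}
	fmt.Println(Shufps(stale, row, 0x00))
}
```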

  • @shappertallw

    @shappertallw

    4 days ago

    @treelibrarian7618 this is insane, I never thought I would see the day where someone cold rolled asm, with SSE instructions no less, in a YT comments section. Props

  • @treelibrarian7618

    @treelibrarian7618

    3 days ago

    @shappertallw it's a hobby of mine: I've done it before and I'll probably do it again. I think I might have scared one or two YouTubers away from posting asm-related videos, which was not my intention. I really should be making videos myself...

  • @Antonio-yy2ec
    @Antonio-yy2ec 18 days ago

    Pure gold!!

  • @sirbumblefuck
    @sirbumblefuck 20 days ago

    What a supernatural way of explaining

  • @kira.herself
    @kira.herself 20 days ago

    What a supernatural video

  • @timofeysobolev7498
    @timofeysobolev7498 14 days ago

    Great video!)

  • @blockshift758
    @blockshift758 17 days ago

    I always see comments like "matrix math in middle/high school?!" on videos like this, and laugh to myself because I remember we did it in elementary school (grades 4-6).

  • @Decastyled
    @Decastyled 19 days ago

    'Cause you're a supernatural A beating heart of stone You gotta be so cold To make it in this world Yeah, you're a supernatural Living your life cutthroat You gotta be so cold Yeah, you're a supernatural

  • @hyprland
    @hyprland 20 days ago

    What a super nature

  • @Caellyan
    @Caellyan 17 days ago

    What about using something like volk (vector optimized library of kernels)? Is Go FFI slow?

  • @BrentFarris

    @BrentFarris

    17 days ago

    You likely can without issues. You may have to benchmark it, though, because you do have to pay the small cost of swapping stacks. Go's stack is built lean for goroutines, so it has to swap to a C-compatible stack to call C.

  • @--bountyhunter--
    @--bountyhunter-- 19 days ago

    what a natural super

  • @hz8711
    @hz8711 12 minutes ago

    I am missing too many things to understand this. Can someone explain it in a few sentences, at least what the idea is? Thanks!

  • @hulakdar
    @hulakdar 19 days ago

    Is there no way to natively emit vector instructions in Go? If that is true, then that is quite unfortunate. Isn't it easier to write those functions in C and link with them instead of writing out assembly?

  • @BrentFarris

    @BrentFarris

    19 days ago

    Not at the moment in Go directly. You have a few options:
    1. Write the assembly directly, as we did here (fastest execution).
    2. Write vectorized assembly instructions as their own functions (similar to C) and use them at a higher level, but you'll need to take care to follow the calling conventions so you don't clobber your asm work.
    3. Use the C vectorization library functions and call into C. This will have the tiny overhead of swapping stacks, though.

  • @gregandark8571

    @gregandark8571

    18 days ago

    @BrentFarris Go is a bullshit language exactly because of such technical gaps :(

  • @maximus1172
    @maximus1172 15 days ago

    very cool!!, you should also try making the engine in rust

  • @BrentFarris

    @BrentFarris

    15 days ago

    One day, I may. I enjoy trying out languages, and game frameworks/engines tend to be my testbed. Either as the core engine code or as a scripting language depending on the nature of the language.

  • @TheCyberBully420
    @TheCyberBully420 10 days ago

    You made an engine with Vulkan or you made something similar to Vulkan??

  • @BrentFarris

    @BrentFarris

    10 days ago

    Using Vulkan, I've made engines in C, C++, and Go. It has a pretty nice and straightforward structure once you get a handle on it.

  • @MrTomyCJ
    @MrTomyCJ 18 days ago

    There is a flaw in the system: I can deduce from the comments that I should reply something supernatural without having watched the entire video. The next time you'll have to provide a function to determine what to comment instead of a phrase, so that the appropriate comment can't be deduced from the comments. You got me to comment anyway though.

  • @iant9053
    @iant9053 18 days ago

    Holy, If you had to learn everything from scratch, in what order would you learn your langs? just starting with C, thx wizard

  • @BrentFarris

    @BrentFarris

    17 days ago

    I would learn C if I had to go from scratch. It's just high level enough to do huge projects and just low level enough to teach you how computers work internally. I learned C++ as my first language, but I wish it were C.

  • @Onyx-it8gk
    @Onyx-it8gk 19 days ago

    Neat video! If you have this much programming knowledge and skill, I think you'd really appreciate Vale. It's a new language that takes a novel approach to memory management without a GC. It borrows concepts from many languages like Rust, Cyclone, Pony and Forty2.

  • @BrentFarris

    @BrentFarris

    19 days ago

    Thanks! I'll have to check it out, I have a lot of fun trying out different languages. There have been a lot of languages popping up lately. It's so hard to keep up, haha

  • @Onyx-it8gk

    @Onyx-it8gk

    19 days ago

    @BrentFarris I know what you mean! I'm sure someone such as yourself has a very long list of things to check out with not enough time in the day!

  • @cvabds
    @cvabds 19 days ago

    How much do you want to create a game engine for TempleOS?

  • @BrentFarris

    @BrentFarris

    19 days ago

    Haha, I haven't booted up TempleOS yet. It's still on my bucket list. When I do, it might just happen!

  • @cvabds

    @cvabds

    19 days ago

    @BrentFarris please don't be restricted by the whole religious thing, use it to its full potential please, 4K high res

  • @tiskanto
    @tiskanto 12 days ago

    This is the "Ben Eater" style

  • @wakanda6357
    @wakanda6357 17 days ago

    What should one do or learn to understand assembly??

  • @greenrocket23

    @greenrocket23

    16 days ago

    Well, a pretty good resource for beginners is the MIT OpenCourseWare for the x86_64 architecture

  • @BrentFarris

    @BrentFarris

    15 days ago

    Program some small things in 6502 assembly. It is an incredibly small assembly language and will teach you 90% of what you need to know. You can then get a book or read online docs for x86/x64 and arm instructions. Check out this 6502 tutorial. It comes with an emulator and is a lot of fun: skilldrick.github.io/easy6502/index.html

  • @alejandroulisessanchezgame6924
    @alejandroulisessanchezgame6924 19 days ago

    Is it possible to develop 3D games with golang like this, even if it's a GC language?

  • @BrentFarris

    @BrentFarris

    19 days ago

    Yes, you can either write it from scratch like I do for fun (see Kaiju github engine link in description). Or you can load up helper C libraries for OpenGL, SDL, etc; which I've done in the past. You'll find most game engines like Unreal and Unity use an internally built garbage collector, so don't let the GC hold you back from experimenting.

  • @alejandroulisessanchezgame6924

    @alejandroulisessanchezgame6924

    19 days ago

    Thanks i will try.

  • @nittani.

    @nittani.

    18 days ago

    @BrentFarris What is garbage?

  • @QW3RTYUU

    @QW3RTYUU

    6 days ago

    @nittani. something to be collected, it seems

  • @danielsmith5626
    @danielsmith5626 15 days ago

    ASMR backend is peak

  • @harold2718
    @harold2718 19 days ago

    Instead of transposing B and then essentially doing dot products, you can take a row of B, multiply it by a broadcast element of A, and then add it into the result. That's more efficient than doing dot products; HADDPS isn't that efficient (essentially equal to 2 shuffles plus ADDPS). Also, even if you do want to transpose, you can do it with 8 shuffles instead of 16 INSERTPSes, similar to how the _MM_TRANSPOSE4_PS macro does it (but you have no access to that, so you'd implement it manually).
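The broadcast formulation described above can be sketched in scalar Go to show the loop order. This is an illustration under assumed names (`Vec4`, `RowMat4`, `MulBroadcast`), not the engine's code: the innermost loop corresponds to one MULPS plus one ADDPS over a whole row of B, and no transpose or horizontal add is needed.

```go
package main

import "fmt"

// Vec4 is one row of four float32 lanes; RowMat4 is a row-major 4x4 matrix.
// Both names are hypothetical, chosen for this sketch.
type Vec4 [4]float32
type RowMat4 [4]Vec4

// MulBroadcast computes a*b by accumulating rows of b scaled by single
// elements of a: out[r] += a[r][k] * b[k] for k = 0..3.
func MulBroadcast(a, b RowMat4) RowMat4 {
	var out RowMat4
	for r := 0; r < 4; r++ {
		for k := 0; k < 4; k++ {
			s := a[r][k] // the element a SIMD version would broadcast
			for c := 0; c < 4; c++ {
				// In SIMD form this inner loop is a single multiply and add
				// across all four lanes of row k of b.
				out[r][c] += s * b[k][c]
			}
		}
	}
	return out
}

func main() {
	m := RowMat4{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
	fmt.Println(MulBroadcast(m, m))
}
```

Because each output row is built only from whole rows of B, every load and store stays contiguous in a row-major layout.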

  • @fqidz
    @fqidz 19 days ago

    supernatural season 2 ep 2

  • @domelessanne6357
    @domelessanne6357 18 days ago

    wow

  • @trungthanhbp
    @trungthanhbp 8 days ago

    niec

  • @rdubb77
    @rdubb77 19 days ago

    Primary school? Linear algebra is generally a college subject, I didn’t learn matrix multiplication even in high school

  • @BrentFarris

    @BrentFarris

    19 days ago

    What? Kids nowadays don't do linear algebra after nap time anymore?

  • @gbucks5117
    @gbucks5117 15 days ago

    When the code is on paper, you know the shit is serious

  • @Jhat
    @Jhat 16 days ago

    the real question is... WHAT IS THAT PENCIL HOLDER????

  • @Kyle-do6nj
    @Kyle-do6nj 17 days ago

    All this to ultimately have a 20% efficiency at candy crush...

  • @sokiuwu
    @sokiuwu 17 days ago

    Making assembly 30× faster by writing in binary

  • @BrentFarris

    @BrentFarris

    17 days ago

    Don't tempt me with a good time

  • @opkp
    @opkp 20 days ago

    Neat

  • @spoonikle
    @spoonikle 19 days ago

    Who else is naturally this super?

  • @Miles-co5xm

    @Miles-co5xm

    17 days ago

    Java base classes

  • @emirsahin4105
    @emirsahin4105 9 days ago

    Crazy guy

  • @mikejohneviota9293
    @mikejohneviota9293 19 days ago

    primary school huh for linear math i feel dumb

  • @BrentFarris

    @BrentFarris

    19 days ago

    Me too, I must have missed the linear algebra class they taught at recess...

  • @blockshift758
    @blockshift758 17 days ago

    Bruh is explaining code on paper

  • @user-lh3xs9km6z
    @user-lh3xs9km6z 18 days ago

    These are nice results, without a doubt... but if you need that much optimization, isn't going back to C/C++ better?

  • @BrentFarris

    @BrentFarris

    17 days ago

    Actually, there are some highly optimized Go functions that beat their Go assembly counterparts. I'll make a video on this next. I always advocate for people to write in C; I'm biased because it's my main language. But you really do get some amazing benefits in Go that you just don't in C/C++. So it's really up to the taste of the developer. I've written 3D Vulkan game engines in all 3 languages (C, C++, and Go).