Discussion of SIMD, SOA, AOSOA. Followed by Questions & Answers.

Games

The first programming related to this discussion is in this video: • Implementing User-Leve...

Comments: 41

  • @tomwhitcombe7621
    @tomwhitcombe7621 4 years ago

    If you inherit from AActor in Unreal, you inherit over 800 bytes of trash. That spans 12 and a half cache lines, before you add any of your data to it. Also, they're accessed through polymorphic pointers, so they're stored all over the heap.
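A rough sketch of the arithmetic behind this claim (the ~800-byte figure is the commenter's; the 64-byte line size and the `linesSpanned` helper are illustrative assumptions):

```cpp
#include <cstddef>

// Assumed cache-line size; typical on current x86, but hardware-specific.
constexpr std::size_t kCacheLine = 64;

// How many cache lines an object of a given size touches, at minimum
// (assuming it starts on a line boundary). Rounds up.
constexpr std::size_t linesSpanned(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine;
}

// linesSpanned(800) == 13: iterating such objects touches 13 lines each
// before any of your own data is even reached.
```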

  • @gan_4943
    @gan_4943 4 years ago

    I am 3 months late on this, but at 56:48 someone writes that meshes are cut into chunks for culling purposes. I have worked on AAA engines that do this also. What Jon is missing here is that these chunks are NOT individual meshes. They are all packed into the same vertex buffer and drawn using multi-draw indirect. Culling is typically done by running a shader that computes intersections between chunks and the frustum (or a hierarchical Z-buffer for occlusion) before building the indirect buffer.
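A minimal CPU-side sketch of the culling pass described above, assuming bounding spheres per chunk and a plane-based frustum test; real engines run this in a compute shader, and the struct and function names here are made up for illustration:

```cpp
#include <vector>

struct Plane { float nx, ny, nz, d; };      // nx*x + ny*y + nz*z + d >= 0 means "inside"
struct Chunk { float cx, cy, cz, radius; }; // bounding sphere of one packed chunk

// A chunk is culled if it lies fully behind any frustum plane.
bool chunkVisible(const Chunk& c, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        float dist = p.nx * c.cx + p.ny * c.cy + p.nz * c.cz + p.d;
        if (dist < -c.radius) return false;
    }
    return true;
}

// Stand-in for building the indirect buffer: collect indices of chunks
// that survive culling; a real pass would emit draw arguments instead.
std::vector<int> buildDrawList(const std::vector<Chunk>& chunks,
                               const std::vector<Plane>& frustum) {
    std::vector<int> drawList;
    for (int i = 0; i < (int)chunks.size(); ++i)
        if (chunkVisible(chunks[i], frustum)) drawList.push_back(i);
    return drawList;
}
```

Since every chunk lives in the same vertex buffer, the surviving entries can all be drawn with a single multi-draw-indirect call.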

  • @dandymcgee
    @dandymcgee 2 years ago

    He didn't miss anything. He said it's perfectly reasonable to cut levels into chunks, but not to cut character meshes into chunks for culling. The "surfs" sub-parts that the chatter is talking about could theoretically be for culling if the character mesh is extremely high poly, as is sometimes the case, but I suspect the more important reason it's in chunks is that those chunks need to be rendered with different materials.

  • @10bokaj
    @10bokaj 8 months ago

    Your jokes are truly on a level most mortals cannot comprehend

  • @batmansmk
    @batmansmk 4 years ago

    I never knew this topic from this angle, but it makes sense with database layouts, like OLTP versus OLAP, partitioning, etc. It looks like AOSOA is more convenient than SOA for dealing with inserts.

  • @juxhindyrmishibrigjaj7140
    @juxhindyrmishibrigjaj7140 4 years ago

    Googling SOA & AOSOA already yields this video as a top result. Well done!

  • @androth1502
    @androth1502 4 years ago

    Jon is the only online personality I know of who has talked extensively about the theory behind SOA and friends.

  • @TravisVroman
    @TravisVroman 3 years ago

    6:17 "However, there is always a but. Right. And here's the but. It's a big but in this case, so Sir Mix-A-Lot should be happy." Lmao.

  • @clumsyjester459
    @clumsyjester459 4 years ago

    The best thing would be if we could just tell the caching mechanism what to return: "Always give me bits 9-16 from each 64-bit block!" If we could control the caching mechanism, we could just store data as AOS, deleting and reordering would be easy, and fast reading (and maybe writing) could be done via cache control.

  • @TheMemesofDestruction
    @TheMemesofDestruction 4 years ago

    Thanks Joe!

  • @wilfridtaylor
    @wilfridtaylor 4 years ago

    I look forward to the shiny boots I can only buy in December :D.

  • @cybon4687
    @cybon4687 4 years ago

    I really enjoyed this.

  • @4wb
    @4wb 2 months ago

    such a gem

  • @SimGunther
    @SimGunther 4 years ago

    Next Halloween costume idea: a series of pipes and some parallel pipes in a CPU

  • @solhsa
    @solhsa 4 years ago

    You'll be beside yourself.

  • @phlopsi3586
    @phlopsi3586 4 years ago

    Thank you

  • @michaelkrysiak3767
    @michaelkrysiak3767 4 years ago

    On Linux + Intel, perf-c2c can be used to capture cache-line events.

  • @steven1671
    @steven1671 4 years ago

    15:38 I was thinking about this, and it led me to something called loop tiling. Instead of doing one calculation for every element in an array (x + some constant for all x's), I do one calculation for the first 64 bytes' worth of x in the big buffer (so 16 floats) before moving on to the next calculation, but still using the same 16 floats from the last calculation. Then, once we've done all of the calculations, we go back to the very first calculation and do all calculations sequentially again, but for the next 16 floats (because 16 floats fit into a 64-byte cache line).

    Unfortunately, this might not be portable if I use Intel compiler pragmas to automatically create the extra for loops and do all of the bookkeeping. It can be made portable if I use SDL to figure out the size of a cache line at runtime, but then I have to manually make parts of my codebase more complicated (for example, either I make a global variable or I explicitly pass the cache line size around wherever it's needed).

    An example: each pair of brackets represents 16 floats that are loaded and modified. The first line adds 1 and the second line divides by 2. "cmiss" means cache miss.

    Before tiling:
    +1: [cmiss 1] [cmiss 2] [cmiss 3] [cmiss 4]
    /2: [cmiss 5] [cmiss 6] [cmiss 7] [cmiss 8]

    After tiling:
    +1: [cmiss 1] [cmiss 2] [cmiss 3] [cmiss 4]
    /2: [cmiss 1] [cmiss 2] [cmiss 3] [cmiss 4]

    For any number of operations, we have O(n) cache misses, where n is the number of bytes in our array (256 in this example) divided by the cache line size in bytes (64 typically). Without tiling, it's O(n*m), where m is the number of operations (adds, multiplies, etc.; in this example it's 2).
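The tiling idea above could be sketched like this (the tile size, function name, and the specific +1/÷2 operations are taken from the comment's example; all names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// One 64-byte cache line holds 16 floats (assumed line size).
constexpr std::size_t kTile = 16;

// Apply both operations to one tile before moving on, so each cache line
// is pulled in once total instead of once per operation.
void addThenHalveTiled(std::vector<float>& a) {
    for (std::size_t base = 0; base < a.size(); base += kTile) {
        std::size_t end = std::min(base + kTile, a.size());
        for (std::size_t i = base; i < end; ++i) a[i] += 1.0f;  // op 1 on this tile
        for (std::size_t i = base; i < end; ++i) a[i] /= 2.0f;  // op 2, data still hot
    }
}
```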

  • @goldLizzzard
    @goldLizzzard 4 years ago

    Are you talking about cache blocking? See Intel's website: software.intel.com/en-us/articles/cache-blocking-techniques

  • @OMGclueless
    @OMGclueless 4 years ago

    Are you possibly overcomplicating this? If you have an assembly-level reason like SIMD to go 4 or 8 or 16 or whatever floats at a time, then it might be worth iterating in blocks for that reason. But otherwise there's not much reason to write your loop in any other way than "op 1; op 2" for each float in order. That will have an optimal number of cache misses, just like going 16 at a time will. And optimizing C++ compilers are quite good at unrolling loops for SIMD automatically. The simple thing will probably "just work". Here's a bit of code doing the dumbest, simplest thing, and as you can see it gets unrolled into a beautiful 32-bytes-at-a-time cache-optimal SIMD loop: godbolt.org/z/HvFE-q
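The "dumb simple thing" the reply recommends would look something like this (a sketch, not necessarily the code behind the godbolt link): both operations fused into one pass per element, which has the same cache-miss count as the tiled version and which an optimizing compiler can typically auto-vectorize at -O2/-O3:

```cpp
#include <vector>

// One pass over the data; each element is loaded once, both operations
// happen while it's in a register, and it's stored once.
void addThenHalveSimple(std::vector<float>& a) {
    for (float& x : a) {
        x += 1.0f;  // op 1
        x /= 2.0f;  // op 2, no second sweep needed
    }
}
```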

  • @steven1671
    @steven1671 4 years ago

    @@OMGclueless Since my comment, I wrote a software triangle rasterizer from a tutorial called tinyrenderer (on GitHub). After getting it to work in the simplest way possible, I rewrote everything such that all intermediate data would be stored and processed in dynamic arrays ("data-oriented design"). Needless to say, this made my program a bit slower AND much more complicated. Anyways, I learned my lesson about trying to hand-optimize C++ code without a profiler. Now I need to rewrite my program to be the way it was before I... wrecked it.

    Since then, I profiled the simple version of the program and found out that barycentric coordinate calculations were a big bottleneck. And I found a blog about how to write a modern software renderer, detailing how to remove most of the multiplications in the barycentric coordinate calculation code and replace them with additions (without changing/breaking the behavior of the code). Implementing this optimization is actually easier in the simple version of the code; in fact, the code in the blog post is very similar to the code in the GitHub tutorial (simple, nested for loops looping over everything, with intermediate data just on the stack). What I did was not only stupid, but harmful. Good thing it's only a personal project.

  • @cutemath8225
    @cutemath8225 4 years ago

    39:07 that's spooky :D

  • @AinurEru
    @AinurEru 4 years ago

    Hey Jonathan Blow - great talk! You should probably have mentioned 2 additional benefits of having 3 separate arrays for the x, y, and z values (respectively): 1) A single index per vector, WITHOUT any kind of index arithmetic. 2) Resizing is trivial. As for cache coherency, yes, each vector-component array may end up in a completely different place in memory, but I would guess that in most cases the prefetcher would simply fetch each of them into separate cache lines, so that they would all be cached regardless... If that is the case, wouldn't this make this layout the clear winner overall?

  • @dandymcgee
    @dandymcgee 2 years ago

    I have no idea how it works... I've always wondered this as well, but I think this is what he was speculating cache associativity might affect, i.e. does prefetching the y values somehow stomp on the same cache storage as the prefetched x values? I would hope not, but I have no clue.
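The layout this thread describes could be sketched as follows (names are illustrative). Modern prefetchers generally track several independent streams, so x, y, and z can each stream through their own cache lines; whether they ever evict each other depends on the cache's associativity and the arrays' addresses, which is the open question in the replies:

```cpp
#include <cstddef>
#include <vector>

// One array per component: a single shared index, no index arithmetic,
// and each array resizes independently.
struct PointsSOA {
    std::vector<float> x, y, z;

    void push(float px, float py, float pz) {
        x.push_back(px); y.push_back(py); z.push_back(pz);
    }

    // Reads all three streams with the same index i.
    float lengthSquared(std::size_t i) const {
        return x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }
};
```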

  • @wrong1029
    @wrong1029 4 years ago

    The specs library for Rust implements a data-driven system like this. I think the way it's done is pretty elegant.

  • @snarkyboojum
    @snarkyboojum 4 years ago

    Mo Blow! 👍

  • @Setlean
    @Setlean 4 years ago

    18:55 the scoop about the PS5 / new Xbox architecture

  • @Fnargl99
    @Fnargl99 3 years ago

    I was wondering what he was talking about, i.e. what additional silicon was added to the Xbox/PS5.

  • @playertwo9895
    @playertwo9895 3 years ago

    @@Fnargl99 I would guess PS5's dedicated Kraken compression unit

  • @nineh9739
    @nineh9739 4 years ago

    What if you had your entities only hold pointers into the arrays that pack all the data together for batching? I never got this far in depth before, so I'd like to know the cost of the pointers.
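One common sketch of the scheme the comment asks about, with indices used as handles instead of raw pointers (indices stay valid when the packed array reallocates, and are usually smaller); all names here are made up:

```cpp
#include <cstdint>
#include <vector>

// Packed component data, iterated directly by the batch loops.
struct Transforms {
    std::vector<float> x, y;
};

// The entity owns no data, just a handle into the packed arrays.
struct Entity {
    std::uint32_t transform;
};

// Random access through the entity costs one extra indexed load per
// array; the batch loops never pay it because they walk the arrays directly.
float entityX(const Entity& e, const Transforms& t) {
    return t.x[e.transform];
}
```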

  • @ManuelBTC21
    @ManuelBTC21 4 years ago

    42:00 Most of the benefit of SOA can be achieved with relatively small arrays. Put differently, larger and larger arrays have diminishing returns. This means you can minimize the cost of the copies needed to append or insert elements if you have a tree of small/medium-size contiguous arrays, rather than a single huge contiguous array. My professor once said: in computer science, all problems can be solved by adding an extra layer of abstraction or indirection /s.
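A sketch of the "collection of small contiguous arrays" idea, assuming an AOSOA-style layout with SOA inside each fixed-size block (block size and names are illustrative). Appending at worst allocates one new block; existing blocks are never copied, unlike growing one huge contiguous array:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kBlock = 16;  // elements per block (illustrative)

// SOA within one block: SIMD-friendly sweeps inside, cheap growth outside.
struct Block {
    float x[kBlock];
    float y[kBlock];
    std::size_t count = 0;
};

// An array of SOA blocks, i.e. AOSOA.
struct BlockedArray {
    std::vector<std::unique_ptr<Block>> blocks;

    void push(float px, float py) {
        if (blocks.empty() || blocks.back()->count == kBlock)
            blocks.push_back(std::make_unique<Block>());
        Block& b = *blocks.back();
        b.x[b.count] = px;
        b.y[b.count] = py;
        ++b.count;
    }
};
```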

  • @iestynne
    @iestynne 4 years ago

    Manuel Barkhau Unfortunately, in terms of code quality, that is the *source* of all problems.

  • @seditt5146
    @seditt5146 4 years ago

    @Simon Farre I mean... he is kind of right, dude. Abstractions are the cause of the vast majority of problems we see in the computer industry, from runtime problems to performance problems. Abstraction solves readability and development-time problems but causes many problems in terms of performance.

    All the way down at the bottom, most problems you have likely boil down to how the compiler abstracted a subset of commonly grouped CPU instructions; its attempts at abstracting that out to a form you desire cause plenty of problems. It does eliminate some problems, I get that; it helps us write faster, larger, and more maintainable code. But many issues you face in coding would not exist if you explicitly had to hand-code the assembly for every single use case. We are forced to write patterns in programming because our compiler is written in a way that makes doing so the best course of action, as our abstractions have to play nice with the compiler's abstractions.

  • @joebravo4224
    @joebravo4224 4 years ago

    Aha!

  • @solhsa
    @solhsa 4 years ago

    tl;dr: Parallel access: use SOA. Serial or random access, changing array sizes: use AOS. A bit of both: use AOSOA.
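The three layouts from this tl;dr, side by side as a sketch (field names, counts, and the 8-float lane width are illustrative assumptions, not from the talk):

```cpp
#include <cstddef>

// AOS: one struct per element; good for serial/random access, easy resize.
struct ParticleAOS  { float x, y, z; };
struct ParticlesAOS { ParticleAOS p[1024]; };

// SOA: one array per field; ideal for parallel/SIMD sweeps over one field.
struct ParticlesSOA {
    float x[1024], y[1024], z[1024];
};

// AOSOA: SOA within small SIMD-width chunks, chunks stored as an array.
constexpr std::size_t kLane = 8;  // e.g. 8 floats per AVX register
struct ParticleChunk {
    float x[kLane], y[kLane], z[kLane];
};
struct ParticlesAOSOA { ParticleChunk c[1024 / kLane]; };
```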

  • @daleowens7695
    @daleowens7695 4 years ago

    I tend to agree; I never thought "the internet is a series of tubes" was all that awful an analogy.

  • @Fnargl99
    @Fnargl99 3 years ago

    The worst part of the "series of tubes" speech was what he said just before: "I just the other day got... an Internet was sent by my staff at 10 o'clock in the morning on Friday. I got it yesterday [Tuesday]."

  • @kristupasantanavicius9093
    @kristupasantanavicius9093 4 years ago

    Wouldn't AOSOA hurt performance if you just wanted to look at a single property (like flags) of an entity?

  • @brothir
    @brothir 4 years ago

    That depends, but it's true that SOA and AOSOA aren't optimal in all use cases. Just like there are algorithms that are inherently serial, there is code that is inherently "unitary", i.e. it only cares about singular elements, and batch processing is irrelevant or detrimental if forced. However, any given program is probably not entirely unitary, and the general observation is that a lot of performance-critical code benefits from some fashion of SOA. But it's not an absolute.

  • @InZiDes
    @InZiDes 4 years ago

    You need to consider that, because of cache prefetching, when you access memory in a linear pattern (over cache-line addresses), there are no cache misses.

  • @user-ov5nd1fb7s
    @user-ov5nd1fb7s 9 months ago

    The example with SIMD and addition doesn't look honest. What I mean by that is that the additions are associative and will trigger instruction-level parallelism.
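A sketch of the instruction-level parallelism this comment alludes to: with several independent accumulators, the adds form separate dependency chains, so the CPU can keep multiple additions in flight per cycle. (Strictly, float addition is not associative, which is why compilers won't reorder a single-accumulator sum without something like -ffast-math; this manual version opts in explicitly. Names are illustrative.)

```cpp
#include <cstddef>
#include <vector>

// Sum with four independent accumulators, i.e. four dependency chains.
float sum4(const std::vector<float>& a) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        s0 += a[i];     // these four adds have no dependency
        s1 += a[i + 1]; // on each other, so they can execute
        s2 += a[i + 2]; // in parallel on a superscalar core
        s3 += a[i + 3];
    }
    for (; i < a.size(); ++i) s0 += a[i];  // leftover tail
    return (s0 + s1) + (s2 + s3);
}
```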
