RTL Engineering

Educational videos covering various topics in computer engineering and computer architecture, focusing on mid-1990s to early-2000s computers and game consoles. Most topics are presented with FPGA implementation in mind, and assume some prior knowledge of computer systems and digital electronics.

MicroOps in the Pentium MMX

Analysis of a Tensor Core

VR4300 Primer

N64 Hardware Architecture

Comments

  • @user-qf6yt3id3w · 29 days ago

    This reminds me of Jim Keller's comment that x86 decode 'isn't all that bad if you're building a big chip'.

  • @EhrenLoudermilk · a month ago

    Doing the lord's work

  • @32_gurjotsingh82 · a month ago

    Great argument! One question, however, on the LU OR implementation: can using tristate buffers enabled by the decoder help? The AND-OR stage is essentially doing just that. Which of the two would be preferable as a first cut?

  • @blmac4321 · a month ago

    Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers. bfloat16 MAC, one addition and one multiplication per clock: ~100 LUTs + 1 DSP48E2 @ >600 MHz, with the result accumulated in >256 bits. A tensor core needs 64 of these => ~6,400 LUTs + 64 DSP48E2.

  • @blmac4321 · a month ago

    It's on LinkedIn, and eventually on arXiv. YT is not letting me post more; not sure why.

  • @jmi2k · a month ago

    Where was this video when I was scratching my head with _exactly_ the same problem and had to come up with all this by myself :_(

  • @soloM81 · a month ago

    The N64 fell short on the MiSTer platform. Now we have an idea of what it takes for the PS1, Saturn, and N64 to run. What are your thoughts on the new platforms like the MARS and the Replay2? Are you still making your own FPGA? It's been almost 2 years; I would like for you to upload a new video. What do you think now?

  • @jmi2k · a month ago

    It saddens me to see all the fuss about the speech-synth thing. If you are into these kinds of things, this video is outstandingly good, going deep into what's going on. I can understand a difference in opinion about the perceived quality of the video (which is subjective anyway), but the claims I've read that the video is low-effort or that the author is lazy are hurtful and unfair, especially taking into account that it's done for free and publicly available.

  • @jojodi · a month ago

    I'm curious, now that we did get (almost all of) an N64 core on MiSTer, whether you think this platform is more interesting to pursue. In this video you mentioned you'd like to wait for that to become a reality before pursuing this further.

  • @FirstPrinciplesFirst · a month ago

    Thank you for this

  • @fungo6631 · 2 months ago

    The N64 could also decode lossily compressed audio like MP3 and today even Opus via the RSP. Some later Rare, Factor 5 and Boss Game Studios games implemented MP3.

  • @fungo6631 · 2 months ago

    Oh wow, blud can actually speak! I'm used to the Zoomer TTS.

  • @nothingelse1520 · 2 months ago

    My first PC was a Pentium 100. I got Quake right after it launched... didn't run that great lol

  • @SaraMorgan-ym6ue · 2 months ago

    Quake, Floating Point, and the Intel Pentium... because it was a Pentium and not a Pentium 4 🤣🤣🤣🤣🤣

  • @HungNguyen-to7dg · 2 months ago

    I love this video

  • @novavr3dnovaresearch780 · 3 months ago

    The most detailed explanation of GPU memory on the net. Thanks a lot for the videos. 👍

  • @elliottzuk3008 · 3 months ago

    Please do a Dreamcast one!

  • @thomasvennekens4137 · 4 months ago

    The WinChip did well, but it wasn't widely known.

  • @markmental6665 · 4 months ago

    It was cheaper, but kind of slow.

  • @Lilithe · 4 months ago

    Why is this done in TTS? I guess if you just like writing PowerPoints for YouTube...

  • @RTLEngineering · 4 months ago

    Or that I really dislike editing my own audio. AI voice generation took hours rather than days, on top of the research, script writing, and visuals, which took several weeks. Following that up with tedious spectrogram work is quite an unpleasant experience. The audio is not the primary content; it is only one component of the medium.

  • @viscountalpha · 5 months ago

    I remember buying a Pentium 166 MMX chip and thinking it was the perfect price/performance back then.

  • @jaxx4040 · 5 months ago

    Funny to think how we see tessellation as triangles when it’s a triangle representing a pyramid, representing points.

  • @Laykun9000 · 5 months ago

    I'm not sure clock rate is a good metric to use for GPU speed. Really, it should be transistors x clock speed. It makes the phrase at the end a bit hollow, since GPU compute has generally been about scaling horizontally instead of vertically, and it will definitely give people the wrong impression. It just makes it sound like you're trying very hard to justify your original premise of memory being more important than compute, when really it is both. Especially since compute has outstripped memory many times in GPU history, leaving GPUs starved. I otherwise very much enjoy your videos, great work!

  • @RTLEngineering · 5 months ago

    Thanks for the feedback! Could you give an example where compute outstripped memory? The only cases I can think of were marketing (i.e. less VRAM was chosen to save cost, which is not a technology/architecture limitation). I disagree with transistors being a good metric; that's similar to comparing software by lines of code. Transistors are used for the compute, but also for on-chip memory, data routing (busses), clock distribution, miscellaneous controllers, and I/O buffers. What you really want to use is operations/second, which for fixed-function GPUs would be fill-rate. Comparing clock speed and fill-rate gives you an indication of where the performance came from: if fill-rate grows faster than clock speed, then the performance comes from scaling horizontally, whereas the contrary is from pipelining or a technology shrink. Bandwidth (memory) does still play a role there, but it's impossible to unlink the two in this domain as it forms an integral part of the processing pipeline. Also note that none of the memory claims (except for the PS2) account for DRAM overhead, which will necessarily result in degraded performance compared to the ideal (peak) numbers.

  • @Laykun9000 · 5 months ago

    @RTLEngineering Sure, transistor count isn't great either, but that doesn't mean clock speed is a good indicator. Current-day GPUs are WAY more than 15x faster than the PS2 GPU in terms of compute. Regardless, memory and compute are as important as each other; one isn't the main contributor vs. the other. And by compute outstripping memory, I mean memory bandwidth becoming the bottleneck. The GeForce 256 was notoriously limited by its memory bandwidth, and they later released the GeForce 256 DDR to unlock its potential. It's simply a matter of balance and bottlenecks. You could possibly chart FLOPS vs. memory speed, idk, but anything is better than Hz.

  • @RTLEngineering · 5 months ago

    Then I guess you're in agreement with me / the videos. The entire premise was that bandwidth was the driving factor for performance, not memory capacity or clock speed. Every GPU that I am aware of has been limited by bandwidth in one way or another, the GeForce 256 being no exception. And the GeForce 256 DDR was still limited by its memory bandwidth. Unfortunately we can't plot FLOPS, because most older GPUs didn't execute floating-point operations. Similarly, when considering modern GPUs, FLOPS is not a great metric for render performance, since large portions of the pipeline are still fixed-function. So fill-rate remains the better metric, which serves as a proxy for "FLOPS". That's also what I used in the videos, not clock speed; the clock speed was shown to indicate that it was not the major contributing factor. Also, what you were describing (FLOPS vs. memory bandwidth) is called a roofline, which is the standard method for comparing performance of different architectures and workloads.
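
    For readers unfamiliar with the term: a roofline model caps attainable throughput at the lesser of peak compute and peak bandwidth times arithmetic intensity. A minimal sketch in Python, with made-up numbers purely for illustration:

        def roofline(peak_ops_per_s, peak_bytes_per_s, ops_per_byte):
            # Attainable ops/s: compute-bound or bandwidth-bound, whichever is lower.
            return min(peak_ops_per_s, peak_bytes_per_s * ops_per_byte)

        # Hypothetical machine: 1 Tops/s peak compute, 100 GB/s memory bandwidth.
        print(roofline(1e12, 100e9, ops_per_byte=4))   # 4e11 -> bandwidth-bound
        print(roofline(1e12, 100e9, ops_per_byte=32))  # 1e12 -> compute-bound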

  • @Laykun9000 · 5 months ago

    @@RTLEngineering My issue is that I'm making a hard distinction between logic units and memory bandwidth, whereas I think you've explicitly shown that they are deeply coupled, proving that the line is effectively a lot blurrier than I previously understood. I'm just smooth-brained from all the hardware reviews making hard distinctions between the two. Thanks for your detailed replies!

  • @CMSonYT · 5 months ago

    @@RTLEngineering One-word counterargument: RDNA2

  • @Sky1Down · 6 months ago

    These DAMNED computer-READ VIDEOS are BULLSHIT!! Can't even pronounce shit right, and I WILL NOT reward YOU for stealing someone's work!!

  • @RTLEngineering · 5 months ago

    Your engagement by leaving a comment is technically a reward. Luckily, if you spend a few seconds thinking about it or reading the other comments, you will see that your concern is unjustified (i.e. no work was stolen). If you need a hint: this video was posted almost two years ago (pre-AI craze), meaning a human would have had to write the script. The AI voice was chosen to save production time on my part, and I did take care to make sure all of the pronunciations were correct. The only issue was "Id" in "Id Software", which is said twice in a 20-minute video. Regardless, you're free to dislike the video and not watch it due to the voice-over, but claiming plagiarism is uncalled for!

  • @yuvrajsingh099 · 6 months ago

    Up to the GameCube would be great. The Wii and Wii U are modern and will run just fine in software emulation.

  • @mikafoxx2717 · 6 months ago

    Man, I hate to say it, but x86 is ugly. I can see why RISC was a huge deal back in that era. It would be cool to see the architectures compared. Early ARM was very odd with its barrel shifter in every instruction, though MIPS and Power were more popular in the '90s. Even just looking at how the Z80 did its instructions... DJNZ is just a little dirty.

  • @mikafoxx2717 · 6 months ago

    One of the most exciting things about emulation in hardware is the ability to modify the graphics hardware to render at higher resolutions.

  • @wilsard · 6 months ago

    The Cyrix 6x86 PR233 ran at 188 or 200 MHz depending on the version and bus speed.

  • @MadScientistsLair · 6 months ago

    I need to make a video on the total disaster that my first "real" PC, built from actually-new parts, was. It absolutely hauled for productivity and web browsing (back when page rendering speed mattered, even on 56k!) but was an absolute dog at games. I picked pretty much the worst combo I could have back then for performance and stability: a K6-2, an ALi Aladdin V chipset mobo, and an NVIDIA TNT2. I'd have been better off with a PPGA Celeron, 66 MHz FSB and all, and the cost difference would have been almost nil. Quake engine titles suffered the worst, as expected, but Unreal engine stuff wasn't exactly amazing either, though the latter DID benefit from 3DNow! without AMD making a special patch like they did for Quake II. I stayed with AMD for the next rig I built, for my 16th birthday: an Athlon T-bird 1000 (AXIA stepping) OC'd to 1400 and a GeForce 2 Pro on a KT133A board. That was a proper rig, though it, combined with the barely-68%-efficient PSUs of the time, kept my room rather warm. I learned a lot in between those two rigs.

  • @turbinegraphics16 · 7 months ago

    This looks like an AI generated video.

  • @athos5359 · 7 months ago

    I wonder how big the die of the PowerVR2 GPU inside the DC is; the Voodoo 3 is 74 square millimeters, and the PowerVR2 looks like 2x the size.

  • @Phredreeke · 7 months ago

    16:55 didn't the Woz design the Apple II video circuitry to do DRAM refresh while drawing the screen, leading to a very unusual framebuffer layout?

  • @ccanaves · 7 months ago

    What about the 6x86? How does it differ from the K6?

  • @gsestream · 9 months ago

    So why don't you just say "matrix operation core" or "matrix multiplication core"? Why make things complicated with complex, differing terminology like "tensor"?

  • @RTLEngineering · 9 months ago

    Probably because the association was for AI/ML workloads which work with tensors (matrices are a special case of the more general tensor object). Though I am not sure why "Tensor Core" was chosen as the name since other AI/ML architectures call them "Matrix Cores" or "MxM Cores" (for GEMM). It might just be a result of marketing. I would say "MFU" or "Matrix Function Unit" would be the most descriptive term, but that doesn't sound as catchy.
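
    As a concrete illustration of the operation being named, here is the tile-level multiply-accumulate D = A*B + C that such a matrix unit performs, sketched in NumPy. The 4x4 tile size and the mixed precision (low-precision inputs, wider accumulator) are assumptions for illustration, not a statement about any particular GPU:

        import numpy as np

        # Low-precision input tiles, higher-precision accumulator tile.
        A = np.random.rand(4, 4).astype(np.float16)
        B = np.random.rand(4, 4).astype(np.float16)
        C = np.random.rand(4, 4).astype(np.float32)

        # The whole "tensor core" op: one small GEMM with accumulation.
        D = A.astype(np.float32) @ B.astype(np.float32) + C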

  • @gsestream · 9 months ago

    How much memory is chip-internal/local in the RDP DMEM? Could it be used as a hardware z-buffer, or as a chip-local frame buffer extension?

  • @RTLEngineering · 9 months ago

    None. There's a small cache that's controlled by the hardware (to cover bursting), but otherwise the z-buffer and frame buffer are stored in the shared system memory. The DMEM on the RCP can't be used for z-buffer or color directly. It can be used for it indirectly, but you're going to end up copying stuff in and out of main memory which will perform worse than not using it at all. Alternatively, it's possible to program a software renderer using SIMD on the RCP, but it would leave the RDP idle.

  • @gsestream · 9 months ago

    @@RTLEngineering You can do microcode changes directly; maybe a true hardware z-buffer, using the DMEM/IMEM 4 KB caches.

  • @gsestream · 9 months ago

    @@RTLEngineering Maybe TMEM could be partially used as a local z-buffer cache, while the other part is used as normal texture memory.

  • @RTLEngineering · 9 months ago

    That's what I meant by "software render using SIMD". There's no read/write path between the DMEM and IMEM, nor is there a read/write path between the DMEM and the fixed-function RDP path. All communication between them would need to be done using DMA over the main system bus. Regarding TMEM, it's the same. There's no direct write path, where you can only write to the TMEM using DMA. Worse yet, the DMACs in all cases required that one address be in main memory, so you couldn't DMA between the memories without first going through the main memory.
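
    To make the last constraint concrete, here is a hypothetical sketch (all names invented, not the real N64 API) of what moving data from DMEM to TMEM would look like when every DMA must have one endpoint in main memory:

        # Hypothetical helper: dma.transfer is an invented interface for illustration.
        def copy_dmem_to_tmem(dma, dmem_addr, tmem_addr, length, staging_addr):
            # Hop 1: DMEM -> RDRAM staging buffer (one endpoint in main memory).
            dma.transfer(src=("DMEM", dmem_addr), dst=("RDRAM", staging_addr), n=length)
            # Hop 2: RDRAM staging buffer -> TMEM (again through main memory).
            dma.transfer(src=("RDRAM", staging_addr), dst=("TMEM", tmem_addr), n=length)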

  • @TUUK2006 · 9 months ago

    AI voice overs are unlistenable.

  • @phirenz · 9 months ago

    I've been trying to work out how they actually implemented the multiplier in the real R4300i design. The datapath diagram in the R4300i datasheet shows they are using a "CSA multiplier block" and feeding its result into the main 64-bit adder every cycle (which saves gates; why use dedicated full adders at the end of the CSA array when you already have one). Going back to the R4200, there is a research paper explaining how the pipeline works in great detail, and the R4300i is mostly just an R4200 with a cut-down bus and a larger multiplier. The R4200 uses a 3-bit multiplier, shifting 3 bits out to LO every cycle (or the guard bits for floats) and latching HI on the final cycle (floats use an extra cycle to shift 0 or 1 bits right, then repack). I'm assuming they use much the same scheme, but shifting out more bits per cycle. So it's not that the R4300i has multipliers that take 3 and 6 cycles and then take two cycles to move the result to LO/HI, but that the 24-bit and 54-bit multiplies can finish 1 cycle sooner. So I think the actual timings are: 3 cycles for 24-bit, 4 cycles for 32-bit, 6 cycles for 54-bit, and 7 cycles for 64-bit (though you need an extra bit for unsigned multiplication). To get these timings, the R4300 would need a 10-bit-per-cycle multiplier. If I'm understanding the design correctly: every cycle, the CSA block adds ten 64-bit-wide partial products. 10 bits are immediately ready, shifting 10 bits out to LO, and the remaining 63 bits of partial sums and shifted carries are latched into the adder's S and T inputs. On the next cycle, the CSA block also takes the reduced partial sums from the adder's result as an 11th input to the CSA array.
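
    The proposed timings are easy to sanity-check: under the assumed 10-bits-per-cycle scheme, the cycle counts fall out of a simple ceiling division (and the extra bit for unsigned operands rounds to the same counts):

        from math import ceil

        def mul_cycles(operand_bits, bits_per_cycle=10):
            # Cycles to retire all result bits at the assumed rate.
            return ceil(operand_bits / bits_per_cycle)

        for bits in (24, 32, 54, 64):
            print(bits, mul_cycles(bits))  # -> 3, 4, 6, 7 cycles respectively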

  • @myownfriend23 · 10 months ago

    I've always heard people say that TBDRs have a frame of latency. Maybe that was the case for older designs, I'm really not sure, but a lot of the time it felt like people misinterpreting what was happening, because I've never seen anything from Imagination saying that. All that's happening is that, instead of the vertex and pixel shading being interleaved like in IMRs, it's more like all the vertex shading happens and then all the fragment shading happens. There's nothing about this that requires that the pixel shading happen on the next frame. The two stages don't take the same amount of time either. A triangle more or less just stays three points (three numbers per point) for the whole vertex shading stage. One of those triangles can become hundreds of pixels in the rasterization phase though, and that's going to take more time to compute and write to memory. In that sense, an IMR may have its whole pipeline backed up by one triangle that turns into a particularly large number of pixels. Since a TBDR keeps the stages separate, it can potentially finish its vertex shading for a frame in far less time with fewer stalls. Then the fragment shading stage gets a huge boost from HSR and its dedicated, fast on-chip buffer. Now, you're right in that, while it's fragment shading one frame, it can start vertex shading the next; it's not like it's waiting for the next frame in order to start pixel shading. It's just getting started on the next frame before the current frame is done.

  • @RTLEngineering · 10 months ago

    What is meant by "1 frame of latency" comes down to the fact that all triangles must be submitted before rendering can begin, at least with the older GPUs. The newer PVR archs (especially those used in the Apple SoCs) can reload previously rendered tiles, but the GPU used in the Dreamcast had no method to load the tile buffer from VRAM. So in practice, you want to pipeline the entire process (Submit -> Vertex Transform -> Bin -> Render -> Display), which gives you that extra frame of latency which IMRs don't require (since the render tiles can be revisited there). While you could do all of those in a single frame, that necessarily reduces the total amount of work you can do, else you will have to re-render the previous frame (introducing latency). For IMR, you don't need the Bin stage, and can instead interleave the stages, meaning you have:
    TBDR: |Submit -> Vertex Transform -> Bin ->| Render ->| Display| (3 frames)
    IMR: |Submit -> Vertex Transform -> Render ->| Display| (2 frames)
    Note that the Dreamcast was specifically modified to reduce this latency under certain scenarios, in which the tiles can be rendered in scanline order, meaning that the next frame can start to be displayed while the pixel visibility is being computed and then shaded:
    Dreamcast: |Submit -> Vertex Transform -> Bin ->| Render -> Display| (2 frames)

  • @myownfriend23 · 10 months ago

    @@RTLEngineering If the Dreamcast couldn't reload the tile buffer from VRAM, then I don't know how that would be an issue unless the game was trying to use the rendered image as a texture. Outside of that, what gets rendered to the tile buffer and then out to VRAM is the finished tile for the frame. It only needs to be read by the display controller and sent to the TV. It can still read from its tile list in the same frame:
    |Submit -> Vertex Transform -> Bin ->| Render ->| Display|
    |Submit -> Vertex Transform -> Bin ->| Render -> Display|
    What you're saying about the Render and Display steps makes complete sense to me. It's the separation between Bin and Render that makes none. It's not reading from the tile buffer here; the tile buffer is at the end of the pipeline. The Bin -> Render stage is when the tile list is being pulled into the GPU from VRAM to be rendered. There's nothing that would necessitate waiting for the next frame deadline for this to happen. If the GPU can't read the tile buffer from VRAM, then that wouldn't cause an issue, because the tile buffer isn't the tile list / parameter buffer, which is all that needs to be read in that stage. The tile list can obviously be read from VRAM, because that's where it's stored; if it couldn't, then the GPU wouldn't work at all. I could understand it if you're looking at an example where the last triangle is submitted close to the deadline, though. The IMR will have already completed rendering of almost all previous geometry and only needs to finish that up. In that same case, yes, the TBDR will not complete rendering before that deadline, because it was waiting for the last triangle to start rendering. But saying that these two stages always happen in different frames would be incorrect. For example, if you're just rendering a menu on the Dreamcast, then the amount of submitted geometry would be so little that it could be counted by hand. The CPU computation and geometry submission could take, let's say, half a millisecond. The transform and binning stage would take less than that. At that point it's not going to wait for the next 16 ms before it starts rasterizing and texturing those triangles; it's just going to start reading the tile list right after it's done with binning, it will finish rendering far before the next frame deadline, and there will be no frames of latency.

  • @RTLEngineering · 10 months ago

    The issue is that every triangle for a frame must be binned before rendering can begin. So if you have 4M triangles in a frame, you must first submit, transform, and bin all 4M triangles before the first render tile can be touched. If you start rendering a tile before binning is complete, then you may finish visibility testing and rendering before all of the triangles are known for that tile; that would result in dropping triangles over the screen randomly based on submission order. This is a hard deadline which is not necessary for IMR: an IMR can accept new triangles until a few cycles before the new frame must be presented to the display. Even the IMR architectures that do a type of render-tile binning do so on a rolling submission basis, because they can return to a previously rendered tile. The scenario you described is correct; in that case you have less work to do, and therefore the deadline isn't as tight. But in general, a game developer wants to submit as many triangles, with as many textures and as many effects as possible, per frame. If you combine Submit -> Vertex -> Bin -> Render into a single frame and target 60 fps, then that 16 ms must be divided between the two phases, Submit -> Vertex -> Bin and Render. So if Submit -> Vertex -> Bin takes 10 ms, then you only have 6 ms to render all of the tiles (480p would be 300 tiles, so 20 us per tile), which limits the total triangles per frame. Also keep in mind that Submit -> Vertex is done on the CPU (for the Dreamcast) and is interleaved with the game logic itself, so that's going to take longer than if all it were doing is pulling from a preset list in RAM. Binning is done on the GPU, but only handles 1 triangle at a time, so that will be slow if there are too many as well. (It's a write-amplification task, meaning it can be done in bounded but not constant time.) Regardless, if you take that approach during a game, you're likely going to drop every other frame to catch up with rendering. The alternative is to render the tiles as you display them, but that would mean that all 20 horizontal tiles need to be rendered within 1/15 of a frame, or 53 us each. If the row of tiles is not complete by the time it needs to be displayed, then you again need to drop the frame or accept screen tearing. While that same number is also true for the entire screen at once, you have 300 tiles to balance out the load rather than relying on 20 (you're more likely to have some tiles that take 2 ms and some that take 2 us in a pool of 300 than in a pool of 20). In both cases, if you drop a frame, then you get 1 extra frame of latency. And besides, in your menu example, 1 extra frame of latency is not important... you should be thinking about the cases in which both latency and performance matter.
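
    The per-tile numbers above follow from simple arithmetic, assuming a 640x480 frame cut into 32x32 tiles (the assumption that yields the 300-tile figure):

        TILE = 32
        tiles = (640 // TILE) * (480 // TILE)   # 20 x 15 = 300 tiles
        per_tile_us = 6_000 / tiles             # 6 ms render window -> 20 us/tile

        # "Chasing the beam": one 20-tile row per 1/15 of a ~16 ms frame.
        per_tile_beam_us = (16_000 / 15) / 20   # ~53 us/tile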

  • @myownfriend23 · 10 months ago

    @@RTLEngineering I think you're misunderstanding where I'm disagreeing with you. I'm not disputing that submitting and binning all the triangles for the scene before rendering is a hard requirement on a TBDR, or that "chasing the beam" with tile-render order would be required to get the most work done before the frame deadline. Where my issue lies is with saying that the 1 frame of additional latency is a rule that's built into how the hardware works, when it isn't. That's the reason why I mentioned the menu example. It's not representative of the workload of a full 3D game scene, sure, but it demonstrates a real scenario, not uncommon on the Dreamcast or a PC, where the GPU would be needlessly wasting time and adding latency if it were really a hard requirement for the GPU to do its vertex stage and fragment shading across two different frames. That's not how any GPU designer would design their GPUs, and that's not how Imagination designed theirs. You could say that a frame of latency is a side effect of the architecture when triangles get submitted too close to the deadline, and you could even say that that's common, but explaining it as if the hardware literally can't avoid the latency in any scenario, as an absolute requirement of the hardware... is wrong. If the hardware has enough time before the frame deadline to finish rendering after it's done binning (like in the menu example, or even in the case of a 2D game), then it will do that, and it won't have the latency. This is an architectural video, so it should describe the architecture. If there's a realistic limitation to that architecture when in use, then that should be mentioned too, but it shouldn't be phrased like that limitation is built into the architecture. I mean, you said it yourself: if you target 60 fps, you have 16.6 ms to render the frame. That could be 10 ms for Submit -> Vertex -> Bin with 6 ms for Render -> Display... but you CAN do that. It's not a hard limitation. Also keep in mind that workloads in a game aren't constant; they vary. If that's how one frame works out, then the next frame could be the inverse of that, 6 ms for Submit -> Vertex -> Bin with 10 ms for Render -> Display, and that assumes the CPU waited until the deadline of frame 1 before it started frame 2. If it started the vertex stage for frame 2 right after it was done with the vertex stage of frame 1, then it would be ready to start rendering frame 2 around the time of the deadline for frame 1. Sure, frame times aren't often that erratic, you can argue that the scenario I just mentioned isn't common, and you could say that the hardware would be underutilized in that scenario, but the hardware IS capable of it. Lastly, any game that hits a stable 60 fps likely isn't just barely hitting its deadline; it's done way before it and is just capped at 60 fps. The same is also true for 30 fps games. Without a cap they could run at 35 or 40, but they just cap it at 30 fps. That means they have 33.33 ms between frames, but they'll often be finished rendering in 22-28.5 ms.

  • @RTLEngineering · 10 months ago

    Sure, although I don't think I ever claimed it was a fundamental hardware constraint. That would be entirely wrong, as the hardware had no interlocks as far as I am aware; you could have it display a partially drawn tile if you hit the pointer flip at the right time. Practically, I gave two examples in my previous response in which there would be no extra frame of delay, and mentioned the limitations of doing so. Typically, though, software running on a TBDR does introduce a second frame of delay in how it controls the GPU, for those very reasons. It's also a lot simpler to write the game code to account for that delay than to dynamically adjust to it. Note that even with IMR you don't need any frame delay either; you could just render directly to the display buffer (the PS2 did this in some cases), in which case you would be submitting triangles to the frame as it was being rendered. The Nintendo DS was actually notorious for this, as that was the only way to draw the triangles (chasing the beam). Regardless, that's arguing more semantics than anything, since it's more complicated to say that the extra frame delay is introduced by software, as a result of the TBDR architecture's hard bin-deadline requirement (a requirement not imposed by IMR), but can be overridden in cases where the deadline is more relaxed or visual artifacts are tolerable. Some simplifications need to be made for an architecture video / lecture, as it's not reasonable to list all of the nuances. For example, you could use the PVR2 to compute N-body and fluid simulations instead of drawing triangles, same thing with the PS2's GPU (as a hint, you would do so with blending modes). Drawing 3D graphics is not inherent to the architecture itself, but it's the common / primary use case. So the video should discuss the common case, where the rest is left as an exercise to the viewer. I disagree with your last comment about 60 fps. You could easily write a game that continually just barely hits the 60 fps cap, as the GPU has two limits: visibility and shading. So you could have more than enough room in visibility, but be compute-limited by the shading engine, where a poorly ordered texture cache miss causes you to miss the deadline (this is what happened when drawing 2D sprites). The same thing can happen in modern 3D GPUs, but it didn't usually occur in the older 3D ones like the Voodoo, since the rasterizer was tied to the shading pipeline.

  • @3dfxvoodoo · 10 months ago

    Best hardware channel on YouTube, thanks for the info my friend!

  • @golarac6433 · 10 months ago

    I cannot overstate how much I like your videos. I think I've watched this series 3 times already. I hope you make more videos like this.

  • @pavlo77 · 10 months ago

    Typo: should be ...+ A[0,3]*B[3,0]... at 1:32
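
    (For reference, the term follows from the standard matrix-product formula C[i,j] = sum over k of A[i,k]*B[k,j], so the element at 1:32 should expand to C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] + A[0,2]*B[2,0] + A[0,3]*B[3,0].)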

  • @RTLEngineering · 10 months ago

    Thanks for pointing that out!

  • @wookiedookie04 · 11 months ago

    damn

  • @jankleks4257 · 11 months ago

    I have a general question (as I couldn't find the answer anywhere). Would the overclocking method known from software emulators, which does not break video and audio speed, be possible on FPGAs? Quotation: "For many years, NES emulators had a method of overclocking where additional scanlines are added for the CPU for each video frame," byuu tells Ars Technica. "Which is to say, only the CPU runs on its own for a bit of extra time after each video frame is rendered, but the video and audio do not do so... This new(ish) overclocking method gives games more processing time without also speeding up the video and audio rates... and so the normal game pace of 60fps is maintained."
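
    The quoted scheme maps naturally onto a per-scanline emulator main loop, and the same structure would apply to a cycle-driven FPGA core. A minimal sketch with invented helper names, purely to illustrate the idea:

        SCANLINES_PER_FRAME = 262   # NTSC NES: visible scanlines plus vblank
        EXTRA_CPU_SCANLINES = 128   # hypothetical overclock amount

        def run_frame(cpu, ppu, apu, cycles_per_scanline):
            for _ in range(SCANLINES_PER_FRAME):
                cpu.run(cycles_per_scanline)  # CPU, video, and audio stay in lockstep
                ppu.run_scanline()
                apu.run_scanline()
            for _ in range(EXTRA_CPU_SCANLINES):
                cpu.run(cycles_per_scanline)  # CPU only: video/audio pacing unchanged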

  • @adul00 · a year ago

    I was initially astonished that there were no synthesis results for most of the video, unlike in previous ones. And even more by the results: usually it was a struggle to approach ~300 MHz, and here even the Altera chip was decent.

  • @poiitidis · a year ago

    🤔limits?

  • @RTLEngineering · a year ago

    Perhaps the limits are more relative? You can always sacrifice speed; if you can fit a RISC-V, you could run a software emulator.

  • @poiitidis · a year ago

    @@RTLEngineering That is a most excellent observation. 😌

  • @capability-snob · a year ago

    Do you know what the tradeoffs are between a CSA and a Wallace / Dadda multiplier?

  • @phirenz · a year ago

    "Making the PS2 ahead of its time" Debatable. Nvidia's gforce 256 was releaded in 1999 and had advanced register combiners that were already part way to being full pixel shaders. It had a dot3 mode that could do normal mapping without render to texture. And other GPUs of that time could do render to texture effects, just prehaps not quite as fast as the ps2. The main reason we have documentation of these effects done on the ps2 via epic hacks is because it was the last GPU ever made with a single-stage fully fixed function pixel pipeline (outside of the mobile space) and remained relevant long into the pixel shader era. The graphics programmers of the mid-2000s were desperately trying to make it do things that were common on contemporary GPUs with pixel shaders.

  • @RTLEngineering · a year ago

    Consider the context of the video: memory speed and performance. I believe the fact that the PS2 could do render-to-texture faster than the other GPUs makes it "ahead of its time". You could argue there were other features that make it similar or comparable to other GPUs, but that would require a broader conversation than the one in the video.

  • @capability-snob · a year ago

    This channel is an absolute gem.

  • @TheMasterofComment · a year ago

    I do hope you post more videos; the content is interesting and quite niche on YouTube. The Quake title likely attracted many casual viewers who expected more infotainment-type content; it's unlikely they would understand, and therefore they're not your target audience. Do not be disheartened by those who are bothered by the synthesized voice; after all, with 68k views, at least some haters are expected. Many of us focus on the content.

  • @beefquiche · a year ago

    Would love to hear your thoughts on Robert Peip's N64 FPGA development. It appears to be coming along beautifully, though it may require hardware more powerful than the DE10-Nano to run at full speed.

  • @RTLEngineering · a year ago

    I don't really have many thoughts about it. He's interested in the architecture for the same reason that I am, and he has the skills to pull it off. I have my concerns about it fitting / running at the expected speed / being capable of achieving the required memory bandwidth and latency. However, if anyone can figure out how to make it work, it's probably him. I had no interest in trying as my interest lies with Xilinx hardware - I like the idea of being able to overclock (200 MHz N64 CPU? or 720p rendering?).

  • @beefquiche · a year ago

    So the N64 cannot deliver a frame rate above 30, as a limitation of its video DAC?

  • @RTLEngineering · a year ago

    The DAC operates at 60 Hz in NTSC, but it does scan-line interleaving (480i). That means it can only display a full frame 30 times per second: either full frames at 30 fps or half frames at 60 fps. I guess technically the GPU could draw faster, as there is no limit on how fast the pointer can be flipped, but the DAC will only read the interleaved frames at the fixed display rate.

  • @nathanlamaire · a year ago

    Is it possible to keep RDRAM saturated with data transfers so that stalling doesn't happen?

  • @RTLEngineering · a year ago

    Unfortunately no; stalling is part of the bus architecture (both RDRAM and the internal bus). It's needed for turn-around and synchronization.