Analysis of a Tensor Core

Science and technology

A video analyzing the architectural makeup of an Nvidia Volta Tensor Core.
References:
Pu, J., et al. "FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 Single-Precision FPU, and a 43.7GFLOPS/W at 74.6GFLOPS/mm2 Double-Precision FPU, in 28nm UTBB FDSOI" (2016). arxiv.org/abs/1606.07852
Markidis, S., et al. "NVIDIA Tensor Core Programmability, Performance & Precision" (2018). arxiv.org/pdf/1803.04014.pdf
Tensor Core overview:
www.nvidia.com/en-us/data-cen...
Tensor Core + Volta Architecture Whitepaper:
images.nvidia.com/content/volt...
CUDA programming details for Tensor Cores:
devblogs.nvidia.com/programmi...
I am not affiliated with any of the companies mentioned in the video. This video is intended for educational purposes.

Comments: 33

  • @jackiegammon2065 · 1 year ago

    What a great shop and tour! I LOVE the detail and the thought process of creating something that will last for many, many decades.

  • @prashplus · 5 years ago

    Hats off to you, bro...

  • @Face2FaceHardware · 4 years ago

    Great job 👍

  • @richardcasey4439 · 1 year ago

    Well-done explanation.

  • @khoakirokun217 · 4 years ago

    Good video 😍

  • @AnatoliyRU · 5 years ago

    It would also be interesting to know about the new schedulers used for dispatching ray-tracing routines (e.g. closest-hit/any-hit, which are dynamically scheduled). Are they accessible directly (or at least indirectly) from CUDA cores (C++)?

  • @RTLEngineering · 5 years ago

    I haven't tried using any of the new Volta architecture with CUDA. From what I understand, the RT cores are only able to accelerate ray-triangle intersections and tree traversal, so you would have to write a kernel that effectively does the ray tracing but is scheduled partly on the RT cores. As for how to actually write a kernel to use the RT cores, it looks like you basically create special CUDA kernels which use an rtcontext, which can share the kernel context and memory pointers with a normal CUDA context. Also, there appears to be no way to combine the RT kernels and CUDA kernels; any communication needs to be handled on the host side (one of the biggest flaws of CUDA in my opinion: the inability to run a kernel continually without direct control from the host). The API for programming with the RT cores is part of CUDA 10, which is freely accessible from NVidia's website. Hopefully that was helpful.

  • @dennisrkb · 2 years ago

    Fantastic overview. Any chance of a follow-up with some CUDA C samples?

  • @RTLEngineering · 2 years ago

    Thanks! I could have sworn someone else asked a similar question. The CUDA C samples can be found on NVidia's website, but the gist is: 1) you first prime the tensor cores via wmma::load_matrix_sync, 2) then you perform the operation via wmma::mma_sync, 3) then you read back the result via wmma::store_matrix_sync. A minimal sketch of those steps is below.
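    A minimal sketch of those three steps, assuming a single warp computing one 16x16x16 tile with fp16 inputs and an fp32 accumulator (the kernel name and launch shape here are illustrative; the wmma calls are the actual API from mma.h, and this needs sm_70 or newer):

        #include <cuda_fp16.h>
        #include <mma.h>
        using namespace nvcuda;

        // One warp computes D = A*B for a single 16x16 tile.
        __global__ void wmma_gemm_tile(const half *a, const half *b, float *d) {
            // Per-warp fragments for A, B, and the accumulator.
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> d_frag;

            wmma::fill_fragment(d_frag, 0.0f);               // zero the accumulator
            wmma::load_matrix_sync(a_frag, a, 16);           // 1) prime the tensor cores
            wmma::load_matrix_sync(b_frag, b, 16);
            wmma::mma_sync(d_frag, a_frag, b_frag, d_frag);  // 2) the 16x16x16 MMA
            wmma::store_matrix_sync(d, d_frag, 16, wmma::mem_row_major); // 3) read back
        }

    Launch it with at least one full warp, e.g. wmma_gemm_tile<<<1, 32>>>(dA, dB, dD), since the wmma operations are warp-wide collectives.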

  • @cem_kaya · 2 years ago

    Thanks for the explanation. Is there a source you can recommend about warp scheduling and SMs?

  • @RTLEngineering · 2 years ago

    It would depend on what information you want specifically, and how far down the rabbit hole you want to go. There's very little detail on NVidia's part though, and most of it is educated speculation from the architecture community, much of which is not collected in single sources or written down as it's considered "common knowledge" or "obvious". Regarding warp scheduling, there's even less detail there. It's mostly a resource allocation problem where entire SMs are allocated warps based on the scheduling engine in the GPU (it's not clear if it's a software scheduler running on a small CPU, or if it's a hardware scheduler that iterates over a descriptor list - my guess would be a software scheduler though).

  • @bhuvaneshs.k638 · 4 years ago

    Systolic array multiplier, like the TPU's MXU unit.

  • @RTLEngineering · 4 years ago

    Thanks! I didn't realize that it was called a systolic array. The layout of a systolic array makes a little more sense, but I believe the implementation presented in the video is functionally equivalent (see the reference loop below). Additionally, both implementations have the same bandwidth limitations.
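    For concreteness, this is the operation both structures compute, written as a plain reference loop (fp32 throughout for simplicity; the real unit takes fp16 inputs with fp32 accumulation per the Volta whitepaper):

        // Reference semantics of one tensor core op: D = A*B + C on 4x4 tiles.
        void mma_4x4_reference(const float A[4][4], const float B[4][4],
                               const float C[4][4], float D[4][4]) {
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    float acc = C[i][j];            // start from the accumulator input
                    for (int k = 0; k < 4; k++)
                        acc += A[i][k] * B[k][j];   // 4-element dot product
                    D[i][j] = acc;
                }
        }

    A systolic array pipelines these MACs through a grid of processing elements, while the layout in the video evaluates the dot products in parallel trees, but the results (and the operand bandwidth needed per result) are the same.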

  • @bhuvaneshs.k638 · 4 years ago

    @@RTLEngineering You're welcome... Yeah, even the implementation in the video would give the same result... Anyway, thanks for the video 👍

  • @jaxx4040 · 5 months ago

    Funny to think how we see tessellation as triangles when it’s a triangle representing a pyramid, representing points.

  • @pavlo77 · 10 months ago

    Typo: it should be ...+ A[0,3]*B[3,0]... at 1:32.
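    (For reference, this is just the standard matrix-product expansion of the first output element, in the same bracket notation:

        C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] + A[0,2]*B[2,0] + A[0,3]*B[3,0]

    so the indices of the last term must be [0,3] and [3,0].)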

  • @RTLEngineering · 10 months ago

    Thanks for pointing that out!

  • @wookiedookie04 · 11 months ago

    damn

  • @blmac4321 · 1 month ago

    "Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers." A bfloat16 MAC, one addition and one multiplication per clock: ~100 LUTs + 1 DSP48E2 @ >600 MHz, with the result accumulated in >256 bits. A tensor core needs 64 of these => ~6,400 LUTs + 64 DSP48E2.

  • @blmac4321 · 1 month ago

    It's on LinkedIn, and eventually will be on arXiv. YT is not letting me post more, not sure why.

  • @Saturn2888 · 1 year ago

    I commented on another video about it sounding like a computer speaking. This video sounds like a human, but the mic quality is much lower.

  • @RTLEngineering · 1 year ago

    I did the actual voice recording for this video several years ago. It was a lengthy editing process, which I got tired of, causing me to stop producing videos for 2 years. The thing that got me back into producing them was the AI speech synthesis. For me, it's a tradeoff between time commitment and production value, and I don't think the increased value of recording audio is worth increasing the production time by 10x (especially considering all of the time spent researching and verifying the video material beforehand).

  • @gsestream · 8 months ago

    So why don't you just say "matrix operation core" or "matrix multiplication core"? Why make things complicated with differing terminology like "tensor"?

  • @RTLEngineering · 8 months ago

    Probably because the association was with AI/ML workloads, which work with tensors (matrices are a special case of the more general tensor object). Though I am not sure why "Tensor Core" was chosen as the name, since other AI/ML architectures call them "Matrix Cores" or "MxM Cores" (for GEMM). It might just be a result of marketing. I would say "MFU" or "Matrix Function Unit" would be the most descriptive term, but that doesn't sound as catchy.

  • @cun_0092 · 4 years ago

    For normies: Tensor Core = DLSS + raid tracing = BETTER GAMING. For machine learning: Tensor Core = better and faster output. I see two different worlds...

  • @RTLEngineering · 4 years ago

    I think you meant Ray Tracing. As I understand it, the Tensor Cores are not used for RT acceleration and are only used for DLSS. DLSS is a Convolutional Neural Network (CNN) evaluated on an image (so basically a bunch of filter passes), which is what the Tensor Cores are really good at doing. The interesting thing in terms of machine learning is that it's not clear how the Tensor Cores are set up internally (I doubt NVidia will let us know). Though, if you look at the NVidia example code, you load the matrices first and then do the matrix-matrix multiply. So in order to get the most usage, you probably need to be doing a data-stationary or weight-stationary operation, roughly like the sketch below. If you need to change both data and weights, then using the FP32 units will probably yield better performance. So not necessarily faster for ML either.
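    A rough sketch of what a weight-stationary pattern might look like with the wmma API (the loop structure and names here are illustrative assumptions, not NVidia's example code):

        #include <cuda_fp16.h>
        #include <mma.h>
        using namespace nvcuda;

        // Keep one 16x16 weight tile resident in a fragment and stream the
        // input tiles past it, amortizing the cost of loading the weights.
        __global__ void weight_stationary(const half *w, const half *x,
                                          float *y, int num_tiles) {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> w_frag;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> x_frag;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> y_frag;

            wmma::load_matrix_sync(w_frag, w, 16);      // weights loaded once
            for (int t = 0; t < num_tiles; t++) {       // data tiles streamed through
                wmma::fill_fragment(y_frag, 0.0f);
                wmma::load_matrix_sync(x_frag, x + t * 256, 16);
                wmma::mma_sync(y_frag, w_frag, x_frag, y_frag);
                wmma::store_matrix_sync(y + t * 256, y_frag, 16, wmma::mem_row_major);
            }
        }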

  • @cun_0092 · 4 years ago

    @@RTLEngineering Hmmm... thanks, that "ray" was wrong due to predictive text. Also, I've heard the general claim that Tensor Cores will "improve deep/machine learning performance". I don't know if it's true or not, but what are your thoughts? I'm going to buy a laptop for machine/deep learning purposes, and I was deeply interested in Tensor Cores due to their capability for good deep learning performance. So I'm a bit confused whether to spend some money and get an RTX card or go with the default GTX cards. Please reply, I would really like to know whether it will make any difference. Also, I absolutely loved your video. Even though I'm not a pure computer science student and started ML as a hobby, I was able to get about 85-90% of the Tensor Core concept. Thank you.

  • @RTLEngineering · 4 years ago

    That will entirely depend on the application, from both what the underlying algorithm is and how the code is written. For example, if you are using some ML code that doesn't use the TC instructions, then the TCs won't improve anything. Similarly, if the code isn't written to take full advantage of the TCs, then they may end up having no improvement at best, and could end up resulting in a reduction in performance at worst. If the ML code uses some higher-level library like TensorFlow, then I'm not sure if the underlying kernels will take advantage of the TCs at all (I would imagine that they have added such features, but that may not be practical). If the cost difference doesn't matter to you, I would go with an RTX card just to have the TCs if you need them / to play around with, but keep in mind that the VRAM matters too. To get the most performance, you want to be continually feeding the compute units with data; however, if you can only fit a single data set into VRAM, then you may have identical performance to a GTX card instead. Likewise, if you are only processing a single data set at a time, you may not see a performance improvement at all. So it depends on the application. Personally, I went with RTX cards in my most recent builds so that I had the TCs and RTs to play around with, though I have yet to use them.

  • @cun_0092 · 4 years ago

    @@RTLEngineering Hmmm... okay, so for now, or for a few years, TCs won't have much effect on ML areas, so it's better to go with a cheaper GTX for now. Anyway, thanks for your advice.

  • @RTLEngineering · 4 years ago

    That isn't what I said. I said that it will depend on the application being run... It's like traction control in a car: it's really only useful if you go off-road (needing it would depend on where you plan to drive). I don't know what you plan on doing that's ML, so I can't suggest anything further... If you plan on writing your own CUDA kernels for ML, then you can make sure to make use of the TCs. If you are using someone else's code, then it depends on how they wrote their code.

  • @pperez1224 · 4 years ago

    Tensors and matrices are not the same mathematical objects. There is some confusion in there.

  • @RTLEngineering · 4 years ago

    Partially correct: matrices are a type of rank-2 tensor, so they are a subset. Some tensors are matrices, and all matrices are tensors, but not all tensors are matrices. It would be more accurate to call it a "Matrix Core", but that doesn't sound as catchy. You could also call it a "Systolic Array Processor", but that's also not as catchy. I suspect they were thinking about the fact that you can build up a rank-N tensor operation from rank-2 operations (technically you can do it with rank-1 operations, as in SIMD); see the worked example below. Anyway, blame the name confusion on marketing.
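    As a concrete example of reducing a higher-rank operation to matrix multiplies (standard multilinear algebra, nothing NVidia-specific): contracting a rank-3 tensor T of shape m x n x p with a matrix B of shape p x q over the last index,

        C[i,j,l] = sum over k of T[i,j,k] * B[k,l]

    is exactly the rank-2 product unfold(T) * B, where unfold(T) reshapes T into an (m*n) x p matrix; C is then reshaped back to m x n x q.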
