We tour the world's fastest supercomputer at Oak Ridge National Laboratory!
Science and technology
Everything Art of Network Engineering: linktr.ee/artofneteng
In this video we get a tour of the world's fastest supercomputers, Frontier and Summit, at @OakRidgeNationalLab! Both of these High Performance Computing (HPC) environments have played significant roles in various areas of research.
Our tour guide, Daniel Pelfrey, Principal HPC Network Engineer, takes us through the challenges of Networking in an HPC environment, and some of them might surprise you!
A huge thank you to our friends at the Knoxville Technology Council for connecting us with the Oak Ridge National Laboratory.
Also, thank you to Kate and Daniel for the tour of ORNL, the supercomputers there, and for making this video possible!
Chapters
-------------------
0:00 Intro
00:58 What's the high level mission of ORNL?
02:07 What makes a High Performance Computing Environment different from Enterprise Networks?
04:09 HPC Network Design
05:09 Introduction to the Frontier Supercomputer
06:40 Inside the Frontier Data Center!
09:54 We get to peek inside a Frontier cabinet!
15:12 The Summit Supercomputer
17:03 HPC Environment Operations
19:24 The teams that keep these HPC environments going
20:20 The Why
21:43 Wrap up
22:36 Outro
Comments: 155
"And is 2 Exaflops" big smile 😊 Nice to see someone that is passionate about his work!
@Alfred-Neuman
29 days ago
It's not even "a" computer, it's basically just a botnet installed locally... The only difference I can see between this and a botnet is that this is installed inside a single room, so they'll get better latency between the different RAM modules, CPUs and GPUs. What I'm trying to say is this technology is not very impressive; they're using pretty much the same CPUs and GPUs that we are using in our cheap desktops. Just a lot of them...
@NorbertKasko
25 days ago
@@Alfred-Neuman For a while they used special processors in these systems, developed directly for them. Cray comes to mind. When clock speeds became less scalable they started to use consumer hardware. This one has 8.7 million processor cores instead of 16 or 64 (talking about high-end desktop machines).
@mikmoody3907
23 days ago
@@Alfred-Neuman Why don't you build a better one and impress us all..................
The biggest difference between HPC networks and corporate networks is the lack of security in favor of performance at all costs. The compute nodes directly access remote memory over the network (RoCE).
One of the coolest aspects of Frontier's network architecture is at the node level. Since all the compute is done on GPUs, the network fabric connects directly to the GPUs instead of something like a PCIe bus. So simulations can transfer directly between GPU memory and the network fabric without involving the CPU or having to move data on or off the GPU to get to the network. It allows for incredibly efficient internode communication.
@noth606
29 days ago
So the GPUs have NICs connected directly to them? With some sort of second MMU with its own NIC? It's a tad unclear from the way you describe it, but I wonder how it connects to the GPU since you say it's not using PCIe?
@chuckatkinsIII
29 days ago
@@noth606 I slightly misspoke. The NICs use PCIe ESM but are connected directly to a PCIe root complex on one of the GPUs. Each node has 4 GPUs, each with 2 dies (so 8 visible GPUs) and a dedicated NIC, so 4 NICs per node. Thus any CPU operation that has to use the fabric actually traverses one of the GPUs to get to a NIC. Source: you can find a bunch of architecture docs for Frontier but I also worked for several years on developing some of the library and software stack for this machine and a few others that were just beginning to come online.
Great video. Thanks for taking us along with you all!
Sorry for nitpicking but he got one thing wrong. The reason you don't use electrical network cables for longer distances is not primarily interference from the power cables; it has everything to do with attenuation. At these speeds it is very hard to get the signal more than a few meters: it will be heavily attenuated and it becomes very hard to distinguish a 1 from a 0. The solution to the problem is to use fibre optics instead.
@grandrapids57
26 days ago
This is correct.
incredibly good interview you did there.
@artofneteng
2 months ago
Thank you!
Awesome tour/interview. Dan seems like a real genuine dude. 👍
ORNL is my dream job. I'd honestly sweep floors just to be in the building.
@iamatt
1 month ago
It isn't all rainbows and unicorns
@jinchey
1 month ago
@@iamatt did you have a traumatic experience at Oak Ridge National Laboratory?
@pipipip815
1 month ago
Good attitude. Doesn’t matter where you get on the ladder, just get on, work hard, learn and be agile.
@stevestarcke
29 days ago
I visited there long ago. It was the most awesome place I have ever seen.
@iamatt
28 days ago
@@jinchey it's an interesting place to work when you get in the mix and actually see how the politics are, let's just say that.
You can tell when the machine is running some serious workloads because the lights flicker in the offices next to it.
No matter how big or small: The network IS the computer… For the past few decades, outside of embedded applications (and even in many situations there), computers have to be connected to a network to have any practical value; every piece of software, and most if not all its data, is sent over a network at some time in its lifecycle.
@dougaltolan3017
1 month ago
Never underestimate the bandwidth of a FedEx truck.
This is amazingly quiet for a system of that size
@artofneteng
1 month ago
Water cooled! The other half of the data center not shown in the video was all storage and that side was LOUD!
Simply Awesome!
Take note of the power cables for each rack, similar to the amount a large house might use, per rack. Removing the heat from those racks is a big part of the design. Air flows from the floor and out the top in active exhausts. A little hard to believe, but compactness is a top priority.
@mikepict9011
1 month ago
Could you imagine their hvac systems!!!! Chillers rated in swimming pools per min
@artysanmobile
1 month ago
@@mikepict9011 They have so much heat to get rid of, the concept of blowing cold air is no longer valid. Fluid is far more effective at conducting heat away from a metal structure and processors are manufactured with built-in liquid cooling. Each rack is built for purpose with an exchanger which takes it directly out of the room, then returns cold for the next batch. If you work on your home’s HVAC unit, you’re familiar. A widely distributed system like that can be monitored and adjusted for best efficiency.
@mikepict9011
1 month ago
@@artysanmobile Yeah, that's usually part of a larger cascading system when you consider the envelope. The heat in the liquid ultimately needs to be rejected outside. The unit that does that is called a chiller in a liquid system and a condenser in a direct exchange system. But yeah, vapor compression, pipe joining. It's what I do.
@mikepict9011
1 month ago
@@artysanmobile I serviced the mini chillers that cool MRI machines. They still had 1 air-to-refrigerant DX HX and 2 coaxial heat exchangers (HX) with 2 pumps. Simple systems compared to real low-temp refrigeration.
@hamsolo8165
10 days ago
Frontier is water cooled. You have water doing the heat exchange, not your traditional HVAC. There is still HVAC in the room since there are other non water cooled systems in the same room (storage and commodity gear). The switches, controllers and nodes in Frontier are all water cooled.
Computers are like watches now. We need to start making computers that last hundreds of years, in my opinion.
I am in Security now but I really miss being a network engineer. Thank you for sharing on this platform.
I’ve been in the computer rooms at Fort Meade. Awe inspiring
@ronaldckrausejr7762
22 days ago
Fort Meade is also volumes faster than this system. It’s just the specs are classified - someone will know those specs eventually (perhaps in 20-30 years). Even Snowden knew the NSA has had the best computer in the world since 2002
Why not check out the visualization suite? That's the coolest part.
@artofneteng
13 days ago
We were there specifically to talk about the supercomputing network. We did have access to and see other things while onsite, but were only authorized to record the networking piece seen here.
I’m surprised they can even talk in there. I’ve been in some major data centers and communication can be difficult.
@artofneteng
1 month ago
They were water cooled so no fans on that side of the DC. The other side was storage which still had traditional cooling and was very loud!
@artysanmobile
1 month ago
@@artofneteng Ah, that makes sense.
The power supply behind them is unbelievable. Enough for a town.
@chuckatkinsIII
12 days ago
About 10 years ago, when they built the Trinity supercomputer at Los Alamos, they were able to save a bunch on power costs by partially diverting the river that runs through the town into giant pipes that run under the lab's datacenter for cooling. That was a wild one.
@artysanmobile
12 days ago
@@chuckatkinsIII Intelligent use of resources makes me happy.
Amazing tour, mind blowing stuff
@artofneteng
22 days ago
Glad you enjoyed it!
Great Video, thank you. Interesting would have been the type of Failures they see - Overheating, Bad Solder, Caps fail, Fans/Plumbing fails etc
@artofneteng
1 month ago
We did learn that they have full service staff provided by OEMs of the supercomputer. They were there performing maintenance that day. Our POC didn't have specifics on hardware failures of the HPC environment, I'll see if he has anything on the networking components.
@iamatt
1 month ago
L3 cache errors, for one.
@iamatt
1 month ago
@@artofneteng MTBF was 1 hour at first
@iamatt
1 month ago
@@artofneteng blue jackets are pushing carts all day
After working with one we heard the gruntiest one is in Japan now rather than Oak Ridge.
What happens to the hardware after they are removed from Oak Ridge? Is there still some value in them besides recycling?
@hamsolo8165
10 days ago
It can be sold as spare parts, and/or recycled.
Great video
In supercomputing it's either Network or Notwork :DD
@tuneboyz5634
1 month ago
thats really funny lil buddy 😊
Did I see Cray - oh my - that is just awesome
@iamatt
1 month ago
And AMD not NVDA 😂
7:50 his head got a head 😂 i cant stop seeing this
I can't imagine working there with all these computers and so much electric field energy; hopefully it is not affecting people's health. Any EMI/EF Faraday cage?
The guy in light blue needs a trimmer wardrobe.
Fascinating... still don't understand much of what that "time machine" is all about, but fascinating nevertheless... even though I think a DMC DeLorean properly retrofitted for time travel offers a bit more practicality and excitement in terms of time travelling!! Haha
@artofneteng
22 days ago
The time machine reference was that the supercomputer has done in a shorter amount of time what would have taken us years to complete without it. It dramatically speeds up research.
It would have been nice to learn a few things about how that plethora of processors is organised, how they work together and, most of all, how the output from all the processors is combined into one meaningful result. I can imagine a number of cores with a part of a programme running on each core, but with such a huge number of processors this can't be done any more.
@hamsolo8165
10 days ago
Granted, this is a 'network' overview. Go read about MPI. Swear a lot. And it will start to come together. Jobs that run on systems like this can run on hundreds of nodes at the same time. It's not at all impossible. And yes, they can run containers if you wanted to do so.
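For readers wondering how partial results from many processors end up as one answer, here is a minimal sketch of the split-then-reduce pattern that MPI-style jobs rely on. It is not MPI and it is not how Frontier's software works: the "ranks" are simulated in a plain loop inside one process, and the names `partial_sum` and `tree_reduce` are made up for illustration.

```python
# Simulated split-then-reduce. Real HPC jobs would use an MPI library and
# run each rank on a different core or node; here the ranks are just a loop.

def partial_sum(rank, size, n):
    """Work done by one simulated rank: sum its strided share of 0..n-1."""
    return sum(range(rank, n, size))

def tree_reduce(values):
    """Combine values pairwise, level by level, until one result remains."""
    vals = list(values)
    while len(vals) > 1:
        paired = []
        for i in range(0, len(vals), 2):
            if i + 1 < len(vals):
                paired.append(vals[i] + vals[i + 1])  # combine two neighbours
            else:
                paired.append(vals[i])  # odd one out moves up unchanged
        vals = paired
    return vals[0]

size = 8          # number of simulated ranks (an arbitrary choice)
n = 1_000_000     # problem size (an arbitrary choice)

partials = [partial_sum(rank, size, n) for rank in range(size)]
total = tree_reduce(partials)

print(total == sum(range(n)))
```

Each rank only ever touches its own slice of the problem, and the pairwise combine needs about log2(size) rounds rather than size steps, which is roughly why the approach still works with a very large number of processors.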
Not sure I understand the "noise" issue with copper Ethernet? It is transformer coupled at each end, self balancing with common mode induced noise rejection via the twist. I've seen it run around along with the wiring for 3 phase CNC equipment with no issues. Even at those scales I am not sure I buy that explanation. Length would be a real issue at that scale rather than noise I would have thought.
@hamsolo8165
10 days ago
The speed of the link and acceptable error rates for encoding on those links is a limiting factor. Noise, heat and other factors all play a role in maintaining high speed links.
You’re here to look at the networking in here as an electrician looking at the electrical.
No way do you get access to the world's fastest computer... Hypersonic missile systems are classified :P
@iamatt
1 month ago
Open research; classified systems are in other DCs.
200Gb, lol I have that between the switches at work which I put in like 3 years ago.
@stacksmasher
19 days ago
You don’t understand. Each device in this platform has 200Gb so every CPU, GPU and storage connection has 200Gb direct to the platform.
@artofneteng
13 days ago
Thank you!
@hamsolo8165
10 days ago
Yes, you likely did a few uplinks as ISL. That's cool. However, on a Slingshot switch -every- single one of those 64 ports does 200 Gbps, both to the nodes connected to it and towards the fabric.
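To put those figures side by side, here is a rough back-of-the-envelope calculation using only numbers quoted in this thread: 64 ports per switch, 200 Gbps per port, and 4 NICs per node. That every NIC runs at the same 200 Gbps is an assumption, and the results count one direction only.

```python
# Figures quoted in the comments above.
PORTS_PER_SWITCH = 64
GBPS_PER_PORT = 200
NICS_PER_NODE = 4   # from the earlier comment on Frontier's node layout

# Aggregate one-direction bandwidth if every port on one switch is busy.
switch_gbps = PORTS_PER_SWITCH * GBPS_PER_PORT
switch_tbps = switch_gbps / 1000

# Assumption: each of the 4 NICs in a node also links at 200 Gbps.
node_gbps = NICS_PER_NODE * GBPS_PER_PORT

print(f"per switch: {switch_gbps} Gbps ({switch_tbps} Tbps)")
print(f"per node:   {node_gbps} Gbps")
```

So the comparison is less about one 200 Gbps link and more about having that rate on every port at once.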
How many instances of doom can it load?
You know if they switched off all those small diodes on each server, blinking all the time, consuming power, I wonder how many watts that is total. You really only need those lights to debug if something is working right? could be a little switch instead to toggle those on and off
@sky173
29 days ago
You can think of five LEDs as using about 1 watt of power. In the grand scheme of things, if they were switched off, most people would not notice that any energy was saved. A home computer costs (on average) roughly $35-$40 a year to run 8 hours a day (possibly much less). Those same five LEDs (diodes) that you mentioned would cost 35-40 cents to run 8 hours a day for a full year (or just over a dollar per year if running 24/7).
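The arithmetic behind those LED figures can be checked with a short sketch. The 1 watt for five LEDs comes from the comment above; the electricity price of $0.13 per kWh is an assumption picked for illustration, and `annual_cost_usd` is just a helper name.

```python
def annual_cost_usd(watts, hours_per_day, usd_per_kwh):
    """Yearly cost of a constant load that runs hours_per_day every day."""
    kwh_per_year = watts * hours_per_day * 365 / 1000
    return kwh_per_year * usd_per_kwh

PRICE_USD_PER_KWH = 0.13   # assumed price, not from the thread
FIVE_LEDS_WATTS = 1        # "five LEDs use about 1 watt" per the comment

cost_8h = annual_cost_usd(FIVE_LEDS_WATTS, 8, PRICE_USD_PER_KWH)
cost_24h = annual_cost_usd(FIVE_LEDS_WATTS, 24, PRICE_USD_PER_KWH)

print(f"8 h/day: ${cost_8h:.2f} per year, 24/7: ${cost_24h:.2f} per year")
```

At that assumed price the results fall in the same range as the comment's 35-40 cents and "just over a dollar"; a different tariff scales both numbers proportionally.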
@woodydrn
29 days ago
@@sky173 But it's quite redundant to have them, right? You don't need them at all really
@artofneteng
13 days ago
If they're blinking you know it's working.
Are these super computers shielded against EMP
@artofneteng
1 month ago
Great question! I don't recall if he said whether they are or not.
We will have the same processing power in a phone in around 20 years. I watched a documentary about a supercomputer the size of a factory, and it wasn't as fast as a new phone 10-15 years later.
@hamsolo8165
10 days ago
Moore's law says yes. We just need people learning engineering so that we can push those limits. Innovation and education go hand in hand.
Where old system goes, eBay?
@olhoTron
1 month ago
I think it will go to auction
@eliasboegel
23 days ago
It's usually auctioned off.
I would love to work there. Tired of making 10 Gb as fast as possible. Mind you, I got into a teraflop
But can it run doom ?
wheres the NSA Stickers?
@artofneteng
13 days ago
I'm sure the NSA has their own supercomputers.
@hamsolo8165
10 days ago
At Fort Meade. :P
@ThisDJ808
10 days ago
@@artofneteng yeah, and they're plugged into this.
So SkyNet is a Tennesseean.
did he really tell what is the use of these machines?
But can it run crysis?
@tuneboyz5634
1 month ago
no
@drooplug
1 month ago
You spelled Doom wrong.
@munocat
1 month ago
how many chrome tabs can it handle?
@TAP7a
1 month ago
In all seriousness, not very well. Games have such tight latency requirements that any distributed system is immediately going to fall on its face. Even chiplet-to-chiplet within the same CPU package has proven to be enough to affect game experience - reviews of the R9 7950X all identified that frame pacing was affected dramatically when threads moved between CCDs, let alone moving between entire racks. Now, playing Crysis on a single unit, especially if it has both CPU and GPU compute...
@id104335409
1 month ago
Nothing can.
But can it run Crysis?
I would hate to see their electric bill.
But can it run doom
Asks what an exaflop is, proceeds not to explain it. Exaflop (EXA FLoating point OPerations per Second): one quintillion floating point operations per second.
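To give that number some scale, here is a small sketch comparing the roughly 2 exaflops mentioned in the first comment with a hypothetical machine that sustains 1 teraflop. The 1 teraflop figure is an assumption chosen only for comparison, not a measurement of any real device.

```python
EXA = 10**18    # one quintillion
TERA = 10**12   # one trillion

# "2 Exaflops" is the figure quoted in the comments for Frontier.
frontier_flops = 2 * EXA

# Hypothetical comparison machine sustaining 1 teraflop (an assumption).
small_machine_flops = 1 * TERA

# Time the small machine needs to match one second of Frontier's work.
seconds_needed = frontier_flops // small_machine_flops
days_needed = seconds_needed / 86_400   # 86,400 seconds per day

print(f"{seconds_needed:,} seconds, about {days_needed:.1f} days")
```

Under those assumptions, one second of work at 2 exaflops corresponds to roughly three weeks on the 1 teraflop machine.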
OMG he asks so many stupid and repeated questions about the network cables....
@artofneteng
1 month ago
Some clarifying questions never hurt, and this channel is Network Engineering focused.
Can it play Minecraft?
Those poor bastards having to deal with AMD GPU drivers in HPC.
@evanstayuka381
28 days ago
Is that a problem? Can you expatiate?
@switzerland3696
28 days ago
@@evanstayuka381 Driver and firmware stability vs nvidia, look at the drama around Geohot / tinygrad / tinycorp having to abandon AMD as their primary platform due to the lack of stability.
@gunturbayu6779
27 days ago
None of that matters when they use custom software and write their own code. The HPC systems are used for open computing, so CUDA won't matter. Now you better go back to your GTX 1650, fanboy.
@switzerland3696
26 days ago
@@gunturbayu6779 Your statement makes no sense, and why the hate? As they say, when you do not have a good argument you resort to personal attacks.
@switzerland3696
26 days ago
@@evanstayuka381 I thought I already replied to this, or perhaps the post got deleted because the truth was too hard to handle. The AMD driver and firmware are comparatively unstable next to the NVIDIA driver and firmware stack. Look at the drama that Geohot / Tinygrad / Tinycorp had trying to go with AMD GPUs: they had to abandon AMD as the tinybox standard and offer the NVIDIA option as the primary option, as they could not get the driver / firmware stability required for a shippable platform. Let's see if this post gets deleted.
Panduit.
Quantum computers are exponentially faster.
Heheh, all I'm thinking is how cool it would be to play Minecraft on it
Bitcoin miners😂
Balder dude has been wearing headphones for toooooo long....
@artofneteng
13 days ago
Well, he's a podcaster too so...
Get that man clothes that don't look like he was just shrunk into
it should mine BTC :P
They need to generate crypto to pay for future machines and upgrades.
@kilodeltaeight
29 days ago
lol. No. Who needs crypto when you can literally just print more dollars? DoE has a massive budget regardless.
@rtz549
29 days ago
@@kilodeltaeight Then they could construct a supercomputer that had no final size or limitations.
This guy doesn’t seem like he’s ever seen the inside of a data center before. What embarrassingly basic questions, which didn’t even get to what’s special about their setup or capability.
Not impressed. It was outdated as soon as it was deployed.
Are you sure it's the fastest supercomputer in the world? China has come a long way in this field and would be very competitive.
@artofneteng
13 days ago
At the time we recorded this video, which was August of 2023, Frontier held the title of fastest supercomputer in the world.
@hamsolo8165
10 days ago
You build the world's fastest car and run it on your own private hidden race track. How do you then convince the world your car is faster than the fastest car out there when you refuse to show it to anyone? That's what China is doing at the moment. They very likely have some fast system(s) hidden somewhere. Maybe. :)
Horrible soundtrack!
@artofneteng
13 days ago
We appreciate your feedback
Chose amd to save money, could be faster with intel omegalul
@channel20122012
29 days ago
Faster with Intel ? Are you living under a stone? Lol
@kilodeltaeight
29 days ago
The real crunching of data here is happening on GPU cores, not the CPUs. Those are just managing the GPU cores, effectively. With a system like this, your biggest concern is power and cooling, so efficiency is what matters. AMD very much wins there, and has the experience with building large systems like this - ergo, they won the contract.
@gunturbayu6779
27 days ago
The funny thing is, this Oak Ridge machine will be number 2 once the El Capitan project is done, and it will be AMD at number 1 and 2 for fastest supercomputer. Intel will have number 3 with 80% more power usage lmao.
@hamsolo8165
10 days ago
@@gunturbayu6779 You forgot Aurora!
It's just a lot of servers clustered together.... what's the big deal? Server clustering has been around for decades....
@artofneteng
13 days ago
Right, but this particular cluster of servers has a MASSIVE amount of compute that has done amazing things for us!
@EvoPortal
13 days ago
@@artofneteng Who cares? It's the exact same thing, it's just a bunch of servers clustered together. With enough money you can make one twice the size. There is nothing revolutionary here.
@hamsolo8165
10 days ago
@@EvoPortal to a certain degree you are correct. HPC systems are just "bigger meaner" versions of a traditional cluster. Your traditional cluster could be optimized for a multitude of things where most HPC clusters are optimized for parallelism where speed is the key to running and processing massive sets of calculations. Every bit of the HPC cluster is optimized for speed. A traditional cluster is usually not built explicitly for speed. It's built for function. HPC is built for both speed and function. The innovation these days is coming (mostly, and imo) from faster/better connected fabrics. And with PCIe 7.0 soon to come out too. Evolution... with the occasional revolution thrown in there for fun.
You talk like a child!
For playing games?
What a great video, so informative must be a real privilege to work on that system. Reading about it here as well so impressive. en.wikipedia.org/wiki/Frontier_(supercomputer)