Dealing With Big Data - Computerphile

Big Data sounds like a buzzword and is hard to quantify, but the problems with large data sets are very real. Dr Isaac Triguero explains some of the challenges.
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments: 175

  • @griof (2 years ago)

    Developer: we use a 3 GB database to plot some dashboards with statistical information about the customers' behavior.
    Marketing team: we use big data, machine learning and artificial intelligence to analyze and predict customers' actions at any given time.

  • @NeinStein (2 years ago)

    But do you use blockchain?

  • @TheStruders (2 years ago)

    Hahaha this is so true

  • @dutchdykefinger (2 years ago)

    exactly this.

  • @laurendoe168 (2 years ago)

    Not quite.... "Marketing team: we use big data, machine learning and artificial intelligence to MANIPULATE customers' actions at any given time."

  • @johnjamesbaldridge867 (2 years ago)

    @@NeinStein With homomorphic encryption, no less!

  • @RealityIsNot (2 years ago)

    The problem with the term "big data" is that it went from technical jargon to marketing jargon, and marketing departments don't care what the word means; they create their own meaning 😀. Other examples include AI and ML.

  • @kilimanjarocruz660 (2 years ago)

    100% this!

  • @Piktogrammdd1234 (2 years ago)

    Cyber!

  • @landsgevaer (2 years ago)

    ML seems pretty well defined to me...? For AI and BD, I agree.

  • @danceswithdirt7197 (2 years ago)

    Big data is two words. ;)

  • @nverwer (2 years ago)

    More examples: exponential, agile, ...

  • @letsburn00 (2 years ago)

    I never realised just how much information there was to store until I tried downloading half a decade of satellite images from a single satellite at a fairly low resolution. It was a quarter terabyte per channel, and it was producing over a dozen channels. Then I had to process it....

  • @isaactriguero3155 (2 years ago)

    What did you use to process it?

  • @letsburn00 (2 years ago)

    @@isaactriguero3155 Python. I started with CV2 to convert to Numpy arrays, then worked with Numpy. But it was taking forever until I learnt about Numba. Numba plus pure-type Numpy arrays is astonishingly effective compared to pure Python; I'll never look back now that I'm used to it. I need my work to integrate with Tensorflow too, so Python works well with that.
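The pattern this comment describes (Numba-compiled loops over single-dtype NumPy arrays) can be sketched roughly as follows; `band_mean` and the array shape are invented for illustration, and the `try/except` falls back to plain Python so the sketch still runs without Numba installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    njit = lambda f: f  # fallback: run the plain-Python version if Numba is absent

@njit
def band_mean(img):
    # Explicit loops like this are slow in interpreted Python but fast under
    # Numba, because the array has one known dtype ("pure type" NumPy data).
    total = 0.0
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            total += img[i, j]
    return total / img.size

# stand-in for one low-resolution satellite channel
channel = np.ones((512, 512), dtype=np.float32)
print(band_mean(channel))  # 1.0
```

The first call pays a one-off compilation cost; subsequent calls on same-typed arrays run at near-C speed.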

  • @gherbihicham8506 (2 years ago)

    @@letsburn00 Yeah, that's still not big data, since you are presumably only using Python on a single node. If the data is coming in in real time and needs to be processed instantly, then you'll need streaming tools like Apache Kafka. To be stored and analysed/mined, it needs special storage and processing engines like Hadoop, NoSQL stores and Spark; rarely do you use traditional RDBMSs as stores unless they are special enterprise-level appliances like Teradata, Greenplum or Oracle appliances. Data processed with traditional methods, like single-node machines and traditional programming-language libraries, is not a big data problem. Many people confuse that, because they think big volumes of data are big data.

  • @letsburn00 (2 years ago)

    @@gherbihicham8506 Oh, I know it's not big data. "The Pile" is big data. This is just a tiny corner of the information that is available. But it was a bit of interesting perspective for me.

  • @thekakan (2 years ago)

    @@NoNameAtAll2 I switch back and forth between Python, R and Julia. I love all three! Julia is the fastest, but R and Python have far better support and (usually) an easier time in development. When you need the best compute power, Julia it is! It's quite amazing

  • @SlackWi (2 years ago)

    I work in bioinformatics and I would totally agree that 'big data' is anything I have to run on our university cluster

  • @urooj09 (2 years ago)

    @you- tube Well, you have to study biology a bit. In my college, bioinformatics students take at least one semester of biology, and then they take courses on biology depending on what they want to code.

  • @iammaxhailme (2 years ago)

    I used to work in computational chemistry... I had to use large GPU-driven compute clusters to do my simulations, but I wouldn't call it big data. I'd call it "big calculator that crunches molecular dynamics for a week and then pops out a 2 MB results .txt file" lol

  • @igorsmihailovs52 (2 years ago)

    Did you use network storage for MD? Because I was surprised to hear in this video how specific it is to big data. I am doing CC now, but QC rather than MD.

  • @iammaxhailme (2 years ago)

    @@igorsmihailovs52 Not really. SSH into a massive GPU compute cluster, start the simulation, SCP the results files (which were a few gigs at most) back to my own PC. Rinse and repeat.

  • @KilgoreTroutAsf (2 years ago)

    Coordinates are usually saved only once every few hundred steps, with intermediate configurations being highly redundant and easy to reconstruct from the nearest snapshot. Because of that, MD files are typically not very large.

  • @Jamiered18 (2 years ago)

    It's interesting, because at my company we deal with petabytes of data. Yet I'm not sure you could call that "big data", because it's not complex and it doesn't require multiple nodes to process.

  • @ktan8 (2 years ago)

    But you'll probably need multiple nodes to store petabytes of data?

  • @jackgilcrest4632 (2 years ago)

    @@ktan8 Maybe only for redundancy

  • @Beyondthecustomary (2 years ago)

    @@ktan8 Large amounts of data are often stored in RAID for speed and redundancy.

  • @mattcelder (2 years ago)

    That's why big data isn't the same thing as "large volume". "Large" is subjective and largely dependent on your point in time. 30 years ago, you could've said "my company deals with gigabytes of data" and that would've sounded ridiculously huge, like petabytes do today. But today we wouldn't call gigabytes big data. For the same reason, we wouldn't call petabytes "big data" unless there's more to it than sheer volume.

  • @AyushPoddar (2 years ago)

    @@Beyondthecustomary Not necessarily. Most of the large data I've seen being stored (think PB) is stored in distributed storage like HDFS, which came out of Google's GFS. RAID would provide redundancy and fault tolerance, but there is no HD that I know of that can store a single PB file, and it'll surely not be as inexpensive as RAID suggests.
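The HDFS-style scheme this reply points at (a file split into fixed-size blocks, each block copied to several nodes) can be illustrated with a toy model; the function, the block size and the node names are invented for the sketch and are not the real HDFS API:

```python
from itertools import cycle

def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy model of HDFS-style storage: split a file into fixed-size blocks
    and assign each block to `replication` distinct nodes, round-robin."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    node_iter = cycle(range(len(nodes)))
    placement = {}
    for b in range(n_blocks):
        start = next(node_iter)
        # each block gets `replication` consecutive nodes, wrapping around
        placement[b] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

# a 1024 MB file in 128 MB blocks over four nodes -> 8 blocks, 3 copies each
layout = place_blocks(1024, 128, ["node-a", "node-b", "node-c", "node-d"])
print(len(layout))  # 8
print(layout[0])    # ['node-a', 'node-b', 'node-c']
```

No single disk holds the whole file, and any one node can fail without losing a block — which is the point of replication rather than RAID at this scale.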

  • @kevinhayes6057 (2 years ago)

    "Big Data" is talked about everywhere now. Really great to hear an explanation of its fundamentals.

  • @AboveEmAllProduction (2 years ago)

    More like 10 years ago it was talked about a lot

  • @codycast (2 years ago)

    @@AboveEmAllProduction no, gender studies

  • @mokopa (2 years ago)

    "If you're using Windows, that's your own mistake" INSTANT LIKE + FAVORITE

  • @sandraviknander7898 (2 years ago)

    Freaky! I had this exact need for data locality on our cluster for the first time in my work this week.

  • @nandafprado (2 years ago)

    "If you are using Windows, that is your own mistake" ...well, that is the hard truth for data scientists lol

  • @shiolei (2 years ago)

    Awesome simple explanation and diagrams. Loved this breakdown!

  • @NoEgg4u (2 years ago)

    @3:23 "...the digital universe was estimated to be 44 zettabytes", and half of that is adult videos.

  • @Sharp931 (2 years ago)

    *doubt*

  • @Phroggster (2 years ago)

    @@Sharp931 You're right, it's probably more like two-thirds.

  • @G5rry (2 years ago)

    The other half is cats

  • @quanta8382 (2 years ago)

    Take a drink every time they say data for the ultimate experience

  • @seancooper8918 (2 years ago)

    We call this approach "Dealing With Big Drinking".

  • @gubbin909 (2 years ago)

    Would love to see some future videos on Apache Spark!

  • @recklessroges (2 years ago)

    Yes. There is so much more to talk about on this topic. I'd like to hear about Ceph and Tahoe-LAFS.

  • @isaactriguero3155 (2 years ago)

    I am working on it :-)

  • @thisisneeraj7133 (2 years ago)

    *Apache Hadoop enters the chat*

  • @albertosimeoni7215 (2 years ago)

    It would be better to spend some words on Apache Druid too

  • @nikhilPUD01 (2 years ago)

    In a few years: "Super big data."

  • @recklessroges (2 years ago)

    Probably not, as technology expands at a similar rate and the problem space doesn't change now that the cluster has replaced the previous "mainframe" (single computer) approach.

  • @Abrifq (2 years ago)

    hungolomghononoloughongous data structures :3

  • @leahshitindi8365 (1 year ago)

    We had a three-hour lecture with Isaac last month. It was very interesting

  • @lightspiritblix1423 (2 years ago)

    I'm actually studying these concepts at college; this video could not have come at a more convenient time!

  • @TheMagicToyChest (2 years ago)

    Stay focused and diligent, friend.

  • @evilsqirrel (2 years ago)

    As someone who works more on the practical side of this field, it really is a huge problem to solve. I work with data sets where we feed in multiple terabytes per day, and making sure the infrastructure stays healthy is a huge undertaking. It's cool to see it broken down in a digestible manner like this.

  • @BlueyMcPhluey (2 years ago)

    Thanks for this; understanding how to deal with big data is one elective I didn't have time for in my degree

  • @chsyank (2 years ago)

    Interesting video. I worked on and designed big data, building large databases for litigation, in the early 1980s... that was big at the time. Then a few years later, creating big data for shopping analysis. The key is that big data is big for the years that you are working on it, not afterwards, as storage and processing get bigger and faster. I think that while analysis and reporting are important (otherwise there is no value to the data), designing and building proper ingestion and storage is just as important. My two cents from over 30 years of building big data.

  • @GloriousSimplicity (2 years ago)

    The industry is moving away from having long-term storage on compute nodes. Since data storage needs grow at a different rate than compute needs, the trend is to have a storage cluster and a compute cluster. This means that applications start a bit slower, as the data must be transferred from the storage cluster to the compute cluster. However, it allows for more efficient spending on commodity hardware.

  • @Skrat2k (2 years ago)

    Big data - any data set that crashes Excel 😂

  • @godfather7339 (2 years ago)

    Nah, Excel crashes at like 1 million rows; that's not much actually...

  • @mathewsjoy8464 (2 years ago)

    @@godfather7339 Actually, it is

  • @godfather7339 (2 years ago)

    @@mathewsjoy8464 Trust me, it's not. It's not at all.

  • @mathewsjoy8464 (2 years ago)

    @@godfather7339 Well, you clearly don't know anything; the expert in the video even said we can't define how big or small data needs to be to be big data

  • @godfather7339 (2 years ago)

    @@mathewsjoy8464 I know what he defined, and I also know, PRACTICALLY, THAT 1 MILLION ROWS IS NOTHING.

  • @Pedritox0953 (7 months ago)

    Great video!

  • @lookinforanick (2 years ago)

    Never seen a Numberphile video with so much innuendo 🤣

  • @Veptis (2 years ago)

    At my university there is a master's programme in data science and artificial intelligence. It's something I might go into after finishing my bachelor's in computational linguistics. However, I do need to do additional maths courses, which I haven't looked into yet. Apparently the supercomputer at the university has the largest memory in all of Europe, which is 8 TB per node.

  • @laurendoe168 (2 years ago)

    I think the prefix after "yotta" should be "lotta" LOL

  • @jaffarbh (2 years ago)

    One handy trick is to reduce the number of "reductions" in a map-reduce task. In other words, more training, less validation. The downside is that this could mean the training converges more slowly.

  • @Georgesbarsukov (2 years ago)

    I prefer the strategy where I make everything super memory-efficient and then go do something else while it runs for a long time

  • @quintrankid8045 (2 years ago)

    How many people miss the days of Fortran overlays? Anyone?

  • @isaactriguero3155 (2 years ago)

    not me haha

  • @LupinoArts (2 years ago)

    "Big Data" Did you mean games by Paradox Interactive?

  • @mikejohnstonbob935 (2 years ago)

    Paradox created/published Crysis?

  • @rickysmyth (2 years ago)

    Have a drink every time he says DATE-AH

  • @myothersoul1953 (2 years ago)

    It's not the size of your data set that matters, nor how many computers you use or the statistics you apply; what matters is how useful the knowledge you extract is.

  • @sabriath (2 years ago)

    Well, you went over scaling up and scaling out, but you missed scaling in. A big file that you are scanning through doesn't need all of the memory to load the entire file; you can do it in chunks, methodically. If you take that process and scale it out with the cluster, then you end up with an automated way of manipulating data. Scale the allocation code across the RAID and you have automatic storage containment. Both together mean that you don't have to worry about scale in any direction; it's all managed in the background for you.
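The chunked scanning this comment describes can be sketched in a few lines of Python; `count_lines` and the chunk size are illustrative choices, not any particular library's API:

```python
import io

def count_lines(stream, chunk_size=1 << 16):
    """Scan a file-like object in fixed-size chunks, so peak memory use is
    bounded by chunk_size no matter how large the input file is."""
    count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            return count
        count += chunk.count(b"\n")

# simulate a large file in memory rather than on disk
big = io.BytesIO(b"row\n" * 1_000_000)
print(count_lines(big))  # 1000000
```

The same loop works unchanged on an open file of any size, which is the "scaling in" idea: the algorithm, not the hardware, bounds the memory footprint.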

  • @shards1627 (2 years ago)

    Scaling in typically sacrifices a bit of speed: using lower computing power will inherently make it take longer, and any sort of paging system (or other system that breaks data into chunks) has some overhead, though how much compared to a network for sharing data between computers I don't really know. They specifically said the goal today was to perform your operation quickly, so scaling in (while the most efficient in terms of power and hardware costs) is not going to work for that particular goal.

  • @bluegizmo1983 (2 years ago)

    How many more buzzwords are you gonna cram into this interview? Big Data ✔️, Artificial Intelligence ✔️, Machine Learning ✔️.

  • @glieb (2 years ago)

    VQGAN + CLIP image synthesis video in the works I hope?? and suggest

  • @guilherme5094 (2 years ago)

    Nice.

  • @Grimcookie_wow (2 years ago)

    I think we can all agree that when you have to start using Spark over pandas to process your datasets, and save them as partitions rather than pure CSVs, then it's big data

  • @kzuridabura8280 (2 years ago)

    Try Dask sometime

  • @joeeeee8738 (2 years ago)

    I have worked with Redshift and then with Snowflake. Snowflake solved the problems Redshift had by storing all the data efficiently in central storage instead of on each machine. The paradigm is actually backwards now, as storing is cheap (network is still the bottleneck)

  • @serversurfer6169 (2 years ago)

    I totally thought this video was gonna be about regulating Google and AWS… 🤓🤔😜

  • @advanceringnewholder (2 years ago)

    Based on what I watched up to 2:50, big data is the Tony Stark of data

  • @_..--- (2 years ago)

    44 zettabytes? Seems like the term big data doesn't do it justice anymore

  • @unl0ck998 (2 years ago)

    That Spanish accent *swoon*

  • @RizwanAli-jy9ub (2 years ago)

    We should store information and less data

  • @sagnikbhattacharya1202 (2 years ago)

    5:10 "If you're using Windows, that's your own mistake" truer words have never been spoken

  • @phunkym8 (2 years ago)

    The address of the visitation of Concepción, Zarzal

  • @klaesregis7487 (2 years ago)

    16 GB a lucky guy? That's like the bare minimum for a developer these days. I want 64 GB for my next upgrade in a year or so.

  • @yashsvidixit7169 (2 years ago)

    Didn't know Marc Márquez did big data as well

  • @kees-janhermans910 (2 years ago)

    'Scale out'? What happened to 'parallel processing'?

  • @malisa71 (2 years ago)

    Didn't the meaning change a few years ago? Parallel processing is when it is working on the same problem, or part of it, at the same time. Horizontal scaling is when you can add nodes that do not need to work on the same problem at the same time; only the results will be merged in the end. But the meaning is probably industry dependent.

  • @Thinkingfeed (2 years ago)

    Apache Spark rulez!

  • @Sprajt (2 years ago)

    Who buys more RAM when you can just download it? smh

  • @_BWKC (2 years ago)

    Softram logic XD

  • @KidNickles (2 years ago)

    Do a video on RAID storage! All this talk about big data and storage; I would love some videos on RAID 5 and parity drives!

  • @Goejii (2 years ago)

    44 ZB in total, so ~5 TB per person?

  • @busterdafydd3096 (2 years ago)

    Yeah. We will all interact with about 5 TB of data in our lifetime if you think about it deeply

  • @ornessarhithfaeron3576 (2 years ago)

    Me with a 4 TB HDD: 👁️👄👁️

  • @EmrysCorbin (2 years ago)

    Yeeeeeah, 15 of those TB are on this current PC and it still seems kinda limited.

  • @shards1627 (2 years ago)

    Yeah, I'd expect Facebook alone to have at least half of that per person, maybe more if you include Instagram and WhatsApp

  • @advanceringnewholder (2 years ago)

    Weather data is big data, isn't it?

  • @VACatholic (2 years ago)

    No, it's tiny. There isn't that much of it (most weather data is highly localized and incredibly recent)

  • @shards1627 (2 years ago)

    Not really, because each region usually processes their weather data locally, so they don't have all that much to work with

  • @khronos142 (2 years ago)

    "smart data"

  • @grainfrizz (2 years ago)

    Rust is great

  • @forthrightgambitia1032 (2 years ago)

    "Everyone is talking about big data" - was this video recorded 5 years ago?

  • @malisa71 (2 years ago)

    Why? If you work in this industry you will hear about it a few times a month

  • @forthrightgambitia1032 (2 years ago)

    @@malisa71 I haven't heard someone where I work say it unironically for years. Maybe you're stuck working in some snake-oil consultancy though.

  • @malisa71 (2 years ago)

    @@forthrightgambitia1032 This "consultancy" has been around for almost 100 years and is one of the top companies. I will gladly stay with them.

  • @forthrightgambitia1032 (2 years ago)

    @@malisa71 Defensive, much?

  • @malisa71 (2 years ago)

    @@forthrightgambitia1032 How is anything I wrote defensive?

  • @lucaspelegrino1 (2 years ago)

    I want to see some Kafka

  • @Ascania (2 years ago)

    Big Data is the concerted effort to prove "correlation does not imply causation" wrong.

  • @maschwab63 (2 years ago)

    If you need 200+ servers, just run it on an IBM z server as a plain-jane compute task all by itself.

  • @malisa71 (2 years ago)

    Did you look at the pricing of IBM z? My company is actively working on moving to LUW, and we are not small

  • @danceswithdirt7197 (2 years ago)

    So he's just building up to talking about data striping, right (I'm at 13:30 right now)? Is that it, or am I missing something crucial?

  • @G5rry (2 years ago)

    Commenting on a video part-way through to ask a question. Do you expect an answer faster than just watching the video to the end first?

  • @danceswithdirt7197 (2 years ago)

    @@G5rry No, I was predicting what the video was going to be about. I was mostly correct; I guess the two main concepts of the video were data striping and data locality.

  • @AboveEmAllProduction (2 years ago)

    Take a hit every time he says "right"

  • @dAntony1 (2 years ago)

    As an American, I can hear both his Spanish and UK accents when he speaks. Sometimes in the same sentence.

  • @leovalenzuela8368 (2 years ago)

    Haha, I was just going to post that! It is fascinating hearing him slip back and forth between his native and adopted accents.

  • @isaactriguero3155 (2 years ago)

    haha, this is very interesting! I don't think anyone here in the UK would hear my 'UK' accent haha

  • @treyquattro (2 years ago)

    So I'm screaming "Map-Reduce" (well, OK, internally screaming) and at the very end of the video we get there. What a tease!

  • @isaactriguero3155 (2 years ago)

    There is another video explaining MapReduce! And I am planning to do some live coding videos in Python
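Pending those videos, the MapReduce idea raised in this thread can be sketched as a single-machine word count in plain Python. This is a toy model: real frameworks run the map and reduce phases across many nodes and handle the shuffle over the network, and the function names here are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'clusters': 1}
```

Because each document is mapped independently and each word is reduced independently, both phases parallelise naturally across a cluster.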

  • @COPKALA (2 years ago)

    NICE: "if you use Windows, it's your own mistake" !!

  • @shemmo (2 years ago)

    He uses the word data so much that I only hear data 🤔🤣

  • @isaactriguero3155 (2 years ago)

    hahah, funny how repetitive one can become when doing this kind of video! hehe, sorry :-)

  • @shemmo (2 years ago)

    true true :) but I like his explanation

  • @austinskylines (2 years ago)

    ipfs

  • @drdca8263 (2 years ago)

    Do you use it? I think it's cool, but currently it competes a bit with my too-large number of tabs, and I don't get much use from actively running it, so I generally don't keep it running. I guess that's maybe just because I haven't put in the work to find a use that fits my use cases?

  • @DorthLous (2 years ago)

    "1 gig of data". Look at my job as a dev. Look at my entertainment as games on Steam and videos. Yeaaaahhh....

  • @yfs9035 (2 years ago)

    Where'd the British guy go? What did you do with him!!?? Who is this guy!!! Sorry, I haven't even watched the video yet.

  • @thekakan (2 years ago)

    Big data is data we don't know what we can do with _yet_ 😉 ~~lemme have my fun~~ 6:08 when can we buy Computerphile GPUs? 🥺

  • @lowpasslife (2 years ago)

    Cute accent

  • @jeffatturbofish (2 years ago)

    Here is my biggest problem with all of the definitions of 'big data' that require multiple computers: what if it only requires multiple computers because the person 'analyzing' it doesn't know how to deal with large data efficiently? Quality of data? I will just use SQL/SSIS to cleanse the data. I normally deal with data in the multiple-TB range on either my laptop [not a typical laptop - 64 GB of RAM] or my workstation [again, perhaps not a normal computer, with 7 hard drives, mostly SSD, 128 GB of RAM and a whole lot of cores] and can build an OLAP from the OLTP in minutes, then run more code doing some deeper analysis in a few minutes more. If it takes more than 30 minutes, I know that I screwed something up. If you have to run it on multiple servers, maybe you also messed something up. Python is great for the little stuff [less than 1 GB], and so is R, but for big data you need to work with something that can handle it. I have 'data scientist' friends with degrees from MIT who couldn't handle simple SQL and would freak out if they had more than a couple of MB of data to work with. Meanwhile, I would handle TB of data in less time with SQL, SSIS, OLAP, MDX. Yeah, those are the dreaded Microsoft words.

  • @albertosimeoni7215 (2 years ago)

    In an enterprise environment you have other problems to handle... availability, achieved with redundancy of VMs and disks over the network (which adds huge latency). SSIS is considered a toy in big enterprises; others use ODI or BODS (SAP), which are more robust. The natural evolution of SSIS, sold as "cloud" and "big data", is Azure Data Factory, but its cost is the highest of every competitor (you pay for every task you run rather than for the time the machine is on).

  • @DominicGiles (2 years ago)

    There's data.... That's it...

  • @JimLeonard (2 years ago)

    Nearly two million subscribers, but still can't afford a tripod.

  • @NeThZOR (2 years ago)

    420 views... I see what you did there

  • @llortaton2834 (2 years ago)

    He still misses dad to this day

  • @AxeJamie (2 years ago)

    I want to know what the largest data is...

  • @recklessroges (2 years ago)

    Depends on how you define the set. The LHC has one of the largest data bursts, but the entire Internet could be considered a single distributed cluster...

  • @quintrankid8045 (2 years ago)

    largest amount of data in bits = (number of atoms in the universe - number of atoms required to keep you alive) / number of atoms required to store and process each bit(*)

    (*) Assumes that all atoms are equally useful for storing and processing data and keeping you alive. Also assumes that all the data needs to persist. The number of atoms required to keep you alive may vary by individual and by requirements for food, entertainment and socialization. All calculations require integer results. Please consult with a quantum mechanic before proceeding.

  • @AudioPervert1 (2 years ago)

    Not everyone is talking about big data 😭😭😭😂😂😂 These big data dudes never speak of the pollution, contamination and carbon generated by their marvellous technology. Big data could do nothing about the pandemic, for example ...

  • @shards1627 (2 years ago)

    Big data totally could have been useful had anybody actually tried to standardize it and make it universal, instead of spreading the data across 27 trillion different apps that nobody can remember the name of. As for the pollution, they did mention in the video that maybe bigger isn't better.

  • @isaactriguero3155 (2 years ago)

    Well, I briefly mentioned the problem of sustainable big data, and I might be able to put together a video about this. You're right that not many people seem to care much about the amount of resources a big data solution may use! This is where we should be going in research: trying to develop cost-effective AI, which only uses big data technology when strictly needed, and when it is useful.

  • @syntaxerorr (2 years ago)

    DoN'T UsE WinDOws... Linux: Let me introduce the OOM killer.

  • @kevinbatdorf (2 years ago)

    What? Buying more memory is cheaper than buying more computers… which just means you're throwing more memory and CPU at it. I think you meant you solve it by writing a slower algorithm that uses less memory as the alternative. Also, buying more memory is often cheaper than the labor cost of refactoring, especially when it comes to distributed systems. Also, why the Windows hate? I don't use Windows but still cringed there a bit

  • @malisa71 (2 years ago)

    Time is money, and nobody wants to wait for results. The solution is to make fast and efficient programs with proper memory utilisation. Almost no serious institution is using Windows for such tasks; maybe on the client side, but not on a node or server.

  • @pdr. (2 years ago)

    This video felt more like marketing than education, sorry. Surely you just use whatever solution is appropriate for your problem, right? Get that hammer out of your hand before fixing the squeaky door.

  • @Random2 (2 years ago)

    Ehm... it is very weird that scale in/out and scale up/down are being discussed in terms of big data, when those concepts are completely independent and predate the concept of big data as a whole... Having watched the entire video, this might be one of the least well-delineated videos on the entire channel. It mixes up parts of different concepts as if they all came from big data, or were all related to big data, while failing to address the historical origins of big data and map/reduce. Definitely below average for Computerphile.

  • @yukisetsuna1325 (2 years ago)

    first

  • @darraghtate440 (2 years ago)

    The bards shall sing of this victory in the annals of time.

  • @vzr314 (2 years ago)

    No. Everyone is talking about COVID. And I listened to him until he mentioned COVID in the first few minutes; enough of the broken English anyway