Rob Mulla
1 day ago
67,898
1

Speed Up Your Pandas Dataframes

In this video Rob Mulla teaches how to make your pandas dataframes more efficient by casting dtypes correctly. This makes your code faster, reduces memory use, and takes up less space when saving to disk or a database.
Timeline:
00:00 Intro
00:47 Imports and Data Creation
02:32 Dataframe Memory Use
03:20 Baseline Speed Test
04:15 Casting Categorical
05:45 Downcasting Ints
07:07 Downcasting Floats
08:15 Casting Bool Types
09:15 Benchmark Comparison
11:08 Outro
Thanks for taking the time to watch this video. Follow me on Twitch for live coding streams: / medallionstallion_
Speed up Pandas Code: • Make Your Pandas Code ...
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #code #datascience #pandas
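
The workflow outlined in the timeline can be sketched roughly as follows. The column names, value ranges, and row count here are illustrative assumptions, not the exact data used in the video.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumed row count, chosen only for illustration

# Hypothetical dataset with default (wasteful) dtypes
df = pd.DataFrame({
    "team": rng.choice(["red", "blue", "yellow", "green"], n),  # strings
    "age": rng.integers(1, 50, n),                              # int64
    "prob": rng.uniform(0, 1, n),                               # float64
    "win": rng.choice(["yes", "no"], n),                        # strings
})
before = df.memory_usage(deep=True).sum()

# Cast each column to a tighter dtype
small = df.copy()
small["team"] = small["team"].astype("category")
small["age"] = small["age"].astype("int8")        # values 1..49 fit in int8
small["prob"] = small["prob"].astype("float32")   # accepts reduced precision
small["win"] = small["win"].map({"yes": True, "no": False})
after = small.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Passing `deep=True` matters for string columns, since otherwise only the pointers are counted and the savings look smaller than they are.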

Comments: 133

@anoopbhagat13 2 years ago
That's a brilliant way to save memory & computational cost. Thanks Rob ! it was very useful.👍
@robmulla
2 years ago
Exactly! Casting the correct column types is very important to speeding up your code.
@shreyaskulkarni5823 2 years ago
I really predict that this guy's channel is gonna grow a lot. The content is pure without any bs and straight to the point with actually new info
@robmulla
2 years ago
Thanks for the feedback. I hope you’re right.
@gilzeevi9263 1 year ago
Listen Rob, i came across your channel pretty randomly and your content is pure gold! straight to the point, and professionally presented! Thanks a lot! keep rocking
@robmulla 2 years ago
This note in the docs goes into detail about how categorical values only reduce memory use when the number of unique values is low: pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-memory
@oommggdude
6 months ago
What would you recommend is a good formula for determining when it should be categorical vs simple string? Unique values < 50% df length?
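
One way to approach this question is to measure rather than rely on a fixed threshold. A minimal sketch, using made-up data and a hypothetical helper `category_savings`, that compares a string column with its categorical version at low and high cardinality:

```python
import pandas as pd

def category_savings(s: pd.Series) -> float:
    """Return the categorical size as a fraction of the original size."""
    original = s.memory_usage(deep=True, index=False)
    categorical = s.astype("category").memory_usage(deep=True, index=False)
    return categorical / original

n = 10_000
low_card = pd.Series(["red", "blue", "green", "yellow"] * (n // 4))  # 4 unique values
high_card = pd.Series([f"id_{i}" for i in range(n)])                  # every value unique

print(f"low cardinality:  {category_savings(low_card):.2f}")
print(f"high cardinality: {category_savings(high_card):.2f}")
```

A ratio well below 1 suggests the cast is worthwhile; a ratio near or above 1 suggests leaving the column as strings.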
@Cmax15 1 year ago
Wow, this is one of the things that we rarely encounter in courses and yet the impact matters so much for overall efficiency. Thank you for making these types of videos. Wish this channel the best!
@robmulla
1 year ago
Thanks for the positive feedback. Totally agree that some of these specific details might not be covered in school but are great at speeding up your code and pipelines!
@dirk-jantoot1029
1 year ago
@@robmulla it depends very much on the amount of data. Essentially you are making a trade off between faster and more efficient code and developing time. If you work with relatively small amounts of metadata and quickly want to get something done, this might be too much of a hassle. But if your code goes to production and has to go through massive amounts of data then it's certainly worth it.
@robmulla
1 year ago
@@dirk-jantoot1029 Good point. There is always a tradeoff between time spent on implementation and code speed, but it's still best practice to properly set column dtypes.
@edmundgoldsberry 1 year ago
Dude, where were you when I was starting out... you would have saved me hours of struggle. Great content. Please keep it coming.
@robmulla
1 year ago
So glad I could help you out Edmund! Share the channel with anyone else you think might also find it helpful.
@FilippoGronchi 2 years ago
I love all the videos with these tricks that are critical in the daily developer activities! thanks so much
@robmulla
2 years ago
Thanks Filippo
@loctranp 2 years ago
It's just wow! Beginners like me really appreciate your video. Keep it up my man.
@robmulla
2 years ago
Glad it was helpful to you. Thanks for the feedback.
@shrekinahell 2 years ago
this is incredibly useful information and explained nicely! subbed
@robmulla
2 years ago
Thanks for the feedback
@clibonthegrind1105 1 year ago
Very nice videos mate! Thanks for sharing your knowledge with us
@robmulla
1 year ago
Glad you like them!
@mohamed0h0hamed 2 years ago
Thx Rob, really enjoyed this episode 👍🏼
@robmulla
2 years ago
Thanks!
@SimpleExcelVBA 1 year ago
I'm surprised how much info I learnt from this video, really good work!
@robmulla
1 year ago
Glad it was helpful! Check out my other videos and share it with some friends!
@SimpleExcelVBA
1 year ago
@@robmulla I will for sure! :)
@gigiosbar 10 months ago
That's brilliant! Thanks, Rob!
@rakeshkumarkuwar6053 2 years ago
Thanks Rob, for sharing such a great concept.
@robmulla
2 years ago
Thanks for watching Rakesh.
@TheRecordedLife 2 years ago
Fantastic Video, definitely using this in my daily work.
@robmulla
2 years ago
Glad you like it! I hope to continue to create more videos like this in the future.
@wilmermorales 2 years ago
this is life changing
@deepsajwan 1 year ago
Thanks Rob! for making this useful video
@robmulla
1 year ago
Glad you found it useful. Thanks for watching.
@beethovennine 1 year ago
Nice vid! Glad I ran into your channel...subscribed!
@robmulla
1 year ago
Thanks so much for the feedback! Hope you enjoy the other videos too.
@gatorpika 1 year ago
Great explanation. Just found out about that a while ago and wish I would have seen this video first instead of doing a bunch of googling. Was trying to get a 36M record dataset with categoricals and positional data to fit in memory on an average laptop for a mapping application. Recasting the datatypes made it all work out.
@robmulla
1 year ago
Glad you enjoyed the video! Casting dtypes correctly is really helpful but easy to overlook
@mahfujulalamanik7308 1 year ago
After watching 70% length of this video, i just stopped the video and mashed the like button. 🔥
@robmulla
1 year ago
What took you so long. 😂
@anpham7108 1 year ago
Thank you for making this video.
@robmulla
1 year ago
My pleasure! Thanks for watching.
@bongkem2723 1 year ago
awesome man, gonna save me $$$$ by not upgrading cpu but downsizing code !!!
@robmulla
1 year ago
Glad I could help! Efficient code is important!
@kapamagicman 2 months ago
Perfect!
@rajeshkakawat97 1 year ago
Thanks, it's useful. Will apply it in my project.
@robmulla
1 year ago
Glad to hear you found it useful Rajesh!
@ShiNguyenchu 4 months ago
helpful video !
@maxwellarnold570 1 year ago
I deal with stock data for my job a lot- dealing with data frames that have daily data for 3000 companies across 20+ years means dealing with 16+ million rows. These tips are incredibly helpful for saving memory- which for my role is often the limiting factor of pandas and my computer. Too much memory load can slow down groupby calls, your computer as a whole and all code, and even worse crash your computer which has happened to me.
@robmulla
1 year ago
Glad this was helpful for you. Sounds like you are working with a lot of data, must be fun!
@TedMan55
3 months ago
Working with 24 million rows of TMY weather data and I totally feel your pain
@brendenmorley2643 1 year ago
Thanks!
@robmulla
1 year ago
Appreciate the super thanks!!
@myself4024 4 months ago
🎯 Key Takeaways for quick navigation:
00:00 📊 *Efficient Memory Use in Pandas Introduction*
- Importance of efficient memory use in pandas for code speed, reduced memory consumption, and storage efficiency.
02:44 📉 *Initial Data Size and Considerations*
- Creating a large dataset (1 million rows) and checking its memory usage.
- Highlighting the impact of increasing data size on performance and memory requirements.
05:27 🧹 *Optimizing Categorical Columns*
- Demonstrating memory reduction by casting categorical columns (position and team).
- Significant reduction in dataset size by utilizing categorical data types.
06:44 🎲 *Downcasting Integer Columns*
- Explaining downcasting of integer columns to smaller types for memory optimization.
- Choosing appropriate integer types based on the data range to avoid information loss.
07:40 📉 *Downcasting Float Columns*
- Downcasting float columns to reduce memory usage while maintaining precision.
- Highlighting the impact of float type selection on data frame size.
09:02 🧹 *Optimizing Boolean Columns*
- Efficiently casting boolean columns for minimal memory usage.
- Using boolean type for binary data representation (win column).
10:23 🔄 *Performance Comparison*
- Comparing computation times before and after applying dtype optimizations.
- Demonstrating the overall improvement in code performance and memory efficiency.
Made with HARPA AI
@gangxaaku 2 years ago
awesome!!
@robmulla
2 years ago
Thanks Akshat
@obayram4615 1 year ago
Good explanation, thank you very much 👋👍🙂
@robmulla
1 year ago
Glad it was helpful! Thanks for watching.
@HAUPTSCHUELER99 1 year ago
Thanks
@robmulla
1 year ago
Welcome
@mehdismaeili3743 2 years ago
hi,thanks for this video.
@robmulla
2 years ago
Thanks for watching. Hope you found it helpful.
@skarevaara 1 year ago
Very nice video, thanks for the tips! If you do an update you could talk about unsigned integers also, like "uint8" for the Age data?
@robmulla
1 year ago
Great suggestion! Thanks for the feedback.
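
As a small illustration of the unsigned-integer suggestion: for a column that can never be negative, `uint8` covers 0 to 255 in a single byte, where `int8` stops at 127. The age values below are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 17, 42, 99, 120], dtype="int64")  # hypothetical ages

as_uint8 = ages.astype("uint8")                   # explicit cast
auto = pd.to_numeric(ages, downcast="unsigned")   # let pandas pick the smallest unsigned type

print(np.iinfo(np.int8).max, np.iinfo(np.uint8).max)  # 127 255
print(as_uint8.dtype, auto.dtype)                      # uint8 uint8
```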
@vrbaac1641 1 year ago
very nice video ^^ just a question... will this help with the browser error "not enough memory" when doing EDA via Jupyter notebook? thanks ^^
@robmulla
1 year ago
Thanks for the comment. Reducing memory use should help if running out of memory is what causes that error.
@John5ive 1 month ago
Thanks. Stuff a newbie like me would not think about.
@jsp2518 2 years ago
Sorry, could you tell me how the table had 1000 rows the first time you made it? I ask because the code says size = size, but I don’t get where the 1000 comes from. Awesome vids btw!
@robmulla
2 years ago
Great question. I think someone else pointed this out. I think I might have edited out that part of the video but I did end up editing that function to take in the size = 1000 at some point. Check the gist I posted here: gist.github.com/RobMulla/f04b144bb766b692f9314e3782d724d3
@jeffgraham1389
10 months ago
3x49x4x2=1176. Glad this comment was in here. Drove me a little nuts not knowing where 1000 came from.
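
The `NameError` that several viewers report appears whenever `size` is used before it has been assigned. Below is a minimal sketch of a data-generating function that takes `size` as an explicit argument. The function name `make_dataset` and its columns are assumptions for illustration; the gist linked above holds the author's actual code.

```python
import numpy as np
import pandas as pd

def make_dataset(size: int, seed: int = 0) -> pd.DataFrame:
    """Build a random DataFrame with `size` rows (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "size": rng.choice(["big", "medium", "small"], size),
        "age": rng.integers(1, 50, size),
        "team": rng.choice(["red", "blue", "yellow", "green"], size),
        "win": rng.choice(["yes", "no"], size),
        "prob": rng.uniform(0, 1, size),
    })

df = make_dataset(size=1_000)   # `size` is bound by the call, so no NameError
print(df.shape)                 # (1000, 5)
```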
@bhavinmoriya9216 2 years ago
Awesome! I saw in a video by Matt Harrison that there is a library which sort of tells you which dtypes need to be converted to save memory. Unfortunately, I do not remember the exact video. Are you aware of any such library?
@robmulla
2 years ago
Thanks! I don’t know of that library but let me know if you find it.
@AmexL 1 year ago
Why did you use the ‘map’ method over the ‘astype’ when changing the yes/no strings to bool? Thanks again for this vid.
@robmulla
1 year ago
I think I did it that way because you need to define how to convert the strings to a bool. Astype won’t know how to interpret “yes” and “no”; on plain Python strings it treats every non-empty string as True.
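
A short sketch of the difference, assuming the column holds plain Python strings (object dtype):

```python
import pandas as pd

win = pd.Series(["yes", "no", "yes"], dtype="object")

# astype(bool) uses truthiness: every non-empty string becomes True
naive = win.astype(bool)

# map with an explicit dictionary encodes the intended meaning
mapped = win.map({"yes": True, "no": False})

print(naive.tolist())    # [True, True, True]
print(mapped.tolist())   # [True, False, True]
```

Any value missing from the dictionary would become NaN under `map`, which is a useful signal that the column contains something unexpected.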
@cradleofrelaxation6473 6 months ago
This guy is a legend. He started by using “size” in the random function which he didn’t define. Please did you have the value of “size” in memory before you started recording?
@Omer698
3 months ago
I was about to ask this same question
@rockwellshabani5180 2 years ago
Excellent video. Would converting the 'yes/no' to 1 or 0 save as much space as converting them to bool?
@robmulla
2 years ago
Thanks for watching. I believe any int will always take up more memory than a bool. That is because a bool only uses one bit. int8, int16, etc use 8, 16 bits. A bool is essentially an int1
@juliansteden2980
2 years ago
@@robmulla This is not completely true. In theory a bool needs only 1 bit (true/false, 0/1). In practice CPUs can't address anything smaller than a byte, so a bool usually needs 1 byte (8 bits) of memory, just like an int8. Nevertheless great video, thanks!
@robmulla
2 years ago
@@juliansteden2980 Thanks for clarifying! I stand corrected, that's interesting to know but totally makes sense.
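
The point about bool and int8 taking the same space can be checked directly. A quick sketch with an arbitrary length of 1,000 values:

```python
import numpy as np
import pandas as pd

n = 1_000
flags = pd.Series(np.ones(n, dtype=bool))
as_int8 = flags.astype("int8")
as_int64 = flags.astype("int64")

for name, s in [("bool", flags), ("int8", as_int8), ("int64", as_int64)]:
    print(name, s.memory_usage(index=False), "bytes")
# bool and int8 both use 1 byte per value; int64 uses 8
```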
@andyr8833 1 year ago
Hi Rob, very useful thank you, but how do you deal with the following situation: you have a large MongoDB collection that you want to use locally to develop functions and play with the data. If you import it as a pandas data frame, it is just too large for the PC to handle. What's the best practice in this case? Worth a video tutorial? Thank you
@robmulla
1 year ago
Thanks. Can you aggregate the data in some way before exploring it locally? You could also just get a really large ec2 instance to run it on :D - another option would be something like dask. I have a video about pandas alternatives you should check out.
@luismontero3416 1 year ago
👏👏👏👏👏
@robmulla
1 year ago
💪
@FabioRBelotto 1 year ago
Category columns are great, but it's important to set observed=True when doing a groupby.
@robmulla
1 year ago
Whoa! I didn’t know about that option. Need to try it next time. Actually would’ve been helpful on yesterday’s stream.
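
A small sketch of what `observed` changes when grouping by categorical columns. The data is made up, and the argument is passed explicitly because its default has differed between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "team": pd.Categorical(["red", "red", "blue"], categories=["red", "blue", "green"]),
    "size": pd.Categorical(["big", "small", "big"], categories=["big", "small"]),
    "score": [1, 2, 3],
})

# observed=False: one row per combination of categories, even unseen ones
all_combos = df.groupby(["team", "size"], observed=False)["score"].sum()

# observed=True: only the combinations that actually occur in the data
seen_only = df.groupby(["team", "size"], observed=True)["score"].sum()

print(len(all_combos), len(seen_only))  # 6 3
```

With many categorical keys the full product of categories can grow very large, which is where `observed=True` saves both time and memory.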
@user-ld5dn3fv4m 1 year ago
if you use the astype and round operators on float data, pandas needs to set the signature or leave float64 by default ?
@robmulla
1 year ago
Not sure I totally understand the question. But float precision depends on how precise you need the values to be.
@dariuszspiewak5624 1 year ago
I know there's a Python module that takes a dataframe and calculates what type transformations one could do on the columns to reduce the size of the frame (it's pretty neat). Can't remember the name of the package now, though... I watched a YT video on it just about yesterday.
@robmulla
1 year ago
Cool! Let me know if you find it. There is a function that I’ve used before from Kaggle that does it.
@mikele5355 1 year ago
Hey! I recently really enjoy watching your videos. Could you maybe create a video in which you explain how I can run my python scripts automatically online, so that I don't always have to do this manually and with a switched on computer? I am getting a little bit more advanced through your videos and I'd be super interested in this topic. Cheers! :)
@robmulla
1 year ago
Thanks for watching. That’s a great idea for a video. I think it would differ a lot how you would automate it depending on what you were running. Small program vs a really computational intense process.
@mikele5355
1 year ago
@@robmulla Sounds amazing! Thanks for appreciating the idea :)
@willTryAgainTmrw 2 years ago
Does saving it in parquet and then reading it back retain the dtypes? (Don't have a PC with pyarrow installed near me)
@robmulla
2 years ago
Great question. Yes it does!
@willTryAgainTmrw
2 years ago
@@robmulla Great. Another reason to use parquet!
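
A rough way to check this yourself, contrasted with CSV, which stores only text. The parquet part assumes a parquet engine such as pyarrow is installed and is skipped otherwise; the column names are made up.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "age": pd.Series([21, 35, 48], dtype="int8"),
    "prob": pd.Series([0.1, 0.5, 0.9], dtype="float32"),
    "win": pd.Series([True, False, True], dtype="bool"),
})

# CSV stores plain text, so the narrow numeric dtypes are lost on reload
csv_roundtrip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(csv_roundtrip.dtypes.astype(str).to_dict())

# Parquet stores the schema; this part needs pyarrow (or fastparquet)
try:
    buffer = io.BytesIO()
    df.to_parquet(buffer)
    buffer.seek(0)
    parquet_roundtrip = pd.read_parquet(buffer)
    print(parquet_roundtrip.dtypes.astype(str).to_dict())
except ImportError:
    parquet_roundtrip = None
    print("No parquet engine installed; skipping the parquet round trip")
```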
@pierrebernard142 1 year ago
Question : what is the time complexity of all those cast operations ?
@robmulla
1 year ago
That’s a great question. I don’t know exactly but i haven’t ever come across a time when the cast operation has been an issue.
@pierrebernard142
1 year ago
@@robmulla thx :)
@ChimeraGilbert 1 year ago
Can you use this astype method if the column contains missing data?
@robmulla
1 year ago
Good question. It depends. The standard NumPy int and bool dtypes can’t contain nulls. Floats can.
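
For columns with missing values, pandas also offers nullable extension dtypes such as `Int8` and `boolean`, which are not covered in this thread. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])          # float64 with a missing value

try:
    s.astype("int8")                      # NumPy ints cannot represent NaN
except ValueError as err:
    print("int8 failed:", type(err).__name__)

nullable_int = s.astype("Int8")           # capital "I": pandas nullable integer
nullable_bool = pd.Series([True, None, False]).astype("boolean")

print(nullable_int.dtype, nullable_int.isna().sum())    # Int8 1
print(nullable_bool.dtype, nullable_bool.isna().sum())  # boolean 1
```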
@mcdolla7965 1 year ago
Sir, can you please explain how we can convert a string to a category without using any built-in function?
@robmulla
1 year ago
I'm not sure what you mean. You can set the dtype to 'category' using .astype('category') read more about it here: pandas.pydata.org/docs/user_guide/categorical.html
@mcdolla7965
1 year ago
@@robmulla Thanks for the reply, sir, but let me clarify my question: without using astype or any built-in method, how do you convert the dtype of any column to categorical?
@robmulla
1 year ago
@@mcdolla7965 not sure that’s possible.
@mcdolla7965
1 year ago
@@robmulla It is possible, sir. Inside, the Categorical class is called and two lists are returned, but the mechanics were too complex for me to understand. I'm sure you can help me out. Please send me your email ID so that I can send you the GitHub link, and then you can go through it. That would be great content for your channel too, because it's not available anywhere.
@Pedro_Israel 1 year ago
Isn't there a library or function to do this? Or at least some of the steps?
@Pedro_Israel
1 year ago
For example, I made this code which could help anyone. But I bet there are even better options out there:
# 6) Change DTYPES
# If column dtype == float & has no values after the decimal point = change dtype to int:
for col in data1.columns:
    if data1[col].dtype == 'float64':
        if data1[col].astype(int).equals(data1[col]):
            data1[col] = data1[col].astype(int)
        # Else: try to reduce float to float 8,16,32,64:
        else:
            if data1[col].min() >= -128 and data1[col].max() = -32768 and data1[col].max() = -2147483648 and data1[col].max() = -128 and data1[col].max() = -32768 and data1[col].max() = -2147483648 and data1[col].max()
@robmulla
1 year ago
I haven't used it myself but you could also look into this: pypi.org/project/pandas-dtype-efficiency/ Understanding this concept is important beyond just pandas dataframes, but I agree it could be somewhat automated.
@robmulla
1 year ago
This function is good (I've used it on Kaggle before) but you just need to be sure you are OK with casting the datatypes automatically. For instance, if you expect to have new data added that could be larger or more precise, or if you would not like to automatically cast as categorical, then you might not want to do it automatically.
@Pedro_Israel
1 year ago
@@robmulla Sorry for my late response. Thank you for the answer! Yes, that package automates a big part of the process. I would add some functionality to make it even more customizable, but it's great for anyone who wants to check it out. I also agree with you that it's important to understand the concept. I needed an automation because after your video I use this frequently on large datasets.
@Pedro_Israel
1 year ago
@@robmulla Yes, you are right. I usually work with past data that will not be updated in the future so the function comes in handy. But of course, if new data will be added one should cast dtypes carefully.
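
The idea in this thread, picking the smallest dtype that still holds a column's values, can be sketched with `pd.to_numeric` and its `downcast` argument. The function name `downcast_numeric` is hypothetical, and this is neither the code from the comment above nor the linked package. As noted in the thread, automatic casting only suits data whose range will not grow later.

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that fits the observed range
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 at the smallest; this may reduce precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

data = pd.DataFrame({
    "small_int": [1, 2, 3],
    "large_int": [1_000, 2_000, 70_000],
    "ratio": [0.5, 1.25, 2.0],
    "label": ["a", "b", "c"],      # non-numeric columns are left untouched
})
print(downcast_numeric(data).dtypes)
```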
@adelinorafailov233 1 year ago
NameError: name 'size' is not defined. This is what I get from the beginning. Why?
@robmulla
1 year ago
Are you sure you wrote the code correctly? It looks like you might have not been running size as a method and instead python thinks it's a variable.
@muhammadfadliaktsar7172
11 months ago
@@robmulla I got the same problem too. I have been following your instructions correctly and still get that error.
@muhammadfadliaktsar7172
11 months ago
And the error, as read from the Kaggle notebook, looks like a variable that has not been declared.
@bernard-ng 1 year ago
amazing 38 mb to 7 mb 🤩
@robmulla
1 year ago
Yes 😁 its crazy how much space can be saved!
@beda9beda 1 year ago
Easy way to optimize run time
@doullagdz9479 2 years ago
is 10_000 equivalent to 10000?
@robmulla
2 years ago
Yes! It was added in python 3.6 peps.python.org/pep-0515/
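
A two-line check of the underscore syntax from PEP 515:

```python
# Underscores in numeric literals are purely visual separators
assert 10_000 == 10000
assert 1_000_000 == 10 ** 6
print(10_000)   # 10000
```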
@yusufcan1304 1 month ago
this was terrific
@Hy60K 1 year ago
Team, not "time"! where is size variable init ?
@robmulla
1 year ago
Not sure I get what you mean. Did I misspeak in the video?
@nonetype66
1 month ago
Huh!?
@Omer698 3 months ago
"NameError: name 'size' is not defined"
@ErikS- 1 year ago
If you really want to use Python and run into big bottlenecks, first try moving all things to numpy only...
@robmulla
1 year ago
That’s not always so easy but I agree in some cases it is necessary indeed.
@debunkthis 1 year ago
Error in the thumb nail pd.DataFrame
@robmulla
1 year ago
Nice catch!
@danielandarge6652 24 days ago
Perfect !