This INCREDIBLE trick will speed up your data processes.

In this video we discuss the best ways to save data to files using Python and pandas. When you are working with large datasets, there comes a time when you need to store your data. Most people turn to CSV files because they are easy to share and universally supported. But there are much better options out there! Watch as Rob Mulla, Kaggle grandmaster, discusses some alternative ways of saving data files: pickle, parquet, and feather. He runs some benchmarks to show that you can save time and space, and keep the important metadata about your files in the process!
Timeline
00:00 Intro
00:49 Creating our Data
02:08 CSVs
04:39 Setting dtypes for CSVs
06:15 Pickle Files
07:16 Parquet ❤️
09:07 Feather
10:31 Other Options
11:02 Benchmarking
12:19 Takeaways
12:43 Outro
Code Gist: gist.github.com/RobMulla/7384...
Follow me on twitch for live coding streams: / medallionstallion_
Other Videos:
Speed up Pandas: • Make Your Pandas Code ...
Efficient Pandas Dataframes: • Speed Up Your Pandas D...
Introduction to Pandas: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
Audio Data in Python: • Audio Data Processing ...
Image Data in Python: • Image Processing with ...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #code #datascience #pandas

Comments: 381

  • @miaandgingerthememebunnyme3397
    @miaandgingerthememebunnyme3397 · 2 years ago

    First post! That’s my husband! He knows about data…

  • @LuisRomaUSA

    @LuisRomaUSA

    2 years ago

    He knows a lot of good stuff about data 😁. He's the first non-introductory Python YouTuber I have found so far 🎉

  • @venvanman

    @venvanman

    2 years ago

    aww this is cute

  • @sketch1625

    @sketch1625

    2 years ago

    Guess he's really in a "pickle" now.

  • @foobarAlgorithm

    @foobarAlgorithm

    2 years ago

    Awww now you guys need a The DataCouple channel if you both do data science! Love your content

  • @Arpan_Gupta

    @Arpan_Gupta

    1 year ago

    Nice work Mr. ROB

  • @DainiusKirsnauskas
    @DainiusKirsnauskas · 1 month ago

    Man, I thought this video was clickbait, but it was awesome. Thank you!

  • @lashlarue7924
    @lashlarue7924 · 1 year ago

    You are my new favorite YouTuber, Sir. I'm learning more from you than anyone else, by a country mile!

  • @mschuer100
    @mschuer100 · 1 year ago

    As always, awesome video...a real eye opener on most efficient file formats. I have only used pickle as compression, but will now investigate feather and parquet. Thanks for putting this together for all of us.

  • @robmulla

    @robmulla

    1 year ago

    Glad it was helpful! I use parquet all the time now and will never go back.

  • @bendirval3612
    @bendirval3612 · 1 year ago

    A major design objective of feather is that it can be read by R. If you are doing pandas-type data science stuff, this is a significant advantage.

  • @robmulla

    @robmulla

    1 year ago

    Great point. The R package called "arrow" can read in both parquet and feather files.
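
    A minimal illustration of that interop (file names here are hypothetical): write feather and parquet from pandas, then read the same files from R's arrow package.

        import pandas as pd

        df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
        df.to_feather("shared.feather")   # Arrow-based feather format; requires pyarrow
        df.to_parquet("shared.parquet")
        # From R: arrow::read_feather("shared.feather") or arrow::read_parquet("shared.parquet")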

  • @holgerbirne1845
    @holgerbirne1845 · 2 years ago

    Very good video :). One note: pickle files can be compressed. If you compress them, they become much smaller, but reading and writing becomes slower. Overall, parquet and feather are still much better.
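
    As a minimal sketch of that note (file names hypothetical), pandas can compress pickles itself; the codec is inferred from the extension or passed explicitly:

        import pandas as pd

        df = pd.DataFrame({"a": range(1_000_000)})
        df.to_pickle("data.pkl")                       # uncompressed
        df.to_pickle("data.pkl.gz")                    # gzip inferred from the extension
        df.to_pickle("data.pkl.xz", compression="xz")  # explicit choice; smaller but slower
        df2 = pd.read_pickle("data.pkl.gz")            # compression also inferred on read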

  • @robmulla

    @robmulla

    2 years ago

    Good point! There are many ways to save/compress that I probably didn't cover. Thanks for watching the video.

  • @Banefane
    @Banefane · 4 months ago

    Very clear, very structured, and the details are intuitive to understand!

  • @gustavoadolfosanchezhurtad1412
    @gustavoadolfosanchezhurtad1412 · 1 year ago

    Very clear and insightful explanation, thanks Rob, keep it up!

  • @robmulla

    @robmulla

    1 year ago

    Thanks Gustavo. I’ll try my best.

  • @pablodelucchi353
    @pablodelucchi353 · 1 year ago

    Thanks Rob, awesome information! Learning a lot from your channel. Keep it up!

  • @robmulla

    @robmulla

    1 year ago

    Isn’t learning fun?! Thanks for watching.

  • @Jvinniec
    @Jvinniec · 1 year ago

    One really cool feature of .read_parquet() is that it passes additional parameters through to whichever backend you're using. For example, the filters parameter in pyarrow allows you to filter data at read time, potentially making it even faster: df = pd.read_parquet("myfile.parquet", filters=[('col_name', '
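
    The comment's code is cut off above; a minimal sketch of the filters syntax (the column name and threshold are made up) looks like:

        import pandas as pd

        # filters is passed through to the pyarrow backend;
        # each tuple is (column, operator, value)
        df = pd.read_parquet("myfile.parquet",
                             filters=[("col_name", ">", 100)])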

  • @robmulla

    @robmulla

    1 year ago

    Whoa. That is really cool. I didn't realize you could do that. I've used Athena, which allows you to query parquet files using standard SQL, and it's really nice.

  • @juanm555

    @juanm555

    1 year ago

    Athena is amazing when backed by parquet files. I've used it to easily read through 600M+ records that were stored in those parquets.

  • @incremental_failure

    @incremental_failure

    11 months ago

    That's the real use case for parquet. Feather doesn't have this.

  • @beethovennine
    @beethovennine · 1 year ago

    Rob, you did it again...keep'em coming, good job!

  • @robmulla

    @robmulla

    1 year ago

    Thanks!

  • @nancyzhang6790
    @nancyzhang6790 · 1 year ago

    I saw people mention feather on Kaggle sometimes, but had no clue what they were talking about. Finally, I got answers to many questions in my mind. Thank you!

  • @robmulla

    @robmulla

    1 year ago

    Yes. Feather and parquet formats are awesome for when you want to quickly read and write data to disk. Glad the video helped you learn!

  • @nascentnaga
    @nascentnaga · 3 months ago

    as someone moving into data science this is such a great explainer! thank you

  • @arielspalter7425
    @arielspalter7425 · 1 year ago

    Excellent tutorial Rob. Subscribed!

  • @robmulla

    @robmulla

    1 year ago

    Thanks so much for the feedback. Thanks for subscribing!

  • @anoopbhagat13
    @anoopbhagat13 · 2 years ago

    learnt something new today. Thank you Rob for this useful & informative video.

  • @robmulla

    @robmulla

    2 years ago

    Learn something new every day and before long you will be teaching others!

  • @KirowOnet
    @KirowOnet · 1 year ago

    This was the first video from the channel that randomly appeared in my feed. I clicked, I watched - I liked and subscribed :D. This video planted a seed in my mind, and some others inspired me to try things out. So a few days later I had a playground environment running in Docker. I'm not a data scientist, but the tips and tricks from your videos could be useful for any developer. I used to write code to check some datasets, but with pandas and a Jupyter notebook it's way faster. Thank you for sharing your experience!

  • @robmulla

    @robmulla

    1 year ago

    Wow, I really appreciate this feedback. Glad you found it helpful and got some code working yourself. Share with friends and keep an eye out for new videos dropping soon!

  • @rrestituti
    @rrestituti · 1 year ago

    Amazing! Got one new member. Thanks, Rob! 😉

  • @robmulla

    @robmulla

    1 year ago

    Glad you liked it. Thanks for commenting!

  • @FilippoGronchi
    @FilippoGronchi · 2 years ago

    Excellent as usual Rob...very very useful indeed

  • @robmulla

    @robmulla

    2 years ago

    Thank you sir!

  • @walterpark8824
    @walterpark8824 · 1 year ago

    Exactly what I needed to know, and to the point. Thanks. As Einstein said, 'Everything should be as simple as possible, and no simpler!'

  • @robmulla

    @robmulla

    1 year ago

    That’s a great quote. Glad you found this helpful.

  • @user-qe7uw4ry7q
    @user-qe7uw4ry7q · 7 months ago

    Hi Rob. I'm from Argentina, you are the best!!!

  • @MrWyYu
    @MrWyYu · 2 years ago

    Great summary of data types. Thanks

  • @robmulla

    @robmulla

    2 years ago

    Thanks for the feedback! Glad you found it helpful.

  • @javiercmh
    @javiercmh · 1 year ago

    Very engaging and clear. Thanks!

  • @robmulla

    @robmulla

    1 year ago

    Thanks for watching. 🙌

  • @marcosoliveira8731
    @marcosoliveira8731 · 1 year ago

    I've learned a great deal with this video. Thank you!

  • @robmulla

    @robmulla

    1 year ago

    Thanks so much for the feedback. Glad you learned from it!

  • @casey7411
    @casey7411 · 1 year ago

    Very informative video! Subscribed :)

  • @robmulla

    @robmulla

    1 year ago

    Glad it helped! 🙏

  • @rafaelnegreiros_analyst
    @rafaelnegreiros_analyst · 1 year ago

    Amazing. Congrats for the video

  • @robmulla

    @robmulla

    1 year ago

    Glad you like the video. Thanks for watching.

  • @olucasharp
    @olucasharp · 1 year ago

    Huge thanks for sharing 🍀

  • @robmulla

    @robmulla

    1 year ago

    Glad you liked it! Thanks for the comment.

  • @safsaf2k
    @safsaf2k · 1 year ago

    This is excellent, thank you man

  • @robmulla

    @robmulla

    1 year ago

    Glad it helped!

  • @JohnMitchellCalif
    @JohnMitchellCalif · 1 year ago

    super clear and useful! Subscribed

  • @robmulla

    @robmulla

    1 year ago

    Awesome, thank you!

  • @arpanpatel9191
    @arpanpatel9191 · 1 year ago

    Great video!! Small things matter the most. Thanks

  • @robmulla

    @robmulla

    1 year ago

    Absolutely! Thanks.

  • @user-hy1lm2rd9q
    @user-hy1lm2rd9q · 10 months ago

    really good video! thank you

  • @humbertoluzoliveira
    @humbertoluzoliveira · 1 year ago

    Hey Guy, nice job. Congratulations! Thanks for the video.

  • @robmulla

    @robmulla

    1 year ago

    Thanks for watching Humberto.

  • @reasonableguy6706
    @reasonableguy6706 · 1 year ago

    Rob, you're a natural communicator (or you worked really hard at acquiring that skill) - most effective. I follow you on Twitch and I'm currently going through your YouTube content to come up to speed. Thanks for sharing your time and experience. Have you thought about aggregating your content into a book as a companion - something like "Data Analysis Using Python/Pandas - No BS, Just Good Stuff"?

  • @robmulla

    @robmulla

    1 year ago

    Hey. Thanks for the kind words. I’ve never considered myself a naturally good communicator and it’s a skill I’m still working on, but I appreciate your positive feedback. The book idea is great, maybe sometime in the future….

  • @cristianmendozamaldonado3241
    @cristianmendozamaldonado3241 · 1 year ago

    I really love it man, thank you. You saved a life

  • @robmulla

    @robmulla

    1 year ago

    Thanks! Maybe not saved a life, but saved a few minutes of compute time!

  • @baharehbehrooziasl9517
    @baharehbehrooziasl9517 · 9 months ago

    Great! Thank you for this very helpful video.

  • @robmulla

    @robmulla

    9 months ago

    Glad it was helpful!

  • @SergioBerlottoJr
    @SergioBerlottoJr · 2 years ago

    Awesome information! Thank you for this.

  • @robmulla

    @robmulla

    2 years ago

    Glad you liked it!

  • @Patrick-hl1wp
    @Patrick-hl1wp · 1 year ago

    super awesome tricks, thank you

  • @robmulla

    @robmulla

    1 year ago

    Glad you like them! Thanks for watching.

  • @truthgaming2296
    @truthgaming2296 · 5 months ago

    thanks rob, it helps a lot for a beginner like me to realize there are weaknesses in the csv format 😉

  • @danilzubarev2952
    @danilzubarev2952 · 4 months ago

    Lol this video changed my life :D Thank you so much.

  • @CalSticks
    @CalSticks · 1 year ago

    Really useful video - thanks. I was just searching for some Pandas videos for some light upskilling on the weekend, so this was a great find.

  • @robmulla

    @robmulla

    1 year ago

    Glad I could help! Check out my other videos on pandas too if you liked this one.

  • @JoeMcMullin
    @JoeMcMullin · 2 months ago

    Great video and content.

  • @vigneshwarselva9276
    @vigneshwarselva9276 · 1 year ago

    Was very useful, thanks much

  • @robmulla

    @robmulla

    1 year ago

    Thanks! Glad you learned something new.

  • @yogiananta9674
    @yogiananta9674 · 1 year ago

    awesome! thank you for this tutorial

  • @robmulla

    @robmulla

    1 year ago

    You're very welcome! Share with a friend.

  • @hugoy1184
    @hugoy1184 · 1 year ago

    Thank u very much for sharing such useful skills! 😉Subscribed!

  • @robmulla

    @robmulla

    1 year ago

    Anytime! Glad you liked it.

  • @ChrisHalden007
    @ChrisHalden007 · 1 year ago

    Great video. Thanks

  • @robmulla

    @robmulla

    1 year ago

    You are welcome!

  • @predstavitel
    @predstavitel · 1 year ago

    It's useful for me, thanks a lot!

  • @robmulla

    @robmulla

    1 year ago

    Happy to hear that!

  • @niflungv1098
    @niflungv1098 · 1 year ago

    This is good to know. I'm going into web development now, so I usually use JSON format for serialization... I'm still new to python so I didn't know about parquet and feather. Thank you!

  • @robmulla

    @robmulla

    1 year ago

    Glad you found it helpful. Share it with anyone else you think would benefit!

  • @69k_gold
    @69k_gold · 3 months ago

    I looked this up, and it's a pretty cool format. I kind of guessed that it could be a column-based storage strategy when you said that we can efficiently get only select columns, and after I looked it up and found it to be true, it felt very exciting. Anyway, hats off to Google's engineers for thinking outside the box on this; the number of things we can do just by storing data as column-lines rather than row-lines is huge. Of course, the trade-off is that it's very expensive to modify column-wise data, so this is more useful for static datasets that require multi-dim analysis.
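
    A small sketch of the column-pruning point (file and column names hypothetical): with parquet, pandas can read just the columns it needs rather than parsing every row:

        import pandas as pd

        # Only the listed columns are read from disk; a row-oriented CSV
        # would have to be parsed in full first.
        df = pd.read_parquet("dataset.parquet", columns=["user_id", "score"])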

  • @emjizone
    @emjizone · 6 months ago

    Useful. Thanks.

  • @danieleingredy6108
    @danieleingredy6108 · 1 year ago

    This blew my mind, duuude

  • @robmulla

    @robmulla

    1 year ago

    Happy to hear that! Share with others so their minds can be blown too!

  • @againstthegrain5914
    @againstthegrain5914 · 1 year ago

    Hey this was very useful to me thank you for sharing!!

  • @robmulla

    @robmulla

    1 year ago

    So glad you found it useful.

  • @pawarasiriwardhane3260
    @pawarasiriwardhane3260 · 1 year ago

    This content is really awesome

  • @robmulla

    @robmulla

    1 year ago

    Appreciate that!

  • @steven7639
    @steven7639 · 1 year ago

    Fantastic video

  • @robmulla

    @robmulla

    1 year ago

    Fantastic comment. 😎

  • @chrisogonas
    @chrisogonas · 1 year ago

    Great stuff! Thanks for sharing.

  • @robmulla

    @robmulla

    1 year ago

    Glad you enjoyed it!

  • @chrisogonas

    @chrisogonas

    1 year ago

    @@robmulla 👍

  • @EVL624
    @EVL624 · 1 year ago

    Very good and informative video

  • @robmulla

    @robmulla

    1 year ago

    So nice of you. Thanks for the feedback.

  • @melanp4698
    @melanp4698 · 1 year ago

    12:28 "When your data set gets very large." - Me working with 800GB json files: :) Good video regardless, I might give them a test sometime.

  • @robmulla

    @robmulla

    1 year ago

    Haha. It’s all relative. When your data can’t fit in local RAM you need to start using things like Spark.

  • @ozymet
    @ozymet · 1 year ago

    Very good stuff. The essence of information.

  • @robmulla

    @robmulla

    1 year ago

    Glad you liked it!

  • @ozymet

    @ozymet

    1 year ago

    @@robmulla I saw a few more videos, insta sub. Thank you. Glad to find you.

  • @krishnapullak
    @krishnapullak · 1 year ago

    Good tips on speeding up large file read and write

  • @robmulla

    @robmulla

    1 year ago

    Glad you liked it! Thanks for the feedback.

  • @user-ld5dn3fv4m
    @user-ld5dn3fv4m · 1 year ago

    Nice video. I'm going to rewrite my storage to parquet.

  • @robmulla

    @robmulla

    1 year ago

    You should! Parquet is awesome.

  • @DAN_1992
    @DAN_1992 · 1 year ago

    Thanks a lot, just brought down my database backup size to MBs.

  • @robmulla

    @robmulla

    1 year ago

    Glad it helped. That’s a huge improvement!

  • @Extremesarova
    @Extremesarova · 2 years ago

    Informative video! I've heard about feather and pickle, but never used them. I think I should give feather and parquet a try! I'd like to get some materials on machine learning and data science that are not introductory - something for middle and senior engineers :)

  • @robmulla

    @robmulla

    2 years ago

    Glad you found it useful. I’ll try to make some more ML videos in the near future.

  • @yosefasefaw4207
    @yosefasefaw4207 · 1 year ago

    thanks very helpful

  • @robmulla

    @robmulla

    1 year ago

    Glad it helped

  • @wonderland860
    @wonderland860 · 1 year ago

    This video greatly helped me. I didn't know so many ways to dump a DataFrame. I then did a further test, and found the compression option plays a big role:
        df.to_pickle(FILE_NAME, compression='xz')      -> 288M
        df.to_pickle(FILE_NAME, compression='bz2')     -> 322M
        df.to_pickle(FILE_NAME, compression='gzip')    -> 346M
        df.to_pickle(FILE_NAME, compression='zip')     -> 348M
        df.to_pickle(FILE_NAME, compression='infer')   -> 679M  # default compression
        df.to_parquet(FILE_NAME, compression='brotli') -> 334M
        df.to_parquet(FILE_NAME, compression='gzip')   -> 355M
        df.to_parquet(FILE_NAME, compression='snappy') -> 423M  # default compression
        df.to_feather(FILE_NAME)                       -> 500M

  • @robmulla

    @robmulla

    1 year ago

    Nice findings! Thanks for sharing. Funny that compressing parquet still works. I didn't know that.

  • @DeathorGloryNow

    @DeathorGloryNow

    1 year ago

    @@robmulla Actually if you check the docs, parquet files are snappy-compressed by default. You have to explicitly say `compression=None` to not compress. Snappy is the default because it adds very little time to read/write with modest compression and low CPU usage, while still maintaining the very nice columnar properties (as you showed in the video). It is also the default for Spark. Other compressions like gzip get it smaller but at a much more significant cost to speed. I'm not sure this is still the case, but in the past they also broke some of the nice properties because the entire object gets compressed.

  • @MarcBenkert001
    @MarcBenkert001 · 1 year ago

    Thanks, great comparison. One thing about parquet - it has some limitations on what characters column names can contain. I spent quite some time renaming column names a year ago; perhaps that has been fixed by now.

  • @robmulla

    @robmulla

    1 year ago

    Good point! I've noticed this too. Definitely a limitation that makes it sometimes unusable. Thanks for watching!

  • @gsm7490
    @gsm7490 · 1 month ago

    Parquet really saved me :) Around one year of data, each day is approx. 2GB (csv format). Parquet is both compact and fast, but I have to use filtering and load only the necessary columns "on demand".

  • @meme_eternity
    @meme_eternity · 1 year ago

    Great Video!!!!!!!!!!!

  • @robmulla

    @robmulla

    1 year ago

    Glad you enjoyed it

  • @Schmelon
    @Schmelon · 1 year ago

    interesting to learn about the existence of parquet and feather files. nothing beats csv for portability and ease of use

  • @robmulla

    @robmulla

    1 year ago

    Yea, for small/medium files CSV gets the job done.

  • @huuquannguyen6688
    @huuquannguyen6688 · 2 years ago

    I really hope you make a video about Data Cleaning in Python soon. Thanks a lot for all your awesome tutorials

  • @robmulla

    @robmulla

    2 years ago

    I'll try my best. Thanks for the feedback!

  • @crazymunna2397
    @crazymunna2397 · 1 year ago

    amazing info

  • @robmulla

    @robmulla

    1 year ago

    Thanks!

  • @hndr91
    @hndr91 · 1 year ago

    Thanks!

  • @robmulla

    @robmulla

    1 year ago

    Whoa. Thanks Aff. 🙏

  • @TzviKD
    @TzviKD · 1 year ago

    Thank you

  • @robmulla

    @robmulla

    1 year ago

    Anytime!

  • @Levince36
    @Levince36 · 1 year ago

    Thank you very much 😂, I learned something totally new from this.

  • @robmulla

    @robmulla

    1 year ago

    Happy to hear it.

  • @Zoltag00
    @Zoltag00 · 1 year ago

    Great video - it would have been good to at least mention the downsides of pickle and also the built-in compatibility with zip files. Haven't come across feather before, will try it out.

  • @robmulla

    @robmulla

    1 year ago

    Great point! I did forget to mention that pandas will auto-unzip. I still like parquet the best.

  • @Zoltag00

    @Zoltag00

    1 year ago

    @@robmulla - Agreed, parquet has some serious benefits. You know it also supports a compression option? Use it with gzip to see your parquet file get even smaller (and you only need to use it on write).

  • @MatthiasBussonnier
    @MatthiasBussonnier · 2 years ago

    On the first pass, when you timeit the csv writing, you time both the writing to csv and the generation of the dataset. So you are likely getting biased results, as you only time the writing for the other formats. (Sure, it does not change the final message, just wanted to point it out.) Also, with timeit you can use the -o flag to output the result to a variable, and this can help you, for example, make a plot of the times.
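
    For reference, a minimal sketch of the -o flag in an IPython/Jupyter cell (assumes a df is already in memory; the timed call is illustrative):

        # %timeit -o returns a TimeitResult object instead of just printing
        result = %timeit -o df.to_csv("test.csv")
        print(result.average, result.stdev)   # per-loop stats in seconds, handy for plots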

  • @robmulla

    @robmulla

    2 years ago

    Good point about timing the dataframe generation. It should be negligible but fair to note. Also great tip on using -o. I didn't know about that! It looks like from the docs it writes the entire stdout, so it would need to be parsed. ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit Still a handy tip. Thanks!

  • @mr_easy
    @mr_easy · 3 months ago

    great comparison. What about the HDF5 format? Is it in any way better?

  • @codewithvishal91
    @codewithvishal91 · 1 year ago

    Very nice bro

  • @robmulla

    @robmulla

    1 year ago

    Thanks. Hope you learned something!

  • @FranciscoPMatosJr
    @FranciscoPMatosJr · 1 year ago

    Try adding "brotli" compression when creating the file. The file size shrinks considerably and reads get a lot faster. Example:
        # to save a file:
        from pyarrow import csv, parquet
        parse_options = csv.ParseOptions(delimiter=delimiter)
        data_arrow = csv.read_csv(temp_file,
                                  parse_options=parse_options,
                                  read_options=csv.ReadOptions(autogenerate_column_names=autogenerate_column_names,
                                                               encoding=encoding))
        parquet.write_table(data_arrow, parquet_file + '.brotli', compression='BROTLI')
        # to read the file:
        pd.read_parquet(file, engine='pyarrow')

  • @robmulla

    @robmulla

    1 year ago

    Oh. Very cool I need to check that out.

  • @pele512
    @pele512 · 1 year ago

    Thanks for the great benchmark. In an R / Python hybrid environment I sometimes use `csv.gz` or `tsv.gz` to address the size issue with CSV but retain the ability to quickly pipe these through line-based processors. It would be interesting to see how gzipped flat files perform. I do agree that parquet/feather is the better way to go for many reasons; they are superior especially from the data engineering point of view.
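
    A minimal sketch of that gzipped-CSV round trip (file names hypothetical); pandas infers gzip from the .gz suffix on both write and read:

        import pandas as pd

        df = pd.DataFrame({"a": range(100_000)})
        df.to_csv("data.csv.gz", index=False)   # compressed on write
        back = pd.read_csv("data.csv.gz")       # decompressed transparently on read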

  • @robmulla

    @robmulla

    1 year ago

    I do the same with gzipped CSV files. Good idea about making a comparison. I’ll add it to the list of potential future videos.

  • @dist321
    @dist321 · 2 years ago

    Great videos! Thank you for posting them. I wonder if feather is faster to read a >2G file.tsv than csv in chunks.

  • @robmulla

    @robmulla

    2 years ago

    Thanks for watching Ondina! I think it would depend on the data types within the >2G file. I think the only difference between tsv and csv is a comma ',' vs tab '\t' separator between values. Hope that helps.

  • @vladvol855
    @vladvol855 · 1 year ago

    Hello! Very interesting! Thank you! Can you please tell me, is there any limitation on the number of columns when saving a DF to parquet? Excel allows around 16-17k columns! Thank you for the answer!

  • @abhisekrana903
    @abhisekrana903 · 2 months ago

    stumbled onto this awesome video and absolutely loved it. Just out of curiosity - what tool are you using for making Jupyter notebooks with themes, especially the dark theme?

  • @robmulla

    @robmulla

    2 months ago

    Glad you enjoyed the video. I have a different video that covers my jupyter setup including theme: kzread.info/dash/bejne/Z6SaksGboLHIm9o.html

  • @yoyokoko5153
    @yoyokoko5153 · 2 years ago

    great stuff

  • @robmulla

    @robmulla

    2 years ago

    Thank you sir!

  • @Alexander-ms2ct
    @Alexander-ms2ct · 1 year ago

    Thanks

  • @robmulla

    @robmulla

    1 year ago

    Welcome

  • @DevenMistry367
    @DevenMistry367 · 1 year ago

    Hey Rob, this was a really nice video! Can you please make a tutorial where you try to write this data to a database? Maybe sqlite or postgres? And explain bottlenecks? (Optional: with or without using an ORM).

  • @robmulla

    @robmulla

    1 year ago

    I was actually working on just this type of video and even looking at stuff like duckdb where you can write SQL on parquet files.
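
    A hedged sketch of that duckdb idea (file and column names hypothetical; uses the duckdb.sql API from recent versions):

        import duckdb

        # Run SQL directly against a parquet file on disk
        rel = duckdb.sql("SELECT col_name, COUNT(*) AS n FROM 'myfile.parquet' GROUP BY col_name")
        df = rel.df()   # materialize the result as a pandas DataFrame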

  • @jonathanhody3622
    @jonathanhody3622 · 1 year ago

    Thank you for the video! I've basically never heard of parquet or feather and don't really know what type of files those are. I assume it's not an easy format to share with stakeholders, for example. Is there a way to link those types of files to a database, or perhaps import them into a data visualization tool (such as PowerBI or Tableau)?

  • @robmulla

    @robmulla

    1 year ago

    Thanks for watching Jonathan. Glad you found the video useful. You are correct, these file formats are more common for storage within systems that read the data via code, not for sharing with stakeholders. CSV and Excel still dominate for that type of thing.

  • @getolvid5468
    @getolvid5468 · 9 months ago

    Great comparison, thanks. Not sure if the feather/pickle files I'm creating from a Julia script use any compression (none that I'm specifying out of the box), but it happens that the pickle files always end up about half the size of the feather ones. (Haven't compared those two to a parquet-made file.)

  • @sangrampattnaik744
    @sangrampattnaik744 · 9 months ago

    Very nice explanation. Can you compare Dask and PySpark ?

  • @coopernik
    @coopernik · 11 months ago

    I’m working on a little project and I have a csv file that’s 15GB. If I get what you’re telling me, I could turn it into a parquet file and save tons of memory space and time?

  • @manigowdas7781
    @manigowdas7781 · 1 year ago

    Just wow!!!!

  • @robmulla

    @robmulla

    1 year ago

    Thanks!

  • @lorenzowinchler1743
    @lorenzowinchler1743 · 1 year ago

    Nice video! Thank you. What about the HDF5 format? Thanks!

  • @robmulla

    @robmulla

    1 year ago

    Thanks! I haven’t used HDF5 much but I’d be interested to hear how it compares.

  • @riessm
    @riessm · 1 year ago

    In addition to everything, parquet is the native file format for Spark and fully supports Spark's lazy computing (Spark will only ever read the columns and rows that are needed for the desired output). If you ever prep really big data for Spark, parquet is the way to go.
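
    A minimal pyspark sketch of that lazy behavior (file and column names hypothetical):

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet("events.parquet")             # lazy; nothing read yet
        out = df.select("user_id", "amount").where(F.col("amount") > 0)
        out.show()   # only now are the needed columns/row groups read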

  • @robmulla

    @robmulla

    1 year ago

    That’s a great point. Same with polars!

  • @riessm

    @riessm

    1 year ago

    @@robmulla Need to have a closer look at polars then! 🙂

  • @nirbhay_raghav
    @nirbhay_raghav · 1 year ago

    Another awesome video. It has become my favorite channel. My only regret is that I found it too late. Small correction: it should be 0.3s and 0.08s for parquet files. You mistakenly wrote 0.3ms and 0.08ms while converting. Thanks.

  • @robmulla

    @robmulla

    1 year ago

    Appreciate that you are finding my videos helpful. Good catch on finding that typo!

  • @Jay-og6nj

    @Jay-og6nj

    1 year ago

    I was going to comment that, but decided to check first; at least someone caught that. Good video.

  • @scottybridwell
    @scottybridwell · 1 year ago

    Nice video. How do the performance and storage size of parquet and feather compare to hdf/pytables?

  • @robmulla

    @robmulla

    1 year ago

    Great question. I have no idea! I need to learn more about how they compare.

  • @leonjbr
    @leonjbr · 1 year ago

    Hi Rob! I love your channel. It is very helpful. I would like to ask you a question: is HDF5 any better than all the options you showed in the video?

  • @robmulla

    @robmulla

    1 year ago

    Good question. I didn't cover it because I think it's an older, lesser-used format.

  • @leonjbr

    @leonjbr

    1 year ago

    @@robmulla so the answer is no?

  • @robmulla

    @robmulla

    1 year ago

    @@leonjbr The answer is - I don't know but probably not. 😁

  • @leonjbr

    @leonjbr

    1 year ago

    @@robmulla ok thanks.

  • @CoolerQ

    @CoolerQ

    1 year ago

    I don't know about "better" but HDF5 is a very popular data format in science.

  • @rdubitsk
    @rdubitsk · 1 year ago

    Fantastic video as always. What are the disadvantages of json? I use json because it can easily be passed to the front end.

  • @robmulla

    @robmulla

    1 year ago

    Great question. I don't use json much. It isn't common for tabular/relational data and is more for unstructured web-based stuff, I believe. It's probably pretty slow for reading/writing large datasets, I'm guessing.

  • @alexanderf1795
    @alexanderf1795 · 1 year ago

    Cool. Would be nice to compare with storing data in an SQL database (Postgres for example).

  • @robmulla

    @robmulla

    1 year ago

    Great suggestion! This video only covers storing to flat files, but comparison of different relational databases is a great idea for a future video.

  • @sabagx
    @sabagx · 1 year ago

    keep uploading videos please!!

  • @robmulla

    @robmulla

    1 year ago

    Thanks Sbg! I'm planning on it!

  • @philtoa334
    @philtoa334 · 1 year ago

    Nice.

  • @robmulla

    @robmulla

    1 year ago

    Thanks! Glad you liked the video.

  • @paarthurnax4561
    @paarthurnax4561 · 1 year ago

    Parquet really is a revolution... the best data format: it drastically cuts storage size and drastically speeds up loading the data back later.

  • @robmulla

    @robmulla

    1 year ago

    I agree. Parquet is great!

  • @harryhack91
    @harryhack91 · 1 year ago

    Hey, that IDE / code editor is so cool! What is it? It looks similar to VS Code, but I don't know how to do those kinds of tricks.

  • @robmulla

    @robmulla

    1 year ago

    I have a whole video on this. It's jupyterlab with the solarized dark theme. Check out my jupyter tutorial for the full details!

    @YeWangRDFZ · 1 year ago
    @YeWangRDFZ Жыл бұрын

    I'm really interested in the comparison against HDF files. My guess is that it's gonna be the fastest to read, however it probably takes up more space.

  • @robmulla

    @robmulla

    Жыл бұрын

    I’m not sure. But I think feather files are pretty fast.

  • @YeWangRDFZ

    @YeWangRDFZ

    1 year ago

    @@robmulla Hey Rob, thanks for the reply. I had the impression that HDF maps the data as it sits in RAM, so there won't be much conversion once it's read into RAM, but I could be wrong. Also, it would be interesting to investigate how feather works. I'll do some benchmarking on my M1 Mac and maybe get back to you.