what is Apache Parquet file | Lec-7

In this video I talk about reading Parquet files in Spark. If you want to optimize your files and processing in Spark, you need a solid understanding of the Parquet file format. Please ask your questions in the comment section.
Directly connect with me on:- topmate.io/manish_kumar25
Download Parquet Data:- github.com/databricks/Spark-T...
Install parquet-tools on your local machine to run the commands below.
parquet-tools can be installed with pip.
Run the command below in cmd or a terminal:
pip install parquet-tools
Run the commands below inside Python:
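import pyarrow as pa
import pyarrow.parquet as pq
# Open one part file of the dataset (adjust the path to your download location).
parquet_file = pq.ParquetFile(r'C:\Users\nikita\Downloads\Spark-The-Definitive-Guide-master\data\flight-data\parquet\2010-summary.parquet\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet')
# In a script (not the interactive prompt), wrap each expression in print() to see output.
# File-level metadata: schema, number of rows, number of row groups.
parquet_file.metadata
# Metadata of the first row group.
parquet_file.metadata.row_group(0)
# Metadata of the first column chunk in that row group.
parquet_file.metadata.row_group(0).column(0)
# Statistics (min, max, null count) for that column chunk.
parquet_file.metadata.row_group(0).column(0).statistics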
Run the commands below in cmd or a terminal:
parquet-tools show C:\Users\manish\Downloads\Spark-The-Definitive-Guide-master\data\flight-data\parquet\2010-summary.parquet\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet
parquet-tools inspect (path of your file location as above)
parquet.apache.org/docs/file-...
For more queries, reach out to me on the social media handles below.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...

Comments: 97

  • @manish_kumar_1
    @manish_kumar_1 5 months ago

    I said 500 GB in the video by mistake. It should be 500 MB; dividing 500 by 128 and rounding up gives 4 partitions.

  • @SrihariSrinivasDhanakshirur

    @SrihariSrinivasDhanakshirur

    5 months ago

    I just saw this video and boom u mentioned the same in your comment section

  • @Shubhamkumar-cq5wt
    @Shubhamkumar-cq5wt 10 months ago

    Literally the best and most detailed video on parquet file format on yt. Thank you!

  • @user-qn6ud4hs3b
    @user-qn6ud4hs3b 2 months ago

    Never saw such a detailed video for parquet file, these videos are really valuable. Really appreciate the efforts put in making these videos

  • @sahillohiya7658
    @sahillohiya7658 9 months ago

    I love how in-depth you are going, please keep doing it! We are loving it.

  • @krishnasahoo6172
    @krishnasahoo6172 10 months ago

    Wow, so much clarity, I really enjoyed it. I didn't even notice when the video ended! Excellent explanation.

  • @akashprabhakar6353
    @akashprabhakar6353 2 months ago

    Predicate pushdown - Rows filtering, Projection Pruning/Pushdown - Column filtering. Thanks for the session bro!!
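
The distinction in this comment can be illustrated with a toy model in plain Python, using an age > 18 filter as the predicate. This is only a sketch: `RowGroup` and `scan` are hypothetical names, not part of Spark or pyarrow, and the min/max values are computed on the fly here, whereas a real Parquet file stores them in its metadata at write time.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, for illustration)."""
    columns: dict  # column name -> list of values

    def stats(self, column):
        # A real Parquet file stores min/max in metadata; here we compute them.
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups, wanted_columns, filter_column, lower_bound):
    """Return rows where filter_column > lower_bound, keeping only wanted_columns."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, col_max = group.stats(filter_column)
        # Predicate pushdown: skip the whole row group if no value can match.
        if col_max <= lower_bound:
            continue
        groups_read += 1
        for i, value in enumerate(group.columns[filter_column]):
            if value > lower_bound:
                # Projection pruning: materialise only the requested columns.
                rows.append({name: group.columns[name][i] for name in wanted_columns})
    return rows, groups_read


row_groups = [
    RowGroup({"name": ["a", "b"], "age": [5, 12], "city": ["x", "y"]}),
    RowGroup({"name": ["c", "d"], "age": [17, 30], "city": ["z", "x"]}),
    RowGroup({"name": ["e", "f"], "age": [40, 65], "city": ["y", "z"]}),
]
rows, groups_read = scan(row_groups, ["name"], "age", 18)
print(rows)         # [{'name': 'd'}, {'name': 'e'}, {'name': 'f'}]
print(groups_read)  # 2
```

The first row group is never read because its maximum age (12) cannot satisfy the filter, and the `city` column is never copied into the result because it was not requested.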

  • @dishant_22
    @dishant_22 10 months ago

    This is the best explanation for parquet file format available online. Thanks Manish.

  • @shubhamwaingade4144
    @shubhamwaingade4144 5 months ago

    The best explanation!!! Your videos are giving me motivation and inspiration to keep learning spark!

  • @shivakrishna1743
    @shivakrishna1743 1 year ago

    Very detailed awesome video!! Thanks

  • @rahulgupta-po4ki
    @rahulgupta-po4ki 10 months ago

    highly informative and detailed video on parquet. Thanks a lot Manish!

  • @vaibhavmore7936
    @vaibhavmore7936 1 year ago

    Thanks for this Manish! Great Work!

  • @ankitachauhan6084
    @ankitachauhan6084 1 month ago

    the best explanation ! you are a wonderful teacher

  • @bidyasagarpradhan2751
    @bidyasagarpradhan2751 6 months ago

    Someone asked me in an interview about the internals of the Parquet file format and I couldn't answer. Then I found your video. Now I can explain it easily. Best video on the Parquet file format.

  • @neelshah8247
    @neelshah8247 4 months ago

    Excellent video. Thank you :)

  • @asifquasmi4538
    @asifquasmi4538 5 months ago

    Hats off, Manish. Please keep doing the good work :)

  • @ArunNair-z3m
    @ArunNair-z3m 1 month ago

    Hi Manish, thanks for such smooth explanation of not just information related to parquet but also things related to it, kudos to your efforts :D

  • @alokkumarmohanty8454
    @alokkumarmohanty8454 1 year ago

    Hi Manish, the Parquet file class was a classic example of how to present something. If the Avro and ORC file formats could be covered in the same way, it would be really helpful. Interviewers are asking about those as well nowadays.

  • @lucky_raiser
    @lucky_raiser 1 year ago

    Brother, I really enjoyed it, thanks bro

  • @susanthomas223
    @susanthomas223 2 months ago

    Thank you so much for putting in so much time for making this video

  • @afjalahamad2465
    @afjalahamad2465 4 months ago

    really awesome explanation

  • @deeksha6514
    @deeksha6514 4 months ago

    Thanks for this masterpiece!

  • @ApoorvaShinde-on4ep
    @ApoorvaShinde-on4ep 1 month ago

    This is by far the best video; I gained in-depth knowledge of Parquet and it was very easy to understand. Thank you so much for sharing your knowledge! Could you please share the video on Parquet optimization?

  • @dollykushwah6352
    @dollykushwah6352 11 months ago

    Hello Manish, excellent explanation, hats off to you. When will you release the optimization video on Parquet? Eagerly waiting for it.

  • @debopower2009
    @debopower2009 11 months ago

    Very nice.

  • @harshtalwar9615
    @harshtalwar9615 4 days ago

    Superb bro … very helpful thanks 🙏🏻

  • @ashutoshkumarsingh3337
    @ashutoshkumarsingh3337 1 year ago

    what a gem you are

  • @pramod3469
    @pramod3469 1 year ago

    Thanks Manish

  • @user-mf6cx8xx5d
    @user-mf6cx8xx5d 1 year ago

    Thanks Manish 🙂

  • @lakkilakki772
    @lakkilakki772 10 months ago

    Hi Manish, great explanation of Parquet. I'm using Parquet but didn't know about these features that make things fast. How were you able to learn all this? Please suggest any documentation/resources to get a deep understanding like this. You made my day. Thank you 😊

  • @pankajjagdale2005
    @pankajjagdale2005 10 months ago

    informative Thanks

  • @user-lp3qe9jj3m
    @user-lp3qe9jj3m 11 months ago

    Please make a video on avro file format in detail because I faced challenges when interviewers asked about avro file format questions

  • @wellwisher7333
    @wellwisher7333 1 year ago

    Thanks Sir

  • @AyushMandloi
    @AyushMandloi 24 days ago

    Also please explain Bucketing and partitioning

  • @krunalsuthar1420
    @krunalsuthar1420 1 month ago

    Please make video on ORC and Avro as well

  • @prathamesh_a_k
    @prathamesh_a_k 7 months ago

    nice explanation brother

  • @prathamesh_a_k

    @prathamesh_a_k

    7 months ago

    can you make one video on ORC also

  • @khadarvalli3805
    @khadarvalli3805 10 months ago

  • @nileshgodase1007
    @nileshgodase1007 6 months ago

    Please explain nested JSON to DataFrame.

  • @nikhiljain8411
    @nikhiljain8411 1 month ago

    How will one know which 1 lakh (100,000) records to fetch the data from? Don't we still need to scan the complete file? Kindly explain.

  • @Wandering_words_of_INFJ
    @Wandering_words_of_INFJ 8 months ago

    Manish, if we sort the data (ascending or descending) before writing the Parquet file, retrieval should be faster, right? Because each row group's metadata would then have min and max values covering a narrow range? Please correct me if I am wrong.
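
The intuition in this question can be checked with a small plain-Python model. It is a sketch under simplifying assumptions: `make_row_groups` and `groups_to_read` are hypothetical helpers, each row group holds a fixed number of values, and a reader is assumed to skip any row group whose min/max range cannot contain the value being looked up.

```python
def make_row_groups(values, group_size):
    """Chunk values into row groups and record each group's (min, max)."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, target):
    """Count row groups whose [min, max] range could contain target."""
    return sum(1 for low, high in stats if low <= target <= high)


values = [9, 1, 12, 4, 7, 2, 11, 5, 3, 10, 6, 8]
unsorted_stats = make_row_groups(values, 4)
sorted_stats = make_row_groups(sorted(values), 4)

print(unsorted_stats)                      # [(1, 12), (2, 11), (3, 10)]
print(sorted_stats)                        # [(1, 4), (5, 8), (9, 12)]
print(groups_to_read(unsorted_stats, 6))   # 3
print(groups_to_read(sorted_stats, 6))     # 1
```

With unsorted data every row group's range overlaps the lookup value, so nothing can be skipped; after sorting, the ranges no longer overlap and only one row group needs to be read.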

  • @shubhamwaingade4144
    @shubhamwaingade4144 5 months ago

    One doubt: I did not fully understand logical partitioning. It resembles the file size we can set in the Spark config. Please help me understand it.

  • @vaibhavshanbhag5016
    @vaibhavshanbhag5016 8 days ago

    @manish_kumar_1 Sir, what great content you've made, I really enjoyed it, thank you!

  • @prashantmane2446
    @prashantmane2446 1 day ago

    "Error occurred while processing file"?? I keep getting this error... help:

  • @sumitchoubey1284
    @sumitchoubey1284 1 month ago

    Unable to install parquet-tools. Can you help or point me in the right direction?

  • @royalkumar7658
    @royalkumar7658 11 months ago

    How is null written to disk??

  • @radheshyama448
    @radheshyama448 11 months ago

    😇

  • @pankajsolunke3714
    @pankajsolunke3714 1 year ago

    Hi Manish sir, thanks for bringing such valuable info. I have a question: how can we handle schema evaluation in Parquet?

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    I didn't get you

  • @mohitdaxini3067

    @mohitdaxini3067

    1 year ago

    I think he wants to know about schema evolution

  • @sankuM

    @sankuM

    1 year ago

    @@manish_kumar_1 I think @pankajsolunke3714 is asking how to handle schema evolution in parquet if we can?

  • @avisinha2844
    @avisinha2844 1 year ago

    Hello Manish, i really like your videos, thanks for the efforts you put in. Have a question, can you please tell a good tutorial/course that we can go through to get really good at pyspark, if not a single resource then what are the various resources that we can go through to get good at pyspark coding.

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    You don't need a course. Still, if you want one, you can buy the Udemy course by Prashant Pandey titled PySpark for Beginner. The rest depends on how many questions you want to solve. Solve more problems rather than chasing multiple courses. Practice is the key to success, not the number of courses you have done.

  • @sankuM

    @sankuM

    1 year ago

    sparkbyexamples is the RESOURCE we need for practice!! :)

  • @royalkumar7658

    @royalkumar7658

    11 months ago

    Where can we practice spark from?

  • @royalkumar7658

    @royalkumar7658

    11 months ago

    ​@@manish_kumar_1 where can we practice spark from??

  • @manish_kumar_1

    @manish_kumar_1

    11 months ago

    @@royalkumar7658 From LeetCode. Follow the playlist from the start and you will find out where and how.

  • @MCAMadeEasy
    @MCAMadeEasy 3 months ago

    Manish bhai, nested json?

  • @navjotsingh-hl1jg
    @navjotsingh-hl1jg 10 months ago

    Brother, why were 4 row groups used for 500 GB of data? It is 128 MB, Manish. Can you explain why?

  • @ShekharBhide
    @ShekharBhide 1 year ago

    Sir, the Parquet file is not downloading from GitHub.

  • @patilsahab4278
    @patilsahab4278 6 months ago

    Hi bro, does each row group store 128 MB or 128 GB of data? You said 128 MB, but for 500 GB you said 4 row groups. Did you mean 500 MB or 500 GB?

  • @AnubhavTayal
    @AnubhavTayal 1 year ago

    Hi Manish, thank you for the information. Could you please elaborate on the default value of 128 MB, and on how 500 GB of data converts to 4 row groups? Thank you

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    500 MB, not GB. 500 divided by 128, rounded up, is 4, so 4 blocks of data will be created. HDFS has a default block size of 128 MB, and multiple cloud service providers use the same block size. So if you have 140 MB of data, one partition will be 128 MB and the next will hold just 12 MB.
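
The arithmetic in this reply can be written out as a short Python sketch. `split_into_blocks` is a hypothetical helper, not a Spark or HDFS API, and it assumes whole-megabyte sizes and the 128 MB default block size mentioned above.

```python
def split_into_blocks(total_mb, block_mb=128):
    """Return the sizes (in MB) of the blocks a file of total_mb would occupy."""
    if total_mb < 0 or block_mb <= 0:
        raise ValueError("total_mb must be >= 0 and block_mb must be > 0")
    full_blocks, remainder = divmod(total_mb, block_mb)
    sizes = [block_mb] * full_blocks
    if remainder:
        # The last block only holds whatever is left over.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(140))       # [128, 12]
print(split_into_blocks(500))       # [128, 128, 128, 116]
print(len(split_into_blocks(500)))  # 4
```

So 500 MB yields three full 128 MB blocks plus one 116 MB block, which is where the count of 4 comes from.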

  • @AnubhavTayal

    @AnubhavTayal

    1 year ago

    @@manish_kumar_1 thank you so much!

  • @yogesh9992008
    @yogesh9992008 1 year ago

    Cmd-parquet-tool issue

  • @manish_kumar_1
    @manish_kumar_1 1 year ago

    Directly connect with me on:- topmate.io/manish_kumar25

  • @tnmyk_
    @tnmyk_ 5 months ago

    Where is the nested JSON video? You said you will make a separate video on it in the previous lecture "how to read json file in Pysaprk"

  • @manish_kumar_1

    @manish_kumar_1

    5 months ago

    Lec 23

  • @dineshboliwar9545
    @dineshboliwar9545 1 year ago

    Sir, please make a short video on downloading and installing parquet-tools.

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    Sure

  • @Marcopronto
    @Marcopronto 1 year ago

    Hi Manish, In the last video, you told that u will explain about nested json in further videos. Where can i find that?

  • @manish_kumar_1

    @manish_kumar_1

    11 months ago

    I have not done yet. I will try to make one soon.

  • @220piyush

    @220piyush

    3 months ago

    Brother, please make that one. The industry runs on it.

  • @rpraveenkumar007
    @rpraveenkumar007 1 year ago

    Hi Manish, what is projection pruning? Unable to find it on Google. Or is it Partition Pruning*? Can you please explain/clarify?

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    It is projection pushdown, in which columns are pruned. So projection pushdown and projection pruning are the same thing.

  • @rpraveenkumar007

    @rpraveenkumar007

    1 year ago

    @@manish_kumar_1 thanks for clarifying!

  • @yogesh9992008
    @yogesh9992008 1 year ago

    Stage failure error show

  • @ajaypatil1881
    @ajaypatil1881 9 months ago

    Example of Modi ji for finding age >18 was highlight of the video

  • @manish_kumar_1

    @manish_kumar_1

    8 months ago

    😂😂

  • @sankuM
    @sankuM 1 year ago

    Hey @manish_kumar_1, I was able to use the modes (append, overwrite, etc.) using this command:
    df.write.option("header", first_row_is_header) \
        .option("sep", delimiter) \
        .mode("Overwrite") \
        .csv(file_location)
    All other ways of writing return an error on Databricks if the file exists, even if we're trying to append the data..! :| Unsure why this is happening...! :\

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    Same here. May be due to community edition. In production environment it does work

  • @sankuM

    @sankuM

    1 year ago

    @@manish_kumar_1 oh..okay! Still weird, though!!! I'm yet to try databricks in production..

  • @dineshboliwar9545
    @dineshboliwar9545 1 year ago

    anybody help me please i cant read parquet file using command prompt

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    No problem. You can read it directly in Databricks. Just watch the video carefully once.

  • @dineshboliwar9545

    @dineshboliwar9545

    1 year ago

    @manish_kumar_1 I've done it in Databricks; it's the command prompt that isn't working.

  • @aravind5310
    @aravind5310 1 year ago

    Your content is good. Why don't you make videos in English?

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    I don't know English 😒. Just joking, I may record a session in the future, but not for now.

  • @izahmad90

    @izahmad90

    11 months ago

    kzread.info/dash/bejne/rIFmsaN6pq2vpLQ.html&ab_channel=knowledgeEpicenter (We are making videos for those people for whom no one is making videos.)

  • @ajaywade9418
    @ajaywade9418 8 months ago

    21:25 500 GB or 500 Mb ?

  • @220piyush

    @220piyush

    3 months ago

    500 mb

  • @KavyaPristha
    @KavyaPristha 7 days ago

    Please drop your Twitter or X account. I will promote you. You are the only person on YouTube who is actually teaching something useful in the DE field. That TOO IN HINDI. Great work and great effort. God Bless You !!

  • @manish_kumar_1

    @manish_kumar_1

    6 days ago

    I don't have ex😂😂. Sorry I mean this X

  • @KavyaPristha

    @KavyaPristha

    6 days ago

    @@manish_kumar_1 Hahaha. Please create one then. It pays better than KZread

  • @manish_kumar_1

    @manish_kumar_1

    6 days ago

    @@KavyaPristha oh is it. I did not know about this.

  • @DevSharma_31
    @DevSharma_31 1 year ago

    import pyarrow as pa
    import pyarrow.parquet as pq
    parquet_file = pq.ParquetFile(r'C:\Users\DELL\Desktop\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet')
    parquet_file.metadata
    parquet_file.metadata.row_group(0)
    parquet_file.metadata.row_group(0).column(0)
    parquet_file.metadata.row_group(0).column(0).statistics
    Not able to see any output with this file. Not sure why

  • @manish_kumar_1

    @manish_kumar_1

    1 year ago

    You're not getting an error either?

  • @ranvijaymehta
    @ranvijaymehta 1 year ago

    Thanks sir