Understanding Databricks & Apache Spark Performance Tuning: Lesson 01 - Spark Architecture

Science and Technology

A popular interview question and a critical topic for all Databricks and Spark developers: how do you tune and optimize Spark queries? This video provides a conceptual understanding of where things can go wrong, as a starting point for understanding performance tuning and optimization.
Support me on Patreon
www.patreon.com/bePatron?u=63...
Slides
github.com/bcafferky/shared/b...

Comments: 17

  • @biouman4
    @biouman4 · 4 months ago

    Thanks, very nice video 😉

  • @sarthakmane2977
    @sarthakmane2977 · 1 month ago

    5:54, better comedian than half the comedians in the world

  • @homeitems8113
    @homeitems8113 · 3 months ago

    Waiting for next video

  • @RM-xm5gl
    @RM-xm5gl · 4 months ago

    Thank you so much. Excellent video.

  • @BryanCafferky

    @BryanCafferky

    4 months ago

    Thanks

  • @Prmmani
    @Prmmani · 4 months ago

    Thanks Sir ❤

  • @mfdba
    @mfdba · 3 months ago

    I don't know if it's always true, but I've recently discovered that Python can be significantly faster than some Spark SQL operations such as joins. I'll check, but do you have a video about monitoring cluster performance? I kind of miss the Ganglia UI. Thanks Bryan. As always, you're a great teacher and explainer of things. ❤

  • @BryanCafferky

    @BryanCafferky

    2 months ago

    This may help: learn.microsoft.com/en-us/azure/databricks/compute/cluster-metrics

  • @Andy-rw4hn
    @Andy-rw4hn · 2 months ago

    Is it possible to run Spark nodes on already concurrent HDFS?

  • @BryanCafferky

    @BryanCafferky

    2 months ago

    Spark can read from HDFS. Is that your question?

  • @Andy-rw4hn

    @Andy-rw4hn

    2 months ago

    @BryanCafferky Yes. I was not aware of the PXF HDFS connector hdfs:parquet. Thank you.

  • @Andy-rw4hn
    @Andy-rw4hn · 2 months ago

    11:50 I actually thought that the data for the query in the black box does not have to be distributed/indexed by City, and that the select/group-by can easily be made concurrent by itself.

  • @BryanCafferky

    @BryanCafferky

    2 months ago

    I am oversimplifying, but when you request joins or aggregations, you trigger a shuffle, which the documentation explains as reordering the data over the cluster nodes to, for example, co-locate data keys from the joined tables. See www.talend.com/resources/intro-apache-spark-partitioning/

  • @BryanCafferky

    @BryanCafferky

    2 months ago

    I'm trying to find more detailed info on this. Thanks.

  • @Andy-rw4hn

    @Andy-rw4hn

    2 months ago

    @BryanCafferky Thank you
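Editor's sketch for the 11:50 thread above: the shuffle mentioned in the reply can be pictured with a small pure-Python simulation. This is only an illustration under stated assumptions, not Spark code and not how Spark is implemented. The function names, the sample rows, and the use of CRC32 as the partitioning hash are all hypothetical choices made for the example. It shows why a group-by on City needs rows with the same city to be co-located before each partition can be summed independently.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # Deterministic hash partitioning (illustrative; not Spark's actual hash function).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    # Move every (key, value) row to the partition chosen by its key,
    # so rows sharing a key end up co-located.
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            shuffled[target_partition(key, num_partitions)].append((key, value))
    return shuffled


def sum_by_key(partition):
    # Local aggregation: needs no data from any other partition.
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Hypothetical sales rows, scattered over three "nodes" with no regard to City.
input_partitions = [
    [("Boston", 10), ("Austin", 5)],
    [("Austin", 7), ("Boston", 1)],
    [("Denver", 3), ("Boston", 4)],
]

shuffled = shuffle_by_key(input_partitions, num_partitions=2)
per_partition_totals = [sum_by_key(p) for p in shuffled]

# Each city now appears in exactly one partition, so the results can simply be merged.
result = {}
for totals in per_partition_totals:
    result.update(totals)
print(result)
```

On the point raised in the question: each input partition could also be pre-summed on its own before the shuffle, which reduces how much data moves. The partial sums for a given city still have to meet on one node to produce the final total, which is the step the shuffle provides.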

  • @TJ-hs1qm
    @TJ-hs1qm · 1 month ago

    There's also Spark Rapids (GPU)

  • @BryanCafferky

    @BryanCafferky

    1 month ago

    Cool. Did not know that. Here's how to set that up on Databricks. Thanks! docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html
