Microsoft Fabric: Good Notebook Development Practices 📓 (End to End Demo - Part 8)

Science & Technology

Comments: 18

  • @endjin
    @endjin · 2 months ago

    Thank you for watching! If you enjoyed this episode, please hit like 👍, subscribe, and turn notifications on 🔔. It helps us more than you know. 🙏

  • @MuhammadKhan-wp9zn
    @MuhammadKhan-wp9zn · 1 month ago

    This is framework-level work. I'm not sure how many will understand and appreciate the effort you put into creating this video, but I highly appreciate your thoughts and work. At one point I was wondering how I would build a framework if I got the chance, and you've given a very nice guideline here. Once again, thank you for the video; I'd like to see your other videos too.

  • @Nalaka-Wanniarachchi
    @Nalaka-Wanniarachchi · 2 months ago

    Nice share, a good round-up of best practices...

  • @ManojKatasani
    @ManojKatasani · 1 month ago

    You are the best.

  • @ManojKatasani
    @ManojKatasani · 1 month ago

    Very clean explanation, I appreciate your efforts. Is there any chance we could get the code for each layer (Bronze to Silver, etc.)? Thanks in advance.

  • @ThePPhilo
    @ThePPhilo · 2 months ago

    Great videos 👍👍 Microsoft advocates using separate workspaces for bronze, silver and gold, but that seems harder to achieve due to some current limitations. If we go with a single workspace and a folder-based setup like the example, will it be hard to switch to separate workspaces in future? Is there any prep we can do to make this switch easier going forward (or would there be no need to switch to a workspace approach)?

  • @StefanoMagnasco-po5bb
    @StefanoMagnasco-po5bb · 2 months ago

    Thanks for the great video, very useful. One question: you are using PySpark in your notebooks, but how would you recommend modularizing the code in Spark SQL? Maybe by defining UDFs in separate notebooks that are then called in the 'parent' notebook?

  • @endjin

    @endjin · 2 months ago

    Sadly, you don't have many options here without falling back to Python/Scala. You can modularize at a very basic level by using notebooks as the "modules", each containing a set of cells of Spark SQL commands, and then calling those notebooks from the parent notebook. Otherwise, as you say, one step further would be defining UDFs in Python and registering them with spark.udf.register so that they can be invoked from SQL (see the sketch below). Ed
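    A minimal sketch of that UDF approach, assuming the spark session provided by the notebook; the function and table names are illustrative, not from the video:

        from pyspark.sql.types import StringType

        def clean_country_code(value):
            # Normalise a free-text country code: trim, upper-case, blanks become None.
            if value is None or value.strip() == "":
                return None
            return value.strip().upper()

        # Register the Python function so Spark SQL cells can call it by name.
        spark.udf.register("clean_country_code", clean_country_code, StringType())

        # Invoke it from SQL (e.g. in a %%sql cell or via spark.sql).
        spark.sql("SELECT clean_country_code(country) AS country FROM bronze_customers").show()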

  • @gpc39
    @gpc39 · 2 months ago

    Very useful. One thing I would like to do is avoid having to add Lakehouses to each notebook. Is there a way to do this within the notebook? Eventually, given two Lakehouses, Bronze and Silver, I would want to merge from the Bronze table into the Silver table. I have the merge statement working; it's just the adding of the Lakehouses that I can't see how to do. I'm doing most of the programming with SQL, as I am less adept with PySpark, but am learning. Thanks, Graham

  • @endjin

    @endjin · 2 months ago

    Hi Graham. Thanks for your comment! By "do this within the notebook" do you mean "attach a Lakehouse programmatically"? If so, take a look at this: community.fabric.microsoft.com/t5/General-Discussion/How-to-set-default-Lakehouse-in-the-notebook-programmatically/m-p/3732975. To my understanding, a notebook needs at least one Lakehouse attached to it in order to run Spark SQL statements that read from Lakehouses. Once it has one, remember that you can reference other Lakehouses in the same workspace using two-part naming (SELECT * FROM <lakehouse>.<table>) without having to attach those Lakehouses explicitly. And if you need to reference Lakehouses from other workspaces, you'll need to add a shortcut first and then use two-part naming (see the sketch below). Ed
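    A minimal sketch of the two-part naming idea, assuming a notebook with at least one Lakehouse attached; the Lakehouse and table names here are hypothetical:

        # With a default Lakehouse attached, other Lakehouses in the same workspace
        # can be referenced as <lakehouse>.<table> without attaching them.
        spark.sql("""
            MERGE INTO Silver.customers AS tgt
            USING Bronze.customers AS src
              ON tgt.CustomerId = src.CustomerId
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)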

  • @ramonsuarez6105
    @ramonsuarez6105 · 2 months ago

    Excellent video, thanks. Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads? Why not use Python files or Spark job definitions instead of some of the notebooks that only have classes and methods? How do you integrate these notebooks with testing in CI/CD?

  • @endjin

    @endjin · 2 months ago

    All great questions! I'll be covering most of your points in upcoming videos, but for now I'll try to answer them here.

    > Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads?

    Yes, we either use a single orchestration notebook or a pipeline to chain the stages together. On the notebook side, more recently we've been experimenting with the "new" mssparkutils.notebook.runMultiple() utility function to create our logical DAG for the pipeline. We've found this to be quite powerful so far. On the pipeline side, the simplest thing to do is to have multiple notebook activities chained together in the correct dependency tree. The benefit of the first method is that the same Spark session is used. This is particularly appealing in Azure Synapse Analytics, where Spark sessions take a while to provision (although this is less significant in Fabric, since sessions are provisioned much more quickly). The benefit of the pipeline approach is that it's nicer to visualise, but arguably harder to code review given its underlying structure.

    One thing we do in either option is make the process metadata-driven. So we'd have an input parameter object which captures the variable bits of configuration about how our pipeline should run, e.g. ingestToBronze = [true/false], ingestToBronzeConfig = {...}, processToSilver = [true/false], processToSilverConfig = {...}, and so on. This contains all the information we need to control the flow of the various permutations of processing we need. Stay tuned - there'll be a video on this later in the series! (There's a rough sketch of this pattern below.)

    > Why not use python files or spark job definitions instead of some of the notebooks that only have classes and methods?

    We could, and we sometimes do! But the reality is that managing custom Python libraries and SJDs is a step up the maturity ladder that not every org is ready to adopt. So this video was meant to provide some inspiration about a happy middle ground: still using notebooks, but structuring them in such a way that follows good development practices and would make it easier to migrate to custom libraries in the future should that be necessary.

    > How do you integrate these notebooks with testing in CI/CD?

    Generally we create a separate set of notebooks that serve as our tests. See endjin.com/blog/2021/05/how-to-test-azure-synapse-notebooks. Naturally, development of these tests isn't great within a notebook, and it's a bit cumbersome to take advantage of some of the popular testing frameworks out there (pytest/behave). But some tests are better than no tests. Then, to integrate into CI/CD, we'd wrap our test notebooks in a Data Factory pipeline and call that pipeline from ADO/GitHub: learn.microsoft.com/en-us/fabric/data-factory/pipeline-rest-api#run-on-demand-item-job.

    Sadly I can't cover absolutely everything in this series, so I hope this comment helps!
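    A rough, illustrative sketch of the metadata-driven runMultiple() pattern described above, assuming the Fabric notebook environment where mssparkutils is available; the config keys and child notebook names are hypothetical, not taken from the video:

        # Hypothetical input parameter object controlling which stages run and how.
        pipeline_config = {
            "ingestToBronze": True,
            "ingestToBronzeConfig": {"source": "sales", "loadType": "incremental"},
            "processToSilver": True,
            "processToSilverConfig": {"mergeKeys": ["CustomerId"]},
        }

        # Build the logical DAG for mssparkutils.notebook.runMultiple(),
        # skipping any stage the config has switched off.
        activities = []
        if pipeline_config["ingestToBronze"]:
            activities.append({
                "name": "IngestToBronze",
                "path": "Ingest To Bronze",      # child notebook name (hypothetical)
                "args": pipeline_config["ingestToBronzeConfig"],
                "dependencies": [],
            })
        if pipeline_config["processToSilver"]:
            activities.append({
                "name": "ProcessToSilver",
                "path": "Process To Silver",     # child notebook name (hypothetical)
                "args": pipeline_config["processToSilverConfig"],
                "dependencies": ["IngestToBronze"] if pipeline_config["ingestToBronze"] else [],
            })

        # Run all activities in one Spark session, respecting the dependency tree.
        if activities:
            mssparkutils.notebook.runMultiple({"activities": activities})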

  • @ramonsuarez6105

    @ramonsuarez6105 · 2 months ago

    @endjin Thanks a lot Ed. You guys do a great job with your videos and posts. Super helpful and inspiring :)

  • @jensenb9
    @jensenb9 · 2 months ago

    Great stuff. Is this E2E content hosted in a Git repo somewhere that we can access? Thanks!

  • @endjin

    @endjin · 2 months ago

    Not yet, but I believe that is the plan.

  • @EllovdGriek

    @EllovdGriek · 1 day ago

    @endjin I had the same question. Great content, and I wanted to try it myself with the guidance of your code. So, is it available somewhere?

  • @endjin

    @endjin · 1 day ago

    @EllovdGriek Unfortunately, not yet.

  • @MuhammadKhan-wp9zn
    @MuhammadKhan-wp9zn · 26 days ago

    How can I contact you? Please let me know. Thanks.
