endjin

We help small teams achieve big things.

We are a UK-based, fully remote technology consultancy specialising in Data, AI, DevOps & Cloud, and a .NET Foundation Corporate Sponsor.

We produce two free weekly newsletters:
☁️ Azure Weekly - azureweekly.info, covering all things Microsoft Azure,
📈 Power BI Weekly - powerbiweekly.info, covering data visualisation and the Power Platform.

Keep up with everything that's going on at endjin:

👉endjin.com/blog
👉twitter.com/endjin
👉www.linkedin.com/company/endjin

Information about our Open Source projects can be found at endjin.com/open-source

Find out more at endjin.com

#Microsoft #MicrosoftFabric #PowerBI #AI #Data #Analytics #DevOps #Azure #Cloud

Comments

  • @moeeljawad5361 · 12 hours ago

    Hello, nice video. Could you please explain where exactly you got the DFS URL that you pasted into the ADLS shortcut connector? My URL has a blob part in it, and that is preventing me from making the connection. Thanks

  • @MdRahman-wl6qi · 4 days ago

    How do you move data from one lakehouse to another lakehouse's table using PySpark?

  • @ullajutila1659 · 15 days ago

    The video was ok, but it did not explain how the ADLS storage account networking needs to be configured. More specifically, how to configure it in a secure manner, without allowing access from all networks.

  • @ravishkumar1739 · 18 days ago

    Hi @endjin, great videos! Have you uploaded the architecture diagram file anywhere that I can download and reuse for my own projects?

  • @MuhammadKhan-wp9zn · 26 days ago

    How can I contact you? Please let me know. Thanks

  • @MuhammadKhan-wp9zn · a month ago

    This is framework-level work. I'm not sure how many people will understand and appreciate the effort you put into creating this video, but I highly appreciate your thinking and your work. At one point I was wondering how I would go about it if I got the chance to create a framework, and you've given a very nice guideline here. Once again, thank you for the video; I'd like to see your other videos too.

  • @vinzent345 · a month ago

    Is there an option to connect from your local machine directly to the Synapse Spark cluster? It doesn't seem that debug-friendly, having to compile & upload it every time. It almost feels more sensible to host your own autoscaling Spark cluster in Azure Kubernetes Service. If I do so, I can interact directly with the cluster and build sessions locally. What do you think?

  • @idg10 · a month ago

    In this scenario, it would make more sense to run Spark locally. There are a few ways you can do that, but as you'd expect it's not entirely straightforward, and not something easily addressed in a comment.
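
    For illustration, a minimal sketch of one way to run Spark locally (not necessarily the specific approach alluded to above); it assumes pip install pyspark delta-spark, and the paths are placeholders:

        # Hypothetical local-development setup: a local SparkSession with Delta Lake
        # support, so Lakehouse-style code can be exercised before uploading to Synapse.
        from delta import configure_spark_with_delta_pip
        from pyspark.sql import SparkSession

        builder = (
            SparkSession.builder
            .master("local[*]")                      # run Spark in-process on all local cores
            .appName("local-dev")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Quick smoke test: write and read a local Delta table
        spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
        spark.read.format("delta").load("/tmp/delta_smoke_test").show()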

  • @ManojKatasani · a month ago

    Very clean explanation, appreciate your efforts. Is there any chance we could get the code for each layer (Bronze to Silver, etc.)? Thanks in advance.

  • @ManojKatasani · a month ago

    you are the best

  • @rodrihc · a month ago

    Thanks for the video! Is there any way to run the Synapse test notebooks from the CI/CD pipeline in Azure DevOps?

  • @jamesbroome949 · a month ago

    There are a couple of ways to achieve this - neither is immediately obvious, but both are definitely possible!

    There's no API for just running a notebook in Synapse, but you can submit a Spark batch job via the API. However, this requires a Python file as input, so it might mean pulling your tests out of a notebook and writing and storing them separately in an associated ADO repo: learn.microsoft.com/en-us/rest/api/synapse/data-plane/spark-batch/create-spark-batch-job?view=rest-synapse-data-plane-2020-12-01&tabs=HTTP

    Possibly an easier route would be to create a separate Synapse pipeline definition that runs your test notebook(s) and use the API to trigger that pipeline run from your ADO pipeline. This is a straightforward REST API but operates asynchronously, so you'd need to poll for completion as the pipeline/tests are running: learn.microsoft.com/en-us/rest/api/synapse/data-plane/pipeline/create-pipeline-run?view=rest-synapse-data-plane-2020-12-01&tabs=HTTP

    Hope that helps!
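
    For illustration, a minimal Python sketch of the second route (triggering a Synapse pipeline from a CI/CD job and polling the run status); the workspace and pipeline names are placeholders, not from the video:

        # Hedged sketch: trigger a Synapse pipeline that wraps the test notebooks,
        # then poll the run status until it reaches a terminal state.
        import time
        import requests
        from azure.identity import DefaultAzureCredential

        workspace = "my-synapse-workspace"   # placeholder workspace name
        pipeline = "RunTestNotebooks"        # placeholder pipeline wrapping the test notebooks
        base = f"https://{workspace}.dev.azuresynapse.net"

        token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default").token
        headers = {"Authorization": f"Bearer {token}"}

        # Kick off the pipeline run
        run = requests.post(
            f"{base}/pipelines/{pipeline}/createRun?api-version=2020-12-01",
            headers=headers,
        ).json()
        run_id = run["runId"]

        # Poll for completion
        while True:
            status = requests.get(
                f"{base}/pipelineruns/{run_id}?api-version=2020-12-01",
                headers=headers,
            ).json()["status"]
            if status in ("Succeeded", "Failed", "Cancelled"):
                break
            time.sleep(30)

        assert status == "Succeeded", f"Test pipeline finished with status: {status}"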

  • @YvonneWurm · a month ago

    Do you know how to view the definition of a view or stored procedure?

  • @jamesbroome949 · a month ago

    Hi - I don't believe there's a way in Synapse Studio to automatically script out the definitions like you can in, say, SQL Server Management Studio. But you can see the column definitions for your view if you find your database under the Data tab and expand the nodes in the explorer. Hope that helps!

  • @datasets-rv7jf · 2 months ago

    Looking forward to many more of these!

  • @endjin · 2 months ago

    Barry has ~9 parts planned!

  • @ThePPhilo · 2 months ago

    Great videos 👍👍 Microsoft advocates using separate workspaces for bronze, silver and gold, but that seems to be harder to achieve due to some current limitations. If we go with a single workspace and a folder-based setup like the example, will it be hard to switch to separate workspaces in future? Is there any prep we can do to make this switch easier going forward (or would there be no need to switch to a workspace approach)?

  • @ramonsuarez6105 · 2 months ago

    Thanks a lot Barry. Great video. I couldn't find the repository for the series among your GitHub repos. Will there be one?

  • @vesperpiano · 2 months ago

    Thanks for the feedback. Glad to hear you are enjoying it. Yes - we are planning to release the code for this project on Git at some point soon.

  • @StefanoMagnasco-po5bb · 2 months ago

    Thanks for the great video, very useful. One question: you are using PySpark in your notebooks, but how would you recommend modularizing the code in Spark SQL? Maybe by defining UDFs in separate notebooks that are then called in the 'parent' notebook?

  • @endjin · 2 months ago

    Sadly you don't have that many options here without having to fall back to Python/Scala. You can modularize at a very basic level using notebooks as the "modules", containing a bunch of cells which contain Spark SQL commands. Then call these notebooks from the parent notebook. Otherwise, as you say, one step further would be defining UDFs using some Python and then using spark.udf.register to be able to invoke them from SQL. Ed
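
    For illustration, a minimal sketch of the UDF route described above; the function, table, and notebook names are made up:

        # In a "module" notebook: define a Python function and register it so that
        # Spark SQL cells can call it by name.
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()

        def clean_postcode(value):
            # Normalise a postcode by trimming whitespace and upper-casing
            return value.strip().upper() if value else None

        spark.udf.register("clean_postcode", clean_postcode, StringType())

        # In the parent notebook, after pulling the module notebook in with %run,
        # the UDF is available from Spark SQL (the table name is hypothetical):
        spark.sql("SELECT clean_postcode(postcode) AS postcode FROM bronze_land_registry").show()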

  • @applicitaaccount1258 · 2 months ago

    Really looking forward to this series, thanks for taking the time to put it together. I really enjoy the pace and detail level of the content @endjin puts together.

  • @applicitaaccount1258 · 2 months ago

    Great series. What's the naming convention you are using in the full version of the solution? I noticed the lakehouse (LH) is prefixed with HPA.

  • @endjin · 2 months ago

    The HPA prefix stands for "House Price Analytics", although the architecture diagram on the second video has slightly old names, as you've probably noticed. The full version uses <medallion_layer>_Demo_LR, where LR stands for "Land Registry". Ed

  • @applicitaaccount1258 · 2 months ago

    @endjin - Thanks for the clarification Ben.

  • @kingmharbayato643 · 2 months ago

    Finally someone made this video!! Thank you for doing this.

  • @malleshmjolad · 2 months ago

    Hi Ed, we have loaded a few tables into ADLS Gen2 using Synapse Link and created a shortcut to access the ADLS Gen2 files in Fabric. But while loading the files into tables, we are not getting the column names; they show up as c0, c1, ... etc., which is causing an issue. Can you please give some insights on how to overcome this and load the tables with the metadata as well?

  • @endjin · 2 months ago

    Hi - thanks for the comment! Which Synapse Link are you using? Dataverse? If so, this uses the CDM model.json format which doesn't include header rows in the underlying CSV files. You would have to read the shortcut data, apply the schema manually, and then write the data out to another table (inside a Fabric notebook or something) if you wanted to use that existing data. However, if you're using Synapse Link for Dataverse, you should instead consider using the new "Link to Microsoft Fabric" feature available in Dataverse: learn.microsoft.com/en-us/power-apps/maker/data-platform/azure-synapse-link-view-in-fabric. This will include the correct schema.
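
    For illustration, a minimal PySpark sketch of the manual workaround described above (reading the headerless shortcut files, applying a schema, and writing a new table); the path, columns, and types are placeholders, not taken from a real model.json:

        # Hedged sketch: apply a schema to headerless CSVs exposed via an ADLS Gen2
        # shortcut, then persist them as a proper Lakehouse Delta table.
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StructType, StructField, StringType, DecimalType, TimestampType

        spark = SparkSession.builder.getOrCreate()

        schema = StructType([
            StructField("accountid", StringType()),
            StructField("name", StringType()),
            StructField("revenue", DecimalType(18, 2)),
            StructField("modifiedon", TimestampType()),
        ])

        df = (
            spark.read
            .schema(schema)                  # the CSVs have no header row, so supply the schema
            .option("header", "false")
            .csv("Files/synapse_link_shortcut/account/*.csv")   # placeholder shortcut path
        )

        df.write.mode("overwrite").format("delta").saveAsTable("bronze_account")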

  • @gpc39 · 2 months ago

    Very useful. One thing I would like to do is avoid having to add lakehouses to each notebook. Is there a way to do this within the notebook? Eventually given two Lakehouses, Bronze and Silver, I would want to merge from the Bronze table into the Silver Table. I have the merge statement working; it's just the adding of the lakehouses, which I can't see. I'm doing most of the programming with SQL, as I am less adept with PySpark, but am learning. Thanks Graham

  • @endjin · 2 months ago

    Hi Graham. Thanks for your comment! By "do this within the notebook" do you mean "attach a Lakehouse programmatically"? If so, take a look at this: community.fabric.microsoft.com/t5/General-Discussion/How-to-set-default-Lakehouse-in-the-notebook-programmatically/m-p/3732975

    By my understanding, a notebook needs to have at least one Lakehouse attached to it in order to run Spark SQL statements that read from Lakehouses. Once it has one Lakehouse, remember that you can reference other Lakehouses in your workspace by using two-part naming (SELECT * FROM <lakehouse_name>.<table_name>) without having to explicitly attach the other Lakehouses. And if you need to reference Lakehouses from other workspaces, you'll need to add a shortcut first and then use two-part naming. Ed
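
    For illustration, a small sketch of that two-part naming approach, including a merge from Bronze into Silver; the Lakehouse, table, and column names are placeholders:

        # Assumes the notebook has at least one Lakehouse attached; other Lakehouses
        # in the workspace are referenced with <lakehouse_name>.<table_name>.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Read from another Lakehouse in the same workspace without attaching it
        bronze_df = spark.sql("SELECT * FROM Bronze_Lakehouse.land_registry")

        # Merge Bronze rows into the Silver table using the same two-part naming
        spark.sql("""
            MERGE INTO Silver_Lakehouse.land_registry AS target
            USING Bronze_Lakehouse.land_registry AS source
            ON target.transaction_id = source.transaction_id
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)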

  • @jensenb9 · 2 months ago

    Great stuff. Is this E2E content hosted in a Git repo somewhere that we can access? Thanks!

  • @endjin · 2 months ago

    Not yet, but I believe that is the plan.

  • @EllovdGriek · a day ago

    @endjin I had the same question. Great content, and I wanted to try it myself with the guidance of your code. So, is it available somewhere?

  • @endjin · a day ago

    @EllovdGriek Unfortunately, not yet.

  • @Nalaka-Wanniarachchi · 2 months ago

    Nice round-up of best practices...

  • @ramonsuarez6105 · 2 months ago

    Excellent video, thanks. Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads? Why not use python files or spark job definitions instead of some of the notebooks that only have classes and methods ? How do you integrate these notebooks with testing in CD/CI?

  • @endjin · 2 months ago

    All great questions! I'll be covering most of your points in upcoming videos, but for now I'll try to answer them here.

    > Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads?

    Yes, we either use a single orchestration notebook or a pipeline to chain the stages together. On the notebook side, more recently we've been experimenting with the "new" mssparkutils.notebook.runMultiple() utility function to create our logical DAG for the pipeline. We've found this to be quite powerful so far. On the pipeline side, the simplest thing to do is to have multiple notebook activities chained together in the correct dependency tree.

    The benefit of the first method is that the same Spark session is used. This is particularly appealing in Azure Synapse Analytics, where Spark sessions take a while to provision (although this is less significant in Fabric, since sessions are provisioned much more quickly). The benefit of the pipeline approach is that it's nicer to visualise, but arguably harder to code review given its underlying structure.

    One thing we do in either option is make the process metadata-driven. So we'd have an input parameter object which captures the variable bits of configuration about how our pipeline should run, e.g. ingestToBronze = [true/false], ingestToBronzeConfig = {...}, processToSilver = [true/false], processToSilverConfig = {...}, .... This contains all the information we need to control the flow of the various permutations of processing we need. Stay tuned - there'll be a video on this later on in the series!

    > Why not use python files or spark job definitions instead of some of the notebooks that only have classes and methods?

    We could, and we sometimes do! But the reality is that managing custom Python libraries and SJDs is a step up the maturity ladder that not every org is ready to adopt. So this video was meant to provide some inspiration about a happy middle ground: still using notebooks, but structuring them in such a way that follows good development practices and would make it easier to migrate to custom libraries in the future should that be necessary.

    > How do you integrate these notebooks with testing in CD/CI?

    Generally we create a separate set of notebooks that serve as our tests. See endjin.com/blog/2021/05/how-to-test-azure-synapse-notebooks. Naturally, development of these tests isn't great within a notebook, and it's a bit cumbersome to take advantage of some of the popular testing frameworks out there (pytest/behave). But some tests are better than no tests. Then, to integrate into CI/CD, we'd wrap our test notebooks in a Data Factory pipeline and call that pipeline from ADO/GitHub: learn.microsoft.com/en-us/fabric/data-factory/pipeline-rest-api#run-on-demand-item-job.

    Sadly I can't cover absolutely everything in this series, so I hope this comment helps!
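
    For illustration, a hedged sketch of the orchestration-notebook option (a metadata-driven config object plus mssparkutils.notebook.runMultiple()); the notebook names and the config shape are made up, not endjin's actual implementation:

        # Hedged sketch: chain Bronze/Silver/Gold notebooks as a logical DAG within a
        # single Spark session, driven by a metadata object.
        from notebookutils import mssparkutils
        import json

        # Metadata that controls how the pipeline should run (shape is illustrative)
        config = {
            "ingestToBronzeConfig": {"source": "land_registry", "loadType": "incremental"},
            "processToSilverConfig": {"mergeKeys": ["transaction_id"]},
            "processToGoldConfig": {},
        }

        # Logical DAG of notebook runs; later stages depend on earlier ones
        dag = {
            "activities": [
                {
                    "name": "IngestToBronze",
                    "path": "Ingest To Bronze",            # hypothetical notebook name
                    "args": {"config": json.dumps(config["ingestToBronzeConfig"])},
                    "dependencies": [],
                },
                {
                    "name": "ProcessToSilver",
                    "path": "Process To Silver",
                    "args": {"config": json.dumps(config["processToSilverConfig"])},
                    "dependencies": ["IngestToBronze"],
                },
                {
                    "name": "ProcessToGold",
                    "path": "Process To Gold",
                    "args": {"config": json.dumps(config["processToGoldConfig"])},
                    "dependencies": ["ProcessToSilver"],
                },
            ]
        }

        mssparkutils.notebook.runMultiple(dag)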

  • @ramonsuarez6105 · 2 months ago

    @endjin Thanks a lot Ed. You guys do a great job with your videos and posts. Super helpful and inspiring :)

  • @endjin · 2 months ago

    Thank you for watching! If you enjoyed this episode, please hit like 👍, subscribe, and turn notifications on 🔔; it helps us more than you know. 🙏

  • @endjin · 2 months ago

    Thank you all for watching! If you could do me a favour, hit subscribe and turn notifications 🔔 on; it helps us more than you know.

  • @john_britto_10x · 2 months ago

    In your architecture diagram, you mention an HPA orchestrator pipeline. Is that a separate pipeline that needs to be created, in addition to the ones at the process layers below?

  • @endjin · 2 months ago

    Thanks for the comment! Yes, the orchestrator pipeline is a separate pipeline that includes the configuration that controls the flow of the pipeline. I'll be showing that in an upcoming video, so stay tuned!

  • @john_britto_10x · 2 months ago

    @endjin Awesome. Looking forward to it.

  • @john_britto_10x · 2 months ago

    endjin, have you published the 8th part of the series? If so, could you share the link, please?

  • @endjin · 2 months ago

    It's coming in the next week or so. We've just started a new series you might be interested in: kzread.info/dash/bejne/p5WGx7KBldTcgbg.html

  • @john_britto_10x · 2 months ago

    Thank you so much for replying quickly.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @sailing_life_ · 3 months ago

    Ohhh yea video #2! Nice job @enjin team!

  • @endjin · 3 months ago

    Glad you enjoyed it. 7 more parts to come. Don't forget the "d" in endjin!

  • @sailing_life_ · 3 months ago

    @endjin hah Sorry!! Endjin* :) :)

  • @iamjameson · 3 months ago

    Nice, looking forward to this series

  • @endjin · 3 months ago

    We're just editing part 2, which should be published on Thursday.

  • @rh.m6660 · 3 months ago

    Not bad, not bad at all. Could you point to a best-practice framework for this type of work, or is this just the way you like to work? I'm finding I can do what I want to the data, but I'm not sure if I'm being efficient and scalable. I like the idea of using helper notebooks; I believe custom Spark environments also support custom libraries now.

  • @endjin · 2 months ago

    This is the way we like to work, based on a background of software engineering within the company. We believe data solutions should get the same treatment as software solutions, and for that reason we're huge proponents of applying DataOps practices when building data products. In fact, we've presented at Big Data LDN and SQLBits on this very topic: endjin.com/news/sqlbits-2024-workshop-dataops-how-to-deliver-data-faster-and-better-with-microsoft-cloud

    Sadly we don't have a specific recording or blog we can share with you (yet!), but part of the point of this series is to expose you to some of our guidance in this area. So stay tuned! Check out "Part 8 - Good Notebook Development Practices", which is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

    And you're right that Environments in Fabric support custom libraries. This is another way you can package up your common code and more easily use it across workloads. But it increases the CI/CD complexity a little, so take that into account! Ed

  • @profanegardener · 4 months ago

    Great video.... I'm very new to Synapse and it was really helpful.

  • @carlnascnyc · 4 months ago

    This is indeed a great series, the best one on all of YouTube IMO. It's direct, to the point, and straightforward, with just the right amount of complexity. Can't wait for the next chapters (especially the gold zone design/loading). Please keep up the good work, and many thanks!!! Cheers!!

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @NKMAKKUVA · 4 months ago

    Thanks for the explanation

  • @GabrielSantos-qu6so · 4 months ago

    Why do you store the Bronze layer data in the Files folder and not populate tables instead? Or do you do both? Where I'm working currently, we ingest to a table, not to a file.

  • @endjin · 2 months ago

    The principle we like to take is "do as little as necessary to land your data into bronze". Any conversions/transformations/parsing you do when ingesting into bronze introduces scope for error, and ultimately modifies the "raw" state of the dataset. The reason "bronze" has also been known as "raw" in the past (i.e. before Databricks coined the Medallion Architecture) is because the data in that layer was an unadulterated form of your source dataset. The main benefits of this being: auditability, reprocessing capability, error recovery, and the flexibility it provides where different users can consume the same data in different ways.

    Naturally, you can't always store your Bronze data in the same format it's stored in the underlying source, especially if your source is a database, for example. So in this instance, some sort of conversion is necessary. What you convert it to is up to you: csv is popular, as is parquet. Ingesting into Delta tables goes a step further - you're applying a schema to the incoming data and enforcing a specific read/write protocol in the Bronze zone. I would argue this makes the Bronze data less accessible for people that need to consume it in the future (which may not be an issue for your use-case).

    If you have a valid reason to use Tables instead of Files (which I'm sure you do!), I would still recommend storing a copy of the raw file in the Files section in some format that suits you. Imagine if your Delta table gets corrupted for some reason, and the historical data is no longer available at source - what would you do then? Hope this helps! Ed
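
    For illustration, a small sketch of the "raw copy in Files, schema applied later" pattern described above; the paths and table name are placeholders, and the source is assumed to be reachable via a shortcut under Files:

        # Hedged sketch: land the source file untouched in the Bronze Files area,
        # then apply structure only when promoting the data to a Silver Delta table.
        from notebookutils import mssparkutils
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Bronze: copy the file as-is (no parsing, no schema) so the raw state is preserved
        source_path = "Files/landing_shortcut/land_registry/pp-monthly.csv"   # placeholder
        bronze_path = "Files/bronze/land_registry/2024/06/pp-monthly.csv"     # placeholder
        mssparkutils.fs.cp(source_path, bronze_path)

        # Silver: parse, apply a schema, and write a managed Delta table
        df = (
            spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(bronze_path)
        )
        df.write.mode("overwrite").format("delta").saveAsTable("silver_land_registry")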

  • @welcomesithole1501 · 4 months ago

    Amazing video. I can see that he knows what he is talking about, unlike other YouTubers who just copy and paste code from somewhere and fail to give proper explanations. I wish there was a series of this man teaching TorchSharp, especially converting his Python GPT to C#.

  • @akthar3 · 4 months ago

    worked beautifully - thank you !

  • @Tony-cc8ci · 4 months ago

    Hi, I have enjoyed the Microsoft Fabric playlist so far, very informative. This one was posted 2 months ago, so I just wanted to find out if you plan on continuing the series like you mentioned, with the helper notebooks?

  • @endjin · 4 months ago

    Yes, very much so! Ed (and Barry) are just heads down busy preparing for their workshops and talks at the SQLBits conference next week. As soon as that's done they'll have more capacity to finish off this series of videos.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @Power888w · 5 months ago

    Super video thank you!

  • @endjin · 5 months ago

    Thank you! Glad you enjoyed it! There are more coming!

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @richprice5434 · 5 months ago

    Brilliant video, perfect for what I need to do. Subscribed 🎉

  • @endjin · 5 months ago

    That's great to hear! There are a few more videos dropping soon.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    Glad you're enjoying the videos

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    We're impressed you're making your way through the series!

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    There should be a new video dropping soon, seeing that you've binged everything so far!