endjin

We help small teams achieve big things.

We are a UK-based, fully remote technology consultancy specialising in Data, AI, DevOps & Cloud, and a .NET Foundation Corporate Sponsor.

We produce two free weekly newsletters:
☁️ Azure Weekly - azureweekly.info, covering all things Microsoft Azure,
📈 Power BI Weekly - powerbiweekly.info, covering data visualisation and the Power Platform.

Keep up with everything that's going on at endjin:

👉endjin.com/blog
👉twitter.com/endjin
👉www.linkedin.com/company/endjin

Information about our Open Source projects can be found at endjin.com/open-source

Find out more at endjin.com

#Microsoft #MicrosoftFabric #PowerBI #AI #Data #Analytics #DevOps #Azure #Cloud

Comments

  • @moeeljawad5361 · 12 hours ago

    Hello, nice video. Could you please explain where exactly you got the DFS URL that you pasted into the ADLS shortcut connector? My URL has a blob part in it, and that is preventing me from making the connection. Thanks

  • @MdRahman-wl6qi · 4 days ago

    How do you move data from one lakehouse to another lakehouse's table using PySpark?

  • @ullajutila1659 · 15 days ago

    The video was ok, but it did not explain how the ADLS storage account networking needs to be configured. More specifically, how to configure it in a secure manner, without allowing access from all networks.

  • @ravishkumar1739 · 18 days ago

    Hi @endjin, great videos! Have you uploaded the architecture diagram file anywhere that I can download and reuse for my own projects?

  • @MuhammadKhan-wp9zn · 26 days ago

    How can I contact you? Please let me know. Thanks

  • @MuhammadKhan-wp9zn · a month ago

    This is framework-level work. I'm not sure how many people will understand and appreciate the effort you put into creating this video, but I highly appreciate your thinking and your work. At one point I was wondering how I would go about it if I got the chance to create a framework, and you've given a very nice guideline here. Once again, thank you for the video; I'd like to see your other videos too.

  • @vinzent345 · a month ago

    Is there an option to connect from your local machine directly to the Synapse Spark cluster? It doesn't seem that debug-friendly, having to compile & upload it every time. It almost feels more sensible to host your own autoscaling Spark cluster in Azure Kubernetes Service. If I do so, I can interact directly with the cluster and build sessions locally. What do you think?

  • @idg10 · a month ago

    In this scenario, it would make more sense to run Spark locally. There are a few ways you can do that, but as you'd expect it's not entirely straightforward, and not something easily addressed in a comment.
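
    For illustration, a minimal sketch of one way to run Spark locally (not necessarily the specific approach alluded to above); it assumes pip install pyspark delta-spark, and the paths are placeholders:

        # Hypothetical local-development setup: a local SparkSession with Delta Lake
        # support, so Lakehouse-style code can be exercised before uploading to Synapse.
        from delta import configure_spark_with_delta_pip
        from pyspark.sql import SparkSession

        builder = (
            SparkSession.builder
            .master("local[*]")                      # run Spark in-process on all local cores
            .appName("local-dev")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Quick smoke test: write and read a local Delta table
        spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
        spark.read.format("delta").load("/tmp/delta_smoke_test").show()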

  • @ManojKatasani · a month ago

    Very clean explanation, appreciate your efforts. Is there any chance we could get the code for each layer (Bronze to Silver, etc.)? Thanks in advance.

  • @ManojKatasani · a month ago

    you are the best

  • @rodrihc · a month ago

    Thanks for the video! Is there any way to run the Synapse test notebooks from the CI/CD pipeline in Azure DevOps?

  • @jamesbroome949 · a month ago

    There are a couple of ways to achieve this - neither is immediately obvious, but both are definitely possible!

    There's no API for just running a notebook in Synapse, but you can submit a Spark batch job via the API. However, this requires a Python file as input, so it might mean pulling your tests out of a notebook and writing and storing them separately in an associated ADO repo: learn.microsoft.com/en-us/rest/api/synapse/data-plane/spark-batch/create-spark-batch-job?view=rest-synapse-data-plane-2020-12-01&tabs=HTTP

    Possibly an easier route would be to create a separate Synapse pipeline definition that runs your test notebook(s) and use the API to trigger that pipeline run from your ADO pipeline. This is a straightforward REST API but operates asynchronously, so you'd need to poll for completion as the pipeline/tests are running: learn.microsoft.com/en-us/rest/api/synapse/data-plane/pipeline/create-pipeline-run?view=rest-synapse-data-plane-2020-12-01&tabs=HTTP

    Hope that helps!
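
    For illustration, a minimal Python sketch of the second route (triggering a Synapse pipeline from a CI/CD job and polling the run status); the workspace and pipeline names are placeholders, not from the video:

        # Hedged sketch: trigger a Synapse pipeline that wraps the test notebooks,
        # then poll the run status until it reaches a terminal state.
        import time
        import requests
        from azure.identity import DefaultAzureCredential

        workspace = "my-synapse-workspace"   # placeholder workspace name
        pipeline = "RunTestNotebooks"        # placeholder pipeline wrapping the test notebooks
        base = f"https://{workspace}.dev.azuresynapse.net"

        token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default").token
        headers = {"Authorization": f"Bearer {token}"}

        # Kick off the pipeline run
        run = requests.post(
            f"{base}/pipelines/{pipeline}/createRun?api-version=2020-12-01",
            headers=headers,
        ).json()
        run_id = run["runId"]

        # Poll for completion
        while True:
            status = requests.get(
                f"{base}/pipelineruns/{run_id}?api-version=2020-12-01",
                headers=headers,
            ).json()["status"]
            if status in ("Succeeded", "Failed", "Cancelled"):
                break
            time.sleep(30)

        assert status == "Succeeded", f"Test pipeline finished with status: {status}"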

  • @YvonneWurm · a month ago

    Do you know how to view the definition of a view or stored procedure?

  • @jamesbroome949 · a month ago

    Hi - I don't believe there's a way in Synapse Studio to automatically script out the definitions like you can in, say, SQL Server Management Studio. But you can see the column definitions for your view if you find your database under the Data tab and expand the nodes in the explorer. Hope that helps!

  • @datasets-rv7jf · 2 months ago

    Looking forward to many more of these!

  • @endjin · 2 months ago

    Barry has ~9 parts planned!

  • @ThePPhilo · 2 months ago

    Great videos 👍👍 Microsoft advocates using separate workspaces for bronze, silver and gold, but that seems to be harder to achieve due to some current limitations. If we go with a single workspace and a folder-based setup like the example, will it be hard to switch to separate workspaces in future? Is there any prep we can do to make this switch easier going forward (or would there be no need to switch to a workspace approach)?

  • @ramonsuarez6105 · 2 months ago

    Thanks a lot Barry. Great video. I couldn't find the repository for the series among your GitHub repos. Will there be one?

  • @vesperpiano · 2 months ago

    Thanks for the feedback. Glad to hear you are enjoying it. Yes - we are planning to release the code for this project on Git at some point soon.

  • @StefanoMagnasco-po5bb · 2 months ago

    Thanks for the great video, very useful. One question: you are using PySpark in your notebooks, but how would you recommend modularizing the code in Spark SQL? Maybe by defining UDFs in separate notebooks that are then called in the 'parent' notebook?

  • @endjin · 2 months ago

    Sadly you don't have that many options here without having to fall back to Python/Scala. You can modularize at a very basic level using notebooks as the "modules", containing a bunch of cells which contain Spark SQL commands. Then call these notebooks from the parent notebook. Otherwise, as you say, one step further would be defining UDFs using some Python and then using spark.udf.register to be able to invoke them from SQL. Ed
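
    For illustration, a minimal sketch of the UDF route described above; the function, table, and notebook names are made up:

        # In a "module" notebook: define a Python function and register it so that
        # Spark SQL cells can call it by name.
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()

        def clean_postcode(value):
            # Normalise a postcode by trimming whitespace and upper-casing
            return value.strip().upper() if value else None

        spark.udf.register("clean_postcode", clean_postcode, StringType())

        # In the parent notebook, after pulling the module notebook in with %run,
        # the UDF is available from Spark SQL (the table name is hypothetical):
        spark.sql("SELECT clean_postcode(postcode) AS postcode FROM bronze_land_registry").show()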

  • @applicitaaccount1258 · 2 months ago

    Really looking forward to this series, thanks for taking the time to put it together. I really enjoy the pace and detail level of the content @endjin puts together.

  • @applicitaaccount1258 · 2 months ago

    Great series. What's the naming convention you are using in the full version of the solution? I noticed the lakehouse (LH) is prefixed with HPA.

  • @endjin · 2 months ago

    The HPA prefix stands for "House Price Analytics", although the architecture diagram on the second video has slightly old names, as you've probably noticed. The full version uses <medallion_layer>_Demo_LR, where LR stands for "Land Registry". Ed

  • @applicitaaccount1258 · 2 months ago

    @endjin - Thanks for the clarification Ben.

  • @kingmharbayato643 · 2 months ago

    Finally someone made this video!! Thank you for doing this.

  • @malleshmjolad · 2 months ago

    Hi Ed, we have loaded a few tables into ADLS Gen2 using Synapse Link and created a shortcut to access the ADLS Gen2 files in Fabric. But while loading the files into tables, we are not getting the column names; they show up as c0, c1, ... etc., which is causing an issue. Can you please give some insights on how to overcome this and load the tables with the metadata as well?

  • @endjin · 2 months ago

    Hi - thanks for the comment! Which Synapse Link are you using? Dataverse? If so, this uses the CDM model.json format which doesn't include header rows in the underlying CSV files. You would have to read the shortcut data, apply the schema manually, and then write the data out to another table (inside a Fabric notebook or something) if you wanted to use that existing data. However, if you're using Synapse Link for Dataverse, you should instead consider using the new "Link to Microsoft Fabric" feature available in Dataverse: learn.microsoft.com/en-us/power-apps/maker/data-platform/azure-synapse-link-view-in-fabric. This will include the correct schema.
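
    For illustration, a minimal PySpark sketch of the manual workaround described above (reading the headerless shortcut files, applying a schema, and writing a new table); the path, columns, and types are placeholders, not taken from a real model.json:

        # Hedged sketch: apply a schema to headerless CSVs exposed via an ADLS Gen2
        # shortcut, then persist them as a proper Lakehouse Delta table.
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StructType, StructField, StringType, DecimalType, TimestampType

        spark = SparkSession.builder.getOrCreate()

        schema = StructType([
            StructField("accountid", StringType()),
            StructField("name", StringType()),
            StructField("revenue", DecimalType(18, 2)),
            StructField("modifiedon", TimestampType()),
        ])

        df = (
            spark.read
            .schema(schema)                  # the CSVs have no header row, so supply the schema
            .option("header", "false")
            .csv("Files/synapse_link_shortcut/account/*.csv")   # placeholder shortcut path
        )

        df.write.mode("overwrite").format("delta").saveAsTable("bronze_account")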

  • @gpc39 · 2 months ago

    Very useful. One thing I would like to do is avoid having to add lakehouses to each notebook. Is there a way to do this within the notebook? Eventually given two Lakehouses, Bronze and Silver, I would want to merge from the Bronze table into the Silver Table. I have the merge statement working; it's just the adding of the lakehouses, which I can't see. I'm doing most of the programming with SQL, as I am less adept with PySpark, but am learning. Thanks Graham

  • @endjin · 2 months ago

    Hi Graham. Thanks for your comment! By "do this within the notebook" do you mean "attach a Lakehouse programmatically"? If so, take a look at this: community.fabric.microsoft.com/t5/General-Discussion/How-to-set-default-Lakehouse-in-the-notebook-programmatically/m-p/3732975

    By my understanding, a notebook needs to have at least one Lakehouse attached to it in order to run Spark SQL statements that read from Lakehouses. Once it has one Lakehouse, remember that you can reference other Lakehouses in your workspace by using two-part naming (SELECT * FROM <lakehouse_name>.<table_name>) without having to explicitly attach the other Lakehouses. And if you need to reference Lakehouses from other workspaces, you'll need to add a shortcut first and then use two-part naming. Ed
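
    For illustration, a small sketch of that two-part naming approach, including a merge from Bronze into Silver; the Lakehouse, table, and column names are placeholders:

        # Assumes the notebook has at least one Lakehouse attached; other Lakehouses
        # in the workspace are referenced with <lakehouse_name>.<table_name>.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Read from another Lakehouse in the same workspace without attaching it
        bronze_df = spark.sql("SELECT * FROM Bronze_Lakehouse.land_registry")

        # Merge Bronze rows into the Silver table using the same two-part naming
        spark.sql("""
            MERGE INTO Silver_Lakehouse.land_registry AS target
            USING Bronze_Lakehouse.land_registry AS source
            ON target.transaction_id = source.transaction_id
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)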

  • @jensenb9 · 2 months ago

    Great stuff. Is this E2E content hosted in a Git repo somewhere that we can access? Thanks!

  • @endjin · 2 months ago

    Not yet, but I believe that is the plan.

  • @EllovdGriek · a day ago

    @endjin I had the same question. Great content, and I wanted to try it myself with the guidance of your code. So, is it available somewhere?

  • @endjin · a day ago

    @EllovdGriek Unfortunately, not yet.

  • @Nalaka-Wanniarachchi · 2 months ago

    Nice round-up of best practices...

  • @ramonsuarez6105 · 2 months ago

    Excellent video, thanks. Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads? Why not use python files or spark job definitions instead of some of the notebooks that only have classes and methods ? How do you integrate these notebooks with testing in CD/CI?

  • @endjin · 2 months ago

    All great questions! I'll be covering most of your points in upcoming videos, but for now I'll try to answer them here.

    > Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads?

    Yes, we either use a single orchestration notebook or a pipeline to chain the stages together. On the notebook side, more recently we've been experimenting with the "new" mssparkutils.notebook.runMultiple() utility function to create our logical DAG for the pipeline. We've found this to be quite powerful so far. On the pipeline side, the simplest thing to do is to have multiple notebook activities chained together in the correct dependency tree.

    The benefit of the first method is that the same Spark session is used. This is particularly appealing in Azure Synapse Analytics, where Spark sessions take a while to provision (although this is less significant in Fabric, since sessions are provisioned much more quickly). The benefit of the pipeline approach is that it's nicer to visualise, but arguably harder to code review given its underlying structure.

    One thing we do in either option is make the process metadata-driven. So we'd have an input parameter object which captures the variable bits of configuration about how our pipeline should run, e.g. ingestToBronze = [true/false], ingestToBronzeConfig = {...}, processToSilver = [true/false], processToSilverConfig = {...}, .... This contains all the information we need to control the flow of the various permutations of processing we need. Stay tuned - there'll be a video on this later on in the series!

    > Why not use python files or spark job definitions instead of some of the notebooks that only have classes and methods?

    We could, and we sometimes do! But the reality is that managing custom Python libraries and SJDs is a step up the maturity ladder that not every org is ready to adopt. So this video was meant to provide some inspiration about a happy middle ground: still using notebooks, but structuring them in such a way that follows good development practices and would make it easier to migrate to custom libraries in the future should that be necessary.

    > How do you integrate these notebooks with testing in CD/CI?

    Generally we create a separate set of notebooks that serve as our tests. See endjin.com/blog/2021/05/how-to-test-azure-synapse-notebooks. Naturally, development of these tests isn't great within a notebook, and it's a bit cumbersome to take advantage of some of the popular testing frameworks out there (pytest/behave). But some tests are better than no tests. Then, to integrate into CI/CD, we'd wrap our test notebooks in a Data Factory pipeline and call that pipeline from ADO/GitHub: learn.microsoft.com/en-us/fabric/data-factory/pipeline-rest-api#run-on-demand-item-job.

    Sadly I can't cover absolutely everything in this series, so I hope this comment helps!
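
    For illustration, a hedged sketch of the orchestration-notebook option (a metadata-driven config object plus mssparkutils.notebook.runMultiple()); the notebook names and the config shape are made up, not endjin's actual implementation:

        # Hedged sketch: chain Bronze/Silver/Gold notebooks as a logical DAG within a
        # single Spark session, driven by a metadata object.
        from notebookutils import mssparkutils
        import json

        # Metadata that controls how the pipeline should run (shape is illustrative)
        config = {
            "ingestToBronzeConfig": {"source": "land_registry", "loadType": "incremental"},
            "processToSilverConfig": {"mergeKeys": ["transaction_id"]},
            "processToGoldConfig": {},
        }

        # Logical DAG of notebook runs; later stages depend on earlier ones
        dag = {
            "activities": [
                {
                    "name": "IngestToBronze",
                    "path": "Ingest To Bronze",            # hypothetical notebook name
                    "args": {"config": json.dumps(config["ingestToBronzeConfig"])},
                    "dependencies": [],
                },
                {
                    "name": "ProcessToSilver",
                    "path": "Process To Silver",
                    "args": {"config": json.dumps(config["processToSilverConfig"])},
                    "dependencies": ["IngestToBronze"],
                },
                {
                    "name": "ProcessToGold",
                    "path": "Process To Gold",
                    "args": {"config": json.dumps(config["processToGoldConfig"])},
                    "dependencies": ["ProcessToSilver"],
                },
            ]
        }

        mssparkutils.notebook.runMultiple(dag)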

  • @ramonsuarez6105 · 2 months ago

    @endjin Thanks a lot Ed. You guys do a great job with your videos and posts. Super helpful and inspiring :)

  • @endjin · 2 months ago

    Thank you for watching! If you enjoyed this episode, please hit like 👍, subscribe, and turn notifications on 🔔; it helps us more than you know. 🙏

  • @endjin · 2 months ago

    Thank you all for watching! If you could do me a favour, hit subscribe and turn notifications 🔔 on; it helps us more than you know.

  • @john_britto_10x · 2 months ago

    In your architecture diagram, you mention an HPA orchestrator pipeline. Is that a separate pipeline that needs to be created, in addition to the ones at the process layers below?

  • @endjin · 2 months ago

    Thanks for the comment! Yes, the orchestrator pipeline is a separate pipeline that includes the configuration that controls the flow of the pipeline. I'll be showing that in an upcoming video, so stay tuned!

  • @john_britto_10x · 2 months ago

    @endjin Awesome. Looking forward to it.

  • @john_britto_10x · 2 months ago

    endjin, have you published the 8th part of the series? If so, could you share the link, please?

  • @endjin · 2 months ago

    It's coming in the next week or so. We've just started a new series you might be interested in: kzread.info/dash/bejne/p5WGx7KBldTcgbg.html

  • @john_britto_10x · 2 months ago

    Thank you so much for replying quickly.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @sailing_life_ · 3 months ago

    Ohhh yea video #2! Nice job @enjin team!

  • @endjin · 3 months ago

    Glad you enjoyed it. 7 more parts to come. Don't forget the "d" in endjin!

  • @sailing_life_ · 3 months ago

    @endjin hah Sorry!! Endjin* :) :)

  • @iamjameson · 3 months ago

    Nice, looking forward to this series

  • @endjin · 3 months ago

    We're just editing part 2, which should be published on Thursday.

  • @rh.m6660 · 3 months ago

    Not bad, not bad at all. Could you point to a best-practice framework for this type of work, or is this just the way you like to work? I'm finding I can do what I want to the data, but I'm not sure if I'm being efficient and scalable. I like the idea of using helper notebooks; I believe custom Spark environments also support custom libraries now.

  • @endjin · 2 months ago

    This is the way we like to work, based on a background of software engineering within the company. We believe data solutions should get the same treatment as software solutions, and for that reason we're huge proponents of applying DataOps practices when building data products. In fact, we've presented at Big Data LDN and SQLBits on this very topic: endjin.com/news/sqlbits-2024-workshop-dataops-how-to-deliver-data-faster-and-better-with-microsoft-cloud

    Sadly we don't have a specific recording or blog we can share with you (yet!), but part of the point of this series is to expose you to some of our guidance in this area. So stay tuned! Check out "Part 8 - Good Notebook Development Practices", which is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

    And you're right that Environments in Fabric support custom libraries. This is another way you can package up your common code and more easily use it across workloads. But it increases the CI/CD complexity a little, so take that into account! Ed

  • @profanegardener · 4 months ago

    Great video.... I'm very new to Synapse and it was really helpful.

  • @carlnascnyc · 4 months ago

    This is indeed a great series, the best one on all of YouTube IMO. It's direct, to the point, and straightforward, with just the right amount of complexity. Can't wait for the next chapters (especially the gold zone design/loading). Please keep up the good work, and many thanks!!! Cheers!!

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @NKMAKKUVA · 4 months ago

    Thanks for the explanation

  • @GabrielSantos-qu6so · 4 months ago

    Why do you store the Bronze layer data in the Files folder and not populate tables instead? Or do you do both? Where I'm working currently, we ingest to a table, not to a file.

  • @endjin · 2 months ago

    The principle we like to take is "do as little as necessary to land your data into bronze". Any conversions/transformations/parsing you do when ingesting into bronze introduces scope for error, and ultimately modifies the "raw" state of the dataset. The reason "bronze" has also been known as "raw" in the past (i.e. before Databricks coined the Medallion Architecture) is because the data in that layer was an unadulterated form of your source dataset. The main benefits of this being: auditability, reprocessing capability, error recovery, and the flexibility it provides where different users can consume the same data in different ways.

    Naturally, you can't always store your Bronze data in the same format it's stored in the underlying source, especially if your source is a database, for example. So in this instance, some sort of conversion is necessary. What you convert it to is up to you: csv is popular, as is parquet. Ingesting into Delta tables goes a step further - you're applying a schema to the incoming data and enforcing a specific read/write protocol in the Bronze zone. I would argue this makes the Bronze data less accessible for people that need to consume it in the future (which may not be an issue for your use-case).

    If you have a valid reason to use Tables instead of Files (which I'm sure you do!), I would still recommend storing a copy of the raw file in the Files section in some format that suits you. Imagine if your Delta table gets corrupted for some reason, and the historical data is no longer available at source - what would you do then? Hope this helps! Ed
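
    For illustration, a small sketch of the "raw copy in Files, schema applied later" pattern described above; the paths and table name are placeholders, and the source is assumed to be reachable via a shortcut under Files:

        # Hedged sketch: land the source file untouched in the Bronze Files area,
        # then apply structure only when promoting the data to a Silver Delta table.
        from notebookutils import mssparkutils
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Bronze: copy the file as-is (no parsing, no schema) so the raw state is preserved
        source_path = "Files/landing_shortcut/land_registry/pp-monthly.csv"   # placeholder
        bronze_path = "Files/bronze/land_registry/2024/06/pp-monthly.csv"     # placeholder
        mssparkutils.fs.cp(source_path, bronze_path)

        # Silver: parse, apply a schema, and write a managed Delta table
        df = (
            spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(bronze_path)
        )
        df.write.mode("overwrite").format("delta").saveAsTable("silver_land_registry")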

  • @welcomesithole1501 · 4 months ago

    Amazing video. I can see that he knows what he is talking about, unlike other YouTubers who just copy and paste code from somewhere and fail to give proper explanations. I wish there was a series of this man teaching TorchSharp, especially converting his Python GPT to C#.

  • @akthar3 · 4 months ago

    worked beautifully - thank you !

  • @Tony-cc8ci · 4 months ago

    Hi, I have enjoyed the Microsoft Fabric playlist so far, very informative. This one was posted 2 months ago, so I just wanted to find out if you plan on continuing the series like you mentioned, with the helper notebooks?

  • @endjin · 4 months ago

    Yes, very much so! Ed (and Barry) are just heads down busy preparing for their workshops and talks at the SQLBits conference next week. As soon as that's done they'll have more capacity to finish off this series of videos.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @Power888w · 5 months ago

    Super video thank you!

  • @endjin · 5 months ago

    Thank you! Glad you enjoyed it! There are more coming!

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @richprice5434 · 5 months ago

    Brilliant video, perfect for what I need to do. Subscribed 🎉

  • @endjin · 5 months ago

    That's great to hear! There are a few more videos dropping soon.

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    Glad you're enjoying the videos

  • @endjin · 2 months ago

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    We're impressed you're making your way through the series!

  • @MucahitKatirci · 5 months ago

    Thanks

  • @endjin · 3 months ago

    There should be a new video dropping soon, seeing that you've binged everything so far!