Microsoft Fabric: Ingesting 5GB into a Bronze Lakehouse using Data Factory - Part 3

Science & Technology

Microsoft Fabric End to End Demo - Part 3 - Ingesting 5GB into a Bronze Lakehouse using Data Factory. In this video we'll see how we can quickly ingest ~5GB of data from an unauthenticated HTTP data source into #OneLake using #DataFactory in #MicrosoftFabric. We'll see the distinction between Tables and Files in a Fabric #Lakehouse and look at how we can preview data in the Lakehouse explorer.
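The pipeline in the video does all of this with a Copy data activity, but as a rough point of reference, the sketch below shows the same ingestion in plain Python, assuming it runs in a Fabric notebook with the Bronze lakehouse attached as the default lakehouse (so its Files area is mounted at /lakehouse/default/Files). The source URL and folder names are illustrative placeholders; take the real download link from the Price Paid Data page linked below.

```python
# Minimal sketch: stream the ~5GB Price Paid CSV over anonymous HTTP and land
# the raw bytes, unchanged, in the Bronze lakehouse's Files area.
# Assumes a Fabric notebook with the Bronze lakehouse attached as the default
# lakehouse; the URL and destination folder below are illustrative placeholders.
import os
import requests

SOURCE_URL = "https://example.gov.uk/price-paid-data/pp-complete.csv"  # placeholder URL
DEST_DIR = "/lakehouse/default/Files/landregistry/price_paid"
DEST_FILE = os.path.join(DEST_DIR, "pp-complete.csv")

os.makedirs(DEST_DIR, exist_ok=True)

# Stream in chunks so the ~5GB file never needs to fit in memory.
with requests.get(SOURCE_URL, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open(DEST_FILE, "wb") as destination:
        for chunk in response.iter_content(chunk_size=8 * 1024 * 1024):
            destination.write(chunk)

print(f"Landed {os.path.getsize(DEST_FILE):,} bytes at {DEST_FILE}")
```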
00:00 Intro
00:15 Dataset recap
01:25 Workspace and pipeline artifacts
01:57 Pipeline UI layout
02:21 Copy data activity options
03:07 Configure copy data activity source
05:00 Configure copy data activity destination
06:21 Add dynamic content for destination Lakehouse filepath
08:21 Copy Data activity additional settings
09:00 Manually trigger pipeline
09:21 Alternative parameterized pipeline
11:38 Reviewing pipeline run details
12:10 Default workspace artifacts
13:04 Viewing Lakehouse Files
13:46 Roundup and outro
Useful links:
📂 Price Paid Data (PPD) in text or CSV format: www.gov.uk/government/statist...
📖 Data Factory in Microsoft Fabric: learn.microsoft.com/en-us/fab...
📖 Copy Data activity in Microsoft Fabric: learn.microsoft.com/en-us/fab...
📖 Lakehouse in Microsoft Fabric: learn.microsoft.com/en-us/fab...
Series contents:
📺 Part 1 - Lakehouse & Medallion Architecture - • Microsoft Fabric: Lake...
📺 Part 2 - Planning and Architecting a Data Project - • Microsoft Fabric: Insp...
📺 Part 3 - Ingest Data - • Microsoft Fabric: Inge...
📺 Part 4 - Creating a shortcut to ADLS Gen2 in Fabric - • Microsoft Fabric: Crea...
📺 Part 5 - Navigating OneLake data locally - • Microsoft Fabric: Loca...
📺 Part 6 - Role of the Silver layer in the Medallion Architecture - • Microsoft Fabric: Role...
📺 Part 7 - Processing Bronze to Silver - • Microsoft Fabric: Proc...
📺 Part 8 - Good Notebook Development Practices - • Microsoft Fabric: Good...
If you want to learn more about Fabric, take a look at some of our other content:
🤖 [Course] Microsoft Fabric: from Descriptive to Predictive Analytics: • Microsoft Fabric: Mach...
👉 Perspectives on #MicrosoftFabric: endjin.com/what-we-think/talk...
👉 A Tour Around #MicrosoftFabric: endjin.com/what-we-think/talk...
👉 Introduction to #MicrosoftFabric: endjin.com/blog/2023/05/intro...
and find all the rest of our content here: endjin.com/blog/2023/05/micro...
#Microsoft #PowerBI #MicrosoftFabric #lakehouse #datalake #onelake #data #datafactory #datapipeline #ai #analytics #medallion #bronze #silver #gold #projectplanning #hmrc #dataingestion

Comments: 26

  • @endjin (2 months ago)

    Thank you for watching! If you enjoyed this episode, please hit like 👍, subscribe, and turn notifications on 🔔 - it helps us more than you know. 🙏

  • @oskarlindberg4869 (1 year ago)

    Love this series! How many more episodes can we expect? :-)

  • @endjin (1 year ago)

    There are around 10 planned. We're just recording the next episode. Should be out in the next week.

  • @MucahitKatirci (5 months ago)

    Thanks

  • @endjin (3 months ago)

    We're impressed you're making your way through the series!

  • @applicitaaccount1258 (2 months ago)

    Great series. What's the naming convention you are using in the full version of the solution? I noticed the LH is prefixed with HPA.

  • @endjin (2 months ago)

    The HPA prefix stands for "House Price Analytics", although the architecture diagram on the second video has slightly old names, as you've probably noticed. The full version uses _Demo_LR, where LR stands for "Land Registry". Ed

  • @applicitaaccount1258 (2 months ago)

    @endjin - Thanks for the clarification, Ben.

  • @Abdullahakbarshafi (5 months ago)

    Hi, my question is that while ingesting the data at 08:10, although the file name ends with .csv, the file format selected there is Binary. Would we get CSV files in the Data Lake, or binary files?

  • @endjin (2 months ago)

    "Binary" data type just means that the raw bytes of the underlying dataset are copied and nothing more (i.e. Data Factory won't try to parse the files as part of the copy). The binary contents of the output file will be whatever the binary contents of the input file were, and they'll be unchanged. "Binary" isn't a file type, it's more just a note to Data Factory that it shouldn't try to open/parse the file as part of the copy. The file extension is kept as .csv to ensure it can be read by the Fabric Lakehouse explorer and by downstream systems, because that's the type of my file. But I could perform a parquet -> parquet binary copy if my file was parquet, or json -> json if my file was json - it doesn't matter. Hope this helps!

  • @endjin (2 months ago)

    Part 8 - Good Notebook Development Practices - is now available: kzread.info/dash/bejne/h62HmLyOl8uTh8Y.html

  • @GabrielSantos-qu6so (4 months ago)

    Why do you store the data in the Bronze layer in the Files folder and not populate the Tables instead? Or do both? Where I'm working currently, we ingest to a table, not to a file.

  • @endjin (2 months ago)

    The principle we like to take is "do as little as necessary to land your data into bronze". Any conversions/transformations/parsing you do when ingesting into bronze introduces scope for error, and ultimately modifies the "raw" state of the dataset. The reason "bronze" has also been known as "raw" in the past (i.e. before Databricks coined the Medallion Architecture) is because the data in that layer was an unadulterated form of your source dataset. The main benefits of this are auditability, reprocessing capability, error recovery, and the flexibility it provides, where different users can consume the same data in different ways.

    Naturally, you can't always store your Bronze data in the same format it's stored in at the underlying source, especially if your source is a database, for example. So in that instance some sort of conversion is necessary. What you convert it to is up to you: CSV is popular, as is parquet. Ingesting into Delta tables goes a step further - you're applying a schema to the incoming data and enforcing a specific read/write protocol in the Bronze zone. I would argue this makes the Bronze data less accessible for people who need to consume it in the future (which may not be an issue for your use-case).

    If you have a valid reason to use Tables instead of Files (which I'm sure you do!), I would still recommend storing a copy of the raw file in the Files section in some format that suits you. Imagine if your Delta table gets corrupted for some reason, and the historical data is no longer available at source - what would you do then? Hope this helps! Ed
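Following on from the reply above, here is a minimal PySpark sketch of that recommendation, assuming a Fabric notebook with the Bronze lakehouse attached as the default lakehouse; the file path and table name are hypothetical. The untouched raw CSV stays in Files, and the Bronze Delta table is derived from it, so the raw copy always exists alongside the table.

```python
# Sketch: keep the raw file in the Files section AND expose it as a managed
# Delta table in the same Bronze lakehouse.
# Assumes a Fabric notebook with the Bronze lakehouse attached as default;
# the path and table name below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "Files/landregistry/price_paid/pp-complete.csv"  # untouched raw copy

# Read the raw CSV (the Price Paid extract ships without a header row) and
# write it out as a managed Delta table.
df = (
    spark.read
    .option("header", "false")
    .option("inferSchema", "true")
    .csv(raw_path)
)

df.write.format("delta").mode("overwrite").saveAsTable("bronze_price_paid")
```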

  • @moneshsutar7993 (1 year ago)

    Do you mean loading data for the Bronze layer into Bronze_Lakehouse, then transforming the data and loading it into a new Silver_Lakehouse? Do we need multiple lakehouses, or can we manage with a single lakehouse?

  • @endjin (1 year ago)

    While it's possible to share a single Lakehouse for "Bronze" and "Silver" zones, it's generally recommended to keep them separated in different lakehouses (even if Bronze initially doesn't utilize Tables, or Silver initially doesn't utilize Files). Benefits of splitting them apart include:
    - Security: you may want different permissions assigned to the different zones for different people.
    - Organization: particularly for Tables. If you were only using a single lakehouse and had the need to create managed tables in your Bronze zone, then it'd potentially be difficult to organize them alongside Silver tables. This is because lakehouses don't have any level of organization between the database level and table level (i.e. there's no notion of "schema" like in T-SQL), so any organization would need to be applied by table name convention, which is never ideal.

    However, if you have a simple use-case where neither of the above imposes any constraints, then merging into a single Lakehouse is fine. It totally depends!

  • @sbining (11 months ago)

    Hi. Great video - however I'm stumbling on the first block. I noticed you're using Basic authentication - and Anonymous doesn't work. Can you provide further instruction here please?

  • @sbining (11 months ago)

    Hi - I mean in the context of connecting to the LR sample file via HTTP.

  • @endjin (9 months ago)

    @sbining Anonymous should work. I was only using Basic Auth because Anonymous wasn't available at the time. What error are you hitting?

  • @robmays6982 (11 months ago)

    So clean and easy to follow. Can I get a copy of your slides please?

  • @endjin (3 months ago)

    I think the plan is to release some assets once the series is complete.

  • @santavo1 (2 months ago)

    @endjin When is this demo planned to finish? (These videos are very useful, tbh.)

  • @endjin (2 months ago)

    @santavo1 We're up to episode 8 - I think Ed has a few more planned... there was a bit of a break earlier this year because we had a number of workshops and talks to prepare for the SQLBits conference, but we're back working on these videos (except for people going on holiday). Check out the Titanic / Machine Learning with Fabric talks by Barry Smart - the first two have been published.

  • @endjin (2 months ago)

    @santavo1 Ed has ~12 parts planned... but that's not to say there won't be more!

  • @robmays6982 (11 months ago)

    Grrr, trying to find the URL you are using for the Price Paid Data?

  • @endjin (11 months ago)

    You can find the data here: www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
