Microsoft Fabric: Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations

Science & Technology

In part 2 of this course, Barry Smart, Director of Data and AI, walks through a demo showing how you can use Microsoft Fabric to set up a "data contract" that establishes minimum data quality standards for data being processed by a pipeline.
He deliberately passes bad data into the pipeline to show how the process can be set up to "fail elegantly" by dropping the bad rows and continuing with only the good rows. Finally, he uses the new Teams pipeline activity in Fabric to show how you can send a message to the data stewards responsible for the data set, informing them that validation has failed and itemising, in the body of the message, the specific rows that failed and the validation errors that were generated.
The demo uses the popular Titanic data set to show features of the Data Engineering experience in Fabric, including Notebooks, Pipelines and the Lakehouse. It uses the Great Expectations Python package to establish the data contract, and Microsoft's mssparkutils Python package to pass the exit value of the Notebook back to the Pipeline that triggered it.
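To make that pattern concrete, here is a minimal sketch of the kind of notebook code involved. It is not the code from the video: the file path, table name, column names and expectations are illustrative, and the exact API surface depends on your Great Expectations version (the classic pre-1.0 "dataset" API is shown).

    import great_expectations as ge
    import pandas as pd

    # Load the raw data from the Bronze area of the lake (path is illustrative).
    df = pd.read_csv("/lakehouse/default/Files/bronze/titanic.csv")

    # The "data contract": minimum quality standards the data must satisfy.
    # COMPLETE result format returns the index of every failing row.
    gdf = ge.from_pandas(df)
    checks = [
        gdf.expect_column_values_to_not_be_null("PassengerId", result_format="COMPLETE"),
        gdf.expect_column_values_to_be_between("Age", min_value=0, max_value=100, result_format="COMPLETE"),
        gdf.expect_column_values_to_be_in_set("Sex", ["male", "female"], result_format="COMPLETE"),
    ]

    # "Fail elegantly": drop the bad rows and continue with the good ones.
    bad_rows = set()
    for check in checks:
        if not check.success:
            bad_rows.update(check.result.get("unexpected_index_list", []))
    clean_df = df.drop(index=list(bad_rows))

    # Write the clean data to the lakehouse in Delta format ("spark" is the
    # session provided by the Fabric notebook runtime).
    spark.createDataFrame(clean_df).write.format("delta").mode("overwrite").saveAsTable("titanic_silver")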
Barry begins the video by explaining the architecture adopted in the demo, including the Medallion Architecture and DataOps practices. He explains how these patterns have been applied to create a data product that provides Diagnostic Analytics of the Titanic data set. This forms part of an end-to-end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.
00:12 Overview of the architecture
00:36 The focus for this video is processing data to Silver
00:55 The DataOps principles of data validation and alerting will be applied
02:19 Tour of the artefacts in the Microsoft Fabric workspace
02:56 Open the "Validation Location" notebook and view its contents
03:30 Inspect the reference data that is going to be validated by the notebook
05:14 Overview of the key stages in the notebook
05:39 Set up the notebook, using %run to establish utility functions
06:21 Set up a "data contract" using the Great Expectations package
07:45 Load the data from the Bronze area of the lake
08:18 Validate the data by applying the "data contract" to it
08:36 Remove any bad records to create a clean data set
09:04 Write the clean data to the lakehouse in Delta format
09:52 Exit the notebook using mssparkutils to pass back validation results (see the sketch after this list)
10:53 Lineage is used to discover the pipeline that triggers the notebook
11:01 Exploring the "Process to Silver" pipeline
11:35 An "If Condition" is configured to process the notebook exit value
11:56 A Teams pipeline activity is set up to notify users
12:51 Title and body of Teams message are populated with dynamic information
13:08 Word of caution about exposing sensitive information
13:28 What's in the next episode?
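The hand-off at 09:52 and 11:35 hinges on mssparkutils.notebook.exit, which surfaces a string to the pipeline that triggered the notebook. Below is a minimal sketch of that hand-off; the payload shape, values and activity name are assumptions for illustration, not the exact schema used in the video.

    import json
    from notebookutils import mssparkutils  # available in Fabric notebooks

    # These values would come from the validation step earlier in the notebook;
    # they are hard-coded here purely for illustration.
    bad_row_count = 3
    failures = ["Age: 3 values outside the range 0 to 100"]

    # End the notebook run, passing the validation outcome back to the pipeline.
    mssparkutils.notebook.exit(json.dumps({
        "success": bad_row_count == 0,
        "rowsDropped": bad_row_count,
        "failures": failures,
    }))

On the pipeline side, the If Condition can then branch on this value with an expression along the lines of @json(activity('Validate Location').output.result.exitValue).success, and the Teams activity can lift the failure details into the message title and body in the same way. Check the exact output path for your notebook activity; the expression here is indicative only.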
#microsoftfabric #dataengineering #greatexpectations #course #tutorial

Comments: 6

  • @endjin · 2 months ago

    Thank you for watching! If you enjoyed this episode, please hit like 👍, subscribe, and turn notifications on 🔔; it helps us more than you know. 🙏

  • @ramonsuarez6105 · 2 months ago

    Thanks a lot Barry. Great video. I couldn't find the repository for the series among your GitHub repos. Will there be one?

  • @vesperpiano · 2 months ago

    Thanks for the feedback. Glad to hear you are enjoying it. Yes, we are planning to release the code for this project on GitHub at some point soon.

  • @sailing_life_ · 3 months ago

    Ohhh yea video #2! Nice job @enjin team!

  • @endjin · 3 months ago

    Glad you enjoyed it. 7 more parts to come. Don't forget the "d" in endjin!

  • @sailing_life_ · 3 months ago

    @endjin hah Sorry!! Endjin* :) :)
