DP-700: Fabric Data Engineer Associate – Updated as per the exam update of 21 April 2025
Configure Microsoft Fabric workspace settings
1-4. Workspace settings
• To access workspace settings, go to Workspaces, click on the … next to the relevant
workspace, and go to Workspace settings.
o You can also access it in the workspace by clicking on “Workspace settings”.
• There are the following tabs:
o About – you can change:
▪ The workspace image,
▪ Description,
▪ Domain (a group of workspaces),
▪ Contact list – which users receive notifications about issues in the
workspace.
▪ Microsoft 365 and OneDrive – configure a Microsoft 365 Group whose SharePoint document library can be made available to workspace users.
• You will need to grant permissions via the Microsoft 365 Group membership.
o License mode: Pro, Premium per-user, Premium capacity, Embedded, Fabric
capacity and Trial.
o Azure connections: configure dataflow storage to use Azure Data Lake Gen2
storage and Azure Log Analytics.
o System storage
▪ Manage your semantic model storage (a workspace can contain up to 1,000 semantic models),
▪ View storage,
▪ Delete semantic models (reports and dashboards based on those models would no longer work),
o Git integration
▪ Connect workspace to an Azure Repo (see item 10).
o OneLake
o Other – remove this workspace
1-4. The Fabric admin portal
• To administer settings in the admin portal, you need one of the following roles:
o Global administrator,
o Power Platform administrator, or
o Fabric administrator.
• If you don’t have one of the roles, you will only see “Capacity settings” in the admin portal.
• You can get to the settings by going to Settings – Admin portal – Tenant settings. Settings
include:
o Microsoft Fabric
▪ Users can create Fabric items – can be managed at the tenant and
capacity levels.
o Help and support settings
▪ Users can try Microsoft Fabric paid features – a free 60-day trial.
o Workspace settings
▪ Create workspaces (enable),
▪ Use semantic models across workspaces – users would still need the
Build permission.
▪ Block users from reassigning personal workspaces (My Workspace) – to
stop users from changing the capacity assignment of My Workspace, as
the data might change region, which may be in breach of GDPR or other
data-residency rules.
▪ Define workspace retention period – by default, workspaces are retained for 7 days before being permanently deleted. This can be changed to up to 90 days.
o Information protection
▪ Allow users to apply sensitivity labels for content.
o Export and sharing settings
▪ Guest users can access Microsoft Fabric – guests would be accessing via
Microsoft Entra B2B (Business to Business).
▪ Guest users can browse and access Fabric content
▪ Guest users can work with shared semantic models in their own tenants
1. Configure Spark workspace settings
• To access the Spark workspace settings:
o In the relevant workspace, click on “Workspace Settings”,
o In the Data Engineering/Science section, click on Spark settings.
• Initially, you will be using a Starter pool.
o They start fairly quickly – typically within around 9-12 seconds.
o They also allow for libraries to be installed quickly.
▪ However, extra custom Spark properties or libraries from your capacity
settings or workspace may take longer.
• You are only billed for the time that they are active/running.
o You are billed for the time spent stopping them, but not for when they are idle, deallocating, or stopped.
o The following customizations do not affect currently active Spark sessions.
• Starter pools can be customized to some extent, but only if you have admin access to the
workspace. They have the following configuration:
o Node family: Memory optimized
o Node Size: medium
▪ This has 8 vCores and a memory of 64 GB per node.
o Autoscale: On.
▪ This allows for the scale up/down of compute resources based on actual
activity.
o Number of nodes: by default, between 1 and 10 for an F64 capacity or above.
▪ So they can scale up/down for your requirements.
▪ You can adjust both the minimum and the maximum – the maximum can be up to 16 for an F64 capacity.
• VCores are double the number of capacity units.
• The Node size is not configurable – it is always Medium.
SKU name Default max nodes Max number of nodes
F2 1 1
F4 1 1
F8 2 2
F16 3 4
F32 8 8
F64 or trial 10 16
F128 10 32
F256 10 64
F512 10 128
F1024 10 200
F2048 10 200
• Starter pool configuration
o Dynamic Allocation: On.
▪ This allows for more executors if needed.
▪ The default min and max is 1 to 9 – the number of nodes minus 1.
• These are the worker nodes. The other required node is the head
node.
• You can create a new (not Starter) pool by clicking on the drop-down and selecting "New
pool".
o You can choose a Node size of Small, Medium, Large, X-Large and XX-Large,
based on your compute requirements.
o If you change the "Default pool for workspace", it will then be used for that
workspace's Notebooks.
o Note: it can take 2-3 minutes for custom pools to start.
• To give users the choice of pool (as opposed to just the default):
o you will need to create an environment (an item in the workspace),
o assign the pool to that environment, and
o the end user can use that environment in the Notebook.
• You can:
o Allow users to "customize compute configurations for items" (in the Pool tab), and
o In the Environment tab, you can either select the default environment (which the end user can change), or select the Runtime Version.
o You can also create an Environment object, and then set the Spark compute and acceleration.
▪ In the acceleration tab, for Runtime 1.3 (Apache Spark 3.5), you can check
“Enable native execution engine”.
▪ It allows you to have faster job runs without additional cost, as it uses the
native capabilities of the underlying data sources, reducing the overhead
needed.
• You can also enable it:
o in Spark SQL, by using SET spark.native.enabled=TRUE; or
o in PySpark, by using spark.conf.set('spark.native.enabled', 'true')
▪ The native execution engine does not support:
• Delta Merge operations, checkpoint scans and deletion vectors,
• User-defined functions and the array_contains function,
• JSON, XML and CSV formats, or
• ANSI mode.
▪ Where the native execution engine is not supported, it will generally fall
back to the standard PySpark execution engine.
• For custom and starter pools, by default, sessions expire after 20 minutes.
o This can be changed in the Settings - Jobs tab.
o You can see the expiration for a current session by going to Home (or Run) –
Connect – View session information.
o You can stop a session in a Notebook by clicking on "Stop Session", or stopping
the session in the monitoring hub page.
• Custom pools have a default autopause duration of 2 minutes after the sessions expire, following which they will be deallocated.
• In the Jobs tab, you can also "reserve maximum cores for active Spark jobs".
o If OFF, it reserves the minimum number of cores for a particular job (which can
then be autoscaled), therefore allowing more jobs to be run at the same time.
o If ON, it reserves the maximum number of cores for a particular job, allowing for
higher job reliability.
o However, if cores are not available (with either the ON or OFF setting), the job is rejected.
• You can use a High Concurrency mode, which allows the same session to be used
simultaneously in multiple notebooks.
o Microsoft says that users get a 36 times faster session start compared to a standard Spark session.
o You can start such a session in a Notebook by going to Home (or Run) – Connect
– New high concurrency session.
▪ You can also see any currently running high concurrency sessions in that
menu.
▪ It needs to be switched on first, in the Workspace – Settings – High
concurrency. You can switch it on for notebooks, and for pipelines running
multiple notebooks.
▪ It may make isolating monitoring and debugging for a particular notebook
more difficult, and may degrade performance if there is already a lot of
compute in that session.
▪ All relevant notebooks will also have the same compute, library packages, and default lakehouse.
• You can see capacity use for Spark sessions in the Microsoft Fabric Capacity Metrics app.
2. Configure domain workspace settings
• Domains are used to organize workspaces. They can allow for better discovery, as the items within the workspaces are associated with the domain.
o This can be used to filter information in, for example, the OneLake data hub.
o You can also use some domain-level settings instead of tenant-level settings,
allowing you to manage the settings more easily.
o You can also create subdomains.
• Often they are organized into structures, such as:
o Functional structure, for a hierarchy, by roles and business functions
▪ Operations, IT, Strategic Development, Procurement
o Project or product structure, for different teams or product lines
▪ For example: Transportation, Retail, Technology, Healthcare
o By process, if your business is process-oriented and then by products or markets.
o By region – different continents, countries, or states/regions.
• You may also wish to consider any specific regulations or restrictions in your industry.
• You can assign workspaces by:
o Workspace name (if they have a suitable prefix),
o Workspace owner (e.g. all Healthcare in one domain, then all eye Healthcare in a
subdomain),
o By process, if your company uses that as the overriding factor.
• The following roles are involved with domain creation:
o Fabric admin. They have the following permissions:
▪ Create/edit domains,
▪ Assign domain admins and contributors, and
▪ Associate workspaces to domains.
▪ See, edit and delete all domains.
o Domain admin. They have the following permissions:
▪ See and edit their own domains (for which they are admin), including
creating subdomains,
▪ Update the domain description,
▪ Assign domain contributors,
▪ Associate workspaces to domains.
▪ Note: they cannot:
• delete the domain,
• change its name, or
• assign/remove other domain admins.
o Domain contributors. These are workspace admins
▪ These need to be Administrator of at least one workspace.
▪ They can assign/change such a workspace to a domain.
▪ They can do this from the workspace settings.
• Fabric admins can create a domain:
o In the Admin portal, go to Domains tab.
o Click on "Create new domain".
o Enter the name, and optionally add any domain admins.
o Click create.
• Once a domain has been created, Fabric/Domain admins can associate a workspace to
a domain by:
o Going to the workspace
o Click on "Workspace settings".
o In the General tab, go to the Domain section, and assign the workspace to the
domain.
• Fabric admins or Domain admins can create subdomains:
o Go to the relevant domain.
o Click "New subdomain" and provide a name.
o Note – there is no such thing as a subdomain admin. The admins for a subdomain
are the same admins for the main domain.
• At the time of creating a new workspace, you can assign that workspace to a domain.
• Fabric admins or Domain admins can assign domains or subdomains to a workspace:
o Go to the relevant domain.
o Click on "Assign workspaces".
o Search for the workspaces, either by workspace name, workspace admin or capacity.
▪ If any workspaces are already associated with another domain, you will
get a warning message.
• You can override this, but only if the tenant setting "Allow tenant and domain admins to override workspace assignments" is enabled.
▪ If searching for workspace admin: all workspaces that the users/security groups are admins of will be assigned to the domain, apart from "My workspace".
▪ If searching for workspace admin or capacity: this is not a continuous process – it only affects workspaces which the users/security groups are admins of at that time.
▪ If searching for workspace name or capacity: you can select multiple
workspaces/capacities at the same time.
• In the Admin portal – Domains tab, you can update the Domain settings, by:
o clicking on "Domain settings" in the relevant domain, or
o in the Domain tab, clicking on the … next to the Domain, and selecting Settings.
• The various tabs are:
o General settings. You can edit the name, and Domain/Fabric admins can edit the
description.
o Image. This is the image/color used in the OneLake data hub when that domain is
selected.
o Admins. Fabric admins can specify Domain admins.
o Contributors. Fabric/domain admins can assign Domain contributors – either:
▪ "The entire organization",
▪ "Specific users or security groups", or
▪ "Tenant and domain admins only".
o Default domain. Fabric/domain admins can assign users/security groups to this
domain – this includes assigning them the domain contributor role:
▪ The workspaces for which these users/security groups are Administrator
will be assigned this domain, unless it already has a domain assignment.
▪ Any new workspaces that are created by this user or a member of the
security group will also be assigned to the default domain.
o Delegated settings. The following admin settings can be assigned at the domain level, and can thereby override the tenant-level admin settings:
▪ Default sensitivity label.
• This can be assigned by Fabric/domain admins.
• Go into Information protection, and select the sensitivity label
from "Set a default label for this domain".
▪ Certification.
• Allow users/security groups to certify items (apart from Power BI
dashboards, which cannot be certified).
• You can select "All the users in domain" or include or exclude
"Specific security groups".
• You can optionally check "Override tenant admin selection", and
optionally provide a URL for relevant documentation.
• This requires certification to be enabled in the Tenant settings.
3. Configure OneLake workspace settings
• To access the Workspace OneLake settings:
o go to the relevant Workspace,
o Click on “Workspace Settings".
o Click on the "OneLake" tab.
• There, you can turn on/off the Cache for Shortcuts. When it is enabled:
o This is for external Google Cloud Storage (GCS), Amazon S3 and S3 compatible
shortcuts (in other words, in a different cloud).
▪ So not, for example, for Azure Data Lake Storage (ADLS) Gen2 storage
accounts or shortcuts pointing to other Fabric items.
o When such a shortcut reads files, the files are stored in a cache.
▪ Files bigger than 1 Gigabyte are not, however, cached.
o Future read requests are then read from the cache, rather than being accessed from the other cloud.
▪ However, if the external cloud's content has been updated, that is read
instead, and the cache updated.
o Cached files are removed by default 24 hours after their last read.
▪ This can be changed to be up to 28 days.
• There are also OneLake settings in the Tenant settings. If you go to Admin portal – Tenant
settings, you can select the following:
o Users can access data stored in OneLake with apps external to Fabric,
▪ such as OneLake File Explorer, Databricks, and Azure Data Lake Storage
APIs (application programming interface)
o Use short-lived user-delegated Shared Access Signature (SAS) tokens,
▪ allow applications to access OneLake through tokens, which can last up
to one hour.
o Authenticate with OneLake user-delegated SAS tokens, and
▪ allows applications to authenticate
o Users can sync data in OneLake with the OneLake File Explorer app
▪ Use the OneLake File Explorer app in Windows File Explorer.
4. Configure data workflow workspace settings
• Apache Airflow jobs were introduced in May 2024.
o They allow you to create, schedule and monitor complex jobs.
o Apache Airflow is the name of the open-source platform that is used to power
these jobs.
o It uses directed acyclic graphs (DAGs) to show data pipelines.
o The creation of these jobs is outside of the DP-700 exam.
• Apache Airflow jobs would need to be enabled first:
o In Admin Portal, click on the Tenant Settings tab.
o In the “Microsoft Fabric” section, open the “Users can create and use Apache Airflow Job” setting.
o You can then enable it for the entire organization or include/exclude specific
security groups.
• To configure data workflow workspace settings:
o Go to your workspace,
o Click on “Workspace settings”,
o Expand “Data Factory” and click on “Apache Airflow runtime settings” (this used
to be called “Data workflow settings”).
• In that tab, you can use the Starter Pool or click on “New Pool” to create a Custom pool.
o Starter Pools shut down after 20 minutes of inactivity. They should be used in
Developer environments.
o Custom Pools are always-on until and unless they are paused manually. They
could be used in Production environments.
• For Custom pools, you can select the Compute node size:
o The “Large” node size is for running complex or production DAGs (Directed Acyclic
Graphs). This is used in Starter Pools.
o The “Small” size is for simpler DAGs.
• In Custom Pools, you can also choose to:
o Enable autoscale (add nodes up/down as needed), and
o Allow extra nodes so the DAG can use these nodes concurrently – each node provides up to three workers.
o These options are not available in Starter Pools.
Implement lifecycle management in Fabric
5. Configure version control
• Git integration allows you to integrate your development processes into Fabric. It works
on a workspace level. Note:
o this is used through Azure DevOps Git Repos with the same tenant as the Fabric
tenant
o not through GitHub Repos, and not the on-premises version of Azure DevOps.
• It allows you to:
o Back up and version your work,
o Revert to previous stages if needed,
o Collaborate with others,
o Work alone using Git branches,
o Use Git source control tools.
• You can use it for:
o Data pipelines,
o Lakehouse,
o Notebooks,
o Paginated reports,
o Reports (except where the semantic model is in SSAS or Azure Analysis Services, or semantic models hosted in My Workspace),
o Semantic models (except live connections and models created from the Data
warehouse/lakehouse).
• You need:
o An active Azure account for the same user that uses the Fabric workspace,
o Access to an existing Azure DevOps repository,
o Power BI Premium license (for Power BI items only) or Fabric capacity (for all
Fabric items),
o In Settings – Admin Portal, have “Users can create Fabric items” enabled.
• To sign up to Azure Repos:
o Go to the Azure Portal (portal.azure.com).
o Search in the top bar for “DevOps”, and click on "Azure DevOps organizations".
o Click on “My Azure DevOps Organizations”.
o Click on “Create new organization”, and enter your organization details, including:
▪ The organization name,
▪ The location for hosting your projects
o Create a new project.
o Enter the project details, including:
▪ The Project Name, and
▪ The visibility.
o In Repos – Files, click “Initialize” to create an empty branch.
• To connect your workspace to an Azure repo:
o You will need Admin rights for the Workspace, and Read rights for the Git repo.
o Go to the relevant workspace.
o Click on “Workspace settings” (it might be in the … section),
o Go to Git integration. Select:
▪ Organization,
▪ Project,
▪ Git repository,
▪ Branch
• You can click “+New Branch” to create a new branch.
• You will need Admin rights for the workspace, and Write and
Create branch rights for the Git repo.
▪ Folder:
• Use an existing folder,
• Enter a name for a new folder, or
• Leave blank to use the root folder of the branch.
▪ You can only connect a workspace to one branch and one folder at a time.
• You can disconnect by going to Git integration and click “Disconnect workspace”.
o You will need Admin rights for the Workspace, but no rights are needed for the Git
repo.
• After you connect, if the workspace or Git branch is empty, content will be copied.
o It doesn’t sync data, only the schema.
• Once connected, the Workspace includes a “Git status” column showing its status.
• To commit changes to Git:
o You will need at least Contributor rights for the Workspace, relevant permissions
for the items and external dependencies, and Read and Contribute rights for the
Git repo.
o In the workspace, click on the “Source control” icon.
o Go to the Changes tab in the Source control pane.
▪ A list shows the changed items, with icons showing:
• new (green +),
• modified (brown non-equal sign),
• conflict (red x), or
• deleted (red -).
o Select all the items you want to “commit” (transfer).
▪ To commit all, check the top box.
o You can add a comment in the Commit Message box.
o You can then click "Commit".
▪ Afterwards, the status of the selected items would change from
“Uncommitted” to “Synced”.
▪ You can also see the time of the last commit in the footer.
o If you click "Update", then all changes in the branch will be updated.
• If changes have been made in the connected Git branch:
o You will see a notification.
o You can click on the “Source control” icon and go to the Updates tab to see a list
of all changed items.
o You can then click on “Update all”.
o You will need at least Contributor rights in the Workspace, relevant permissions
for the items and external dependencies, as well as Read rights for the Git repo.
6. Implement database projects
• A database project is automatically implemented when you add a warehouse to source control.
• It contains SQL code saved as Data Definition Language (DDL) .sql files for the schema (definition) of items in the database, including tables, views, functions and stored procedures (see the example after this list).
o It does not include data.
o It also does not include SQL security features, such as object-, column- or row-
level security, or dynamic data masking.
▪ They could be exported manually if these features were separately written
in code.
o Git integration also cannot easily handle ALTER TABLE operations – if you execute an ALTER TABLE, then the table will be dropped and then created again, which may result in data loss.
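• As an illustration, a .sql file in a database project might contain DDL like the following (a hypothetical sketch – the table, view and column names are invented):
-- Schema definition only – no data or SQL security settings are included
CREATE TABLE dbo.tblSales
(
    SaleID INT NOT NULL,
    SaleDate DATE NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL
);
GO
CREATE VIEW dbo.vwSalesByDate
AS
SELECT SaleDate, SUM(Amount) AS TotalAmount
FROM dbo.tblSales
GROUP BY SaleDate;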
• You can then manage a warehouse schema in Azure Data Studio and Visual Studio Code
using the SQL Database Projects extension.
• This database project can also be downloaded by:
o Going to the Home tab in the Warehouse, and
o Clicking on “Download SQL database project”.
• You can use this downloaded file to copy the schema to a newly created warehouse.
o When the warehouse is empty:
▪ on the main screen you can click on “SQL database project” in the “Start
developing” section, and
▪ Upload your SQL database project folder zipped file.
• You can also do this by creating a deployment pipeline, which means that changes in one
warehouse’s schema can be replicated in another warehouse’s schema.
7. Create and configure deployment pipelines
• You can create a deployment pipeline of between 2 and 10 stages (workspaces).
o The workspaces must reside on a Fabric capacity.
• They would generally be in the categories of:
o Development – create/design new content,
o Test – release to testers, and
o Production – share final version.
• To create a pipeline:
o Go to Workspaces, and click “Deployment pipelines” (near the bottom).
o Click “Create pipeline”.
o Enter a name and optional description in the “Create a deployment pipeline”
dialog box.
o Enter the pipeline stages.
▪ By default, there are 3 stages named Development, Test and Production.
• Pipeline admins who are also Workspace Admins can then assign workspaces.
o In the pipeline, you should then select the workspaces next to the pipeline stage
and click “Assign a workspace”.
o Note: a workspace can only be assigned to one pipeline.
• Pipeline admins who are or are not Workspace Admins can unassign a workspace from a
pipeline stage. To do this:
o Open the pipeline,
o In the relevant stage, click the … and select Unassign workspace, then click
Unassign.
• You can compare stages by looking at “Compare” next to a stage.
o The icon compares that stage with the next stage. It shows:
▪ Green – metadata for all items in both stages is the same,
▪ Orange – either some items have changed/updated, or the number of
items are different.
o Where it is orange, you can click on the Compare link to compare the items. It will
show:
▪ New – new item in the source stage,
▪ Different – exists in both stages, but has been changed since the last deployment. This includes if you have changed folder location.
▪ "Not in previous stage" – new item in the target stage.
o If something has been changed, then there is a “Review changes” button, which allows you to see the changes to the item, either side-by-side or inline.
• To deploy content, you can either:
o Click on “Deploy to X” – this deploys all content to the next stage,
o Click on “Show more”. You can then select specific items to be deployed.
o You can then add a note and click “Deploy”.
• To view the deployment history:
o Go to the pipeline,
o Click on “Deployment history”. It shows:
▪ Deployed to – stage,
▪ Date/time – at the end of the deployment,
▪ Deployed by – person or service principal,
▪ Items – the new, different and unchanged items, and the items which
failed to deploy.
▪ A note (if it exists)
▪ Deployment ID
▪ Deployment Status (Successful/Unsuccessful)
• When you deploy items from a previous stage to a later stage:
o if any content has the same name in both stages, the content will be overwritten
in the later stage.
o Content in the later stage that is not in the earlier stage will remain (will not be
deleted).
o Up to 300 items can be deployed in a single deployment.
o You can group items together in folders.
• If you are deploying (for example) a report and not the semantic model it relies on, then:
o If the semantic model exists in the later stage, it will connect to the later stage
model.
o If the semantic model doesn’t exist in the later stage, then the deployment will
fail.
o Note: you cannot download a .pbix file after deployment.
o You cannot deploy semantic models which have real-time data connectivity.
• Any user (free user) can view the list of pipelines.
• To create the pipeline, you would need the “pipeline admin” permission (as a minimum
permission) in a Pro, Premium Per User, or Premium Capacity.
• To give a user the "pipeline admin" permission, go to "Manage Access" and click on "Add
people or groups".
• It allows the user to:
o Create a pipeline,
o View/share/edit/delete the pipeline,
o Unassign a workspace from a stage,
o See workspaces that are assigned to the pipeline,
o View deployment history,
o View the list of items in a stage,
o Manage pipeline settings, and
o Add/remove a pipeline user.
• Pipeline admins who are also Workspace Contributors can also:
o Compare two stages
o View or set a rule
• Pipeline admins who are also Workspace Members can also:
o Deploy items to the next stage (if a workspace member/admin of both
workspaces)
• Pipeline admins who are also Workspace Admins can also:
o Assign a workspace to a stage
• Pipeline admins who are not Workspace Admins can:
o Unassign a workspace from a stage.
• You can also configure deployment rules.
• These are used to change certain settings when content is deployed to the target stage, as specified by the deployment rule. It is used for:
o Dataflow/semantic model/datamart – to specify the data sources or parameters
for the dataflow/semantic model/datamart,
o Paginated report – to specify the data source, and
o Notebook – the default lakehouse for a notebook.
• To do this:
o Next to the pipeline stage, click on the “Deployment rules” button.
▪ You can’t create it in the first stage – it’s for the target stage.
o Select the items to create the rule for.
o Click on “+Add rule” next to:
▪ “Data source rules” – select from a list, or select Other and manually enter
the new data source (of the same type).
▪ “Parameter rules” – select the parameter and enter the value.
▪ “Default lakehouse rules” – select the lakehouse to connect the notebook
to in the target stage.
• You can use the following data source types:
o SSAS or Azure Analysis Services,
o Azure Synapse,
o SQL Server or Azure SQL Server,
o Odata Feed,
o Oracle,
o SAP HANA (using import mode, not DirectQuery),
o SharePoint, and
o Teradata.
o But not dataflows.
• Note:
o if you delete an item, its rules are deleted as well, and cannot be restored.
o if you unassign and reassign a workspace, its rules are lost.
Configure security and governance
8. Implement workspace-level access controls
• Microsoft Fabric uses Power BI roles for Microsoft Fabric capabilities.
• The following is what each role can do for Microsoft Fabric lakehouses/warehouses and related items:
• Viewer
o View/read content of data pipelines and notebooks
o Execute/cancel execution of data pipelines (not notebooks)
o View execution output of data pipelines and notebooks
o Connect to and Read data/shortcuts through Lakehouse/Warehouse SQL
analytics endpoint
o Reshare items in a workspace, if you have Reshare permissions
• Contributor (as Viewer, plus)
o Read Lakehouse/Warehouse data/shortcuts through OneLake APIs and Spark.
o Read Lakehouse data through Lakehouse Explorer.
o Write/delete data pipelines and notebooks
o Execute/cancel execution of notebooks
o Schedule data-refresh via the on-premises gateway
o Modify gateway connection settings
• Member
o Add members, contributors and viewers
o Allow others to reshare items
• Admin
o Update and delete the workspace
o Add/remove admins, members, contributors and viewers.
• To give access to your workspace:
o In the workspace, click on Manage Access (it may be in the … section)
o Click “+Add people or groups”.
o Enter name/email and role, and click Add.
• You can view/modify access later if needed.
9. Implement item-level access controls
• You can also manage permissions for lakehouses by clicking on the … next to the
lakehouse (in the Workspace) and going to Manage permissions. You can assign the
following permissions:
o Read all of the lakehouse table data (not files) using the SQL endpoint,
o Read all of the underlying data files with Apache Spark,
o Build reports on the default semantic model.
• For data warehouse:
o Users the warehouse is shared with cannot see any of the data unless at least one additional permission is selected.
o the "Apache Spark" permission is expanded to "Read all of the data warehouse's
underlying OneLake files using Apache Spark, Pipelines, or other apps which
access the OneLake data directly".
• You can also share notebooks with the following permissions:
o Share (or Reshare) the notebook with others,
o Edit (or Write) all notebook cells, and
o Run (or Execute) all notebook cells.
10. Implement row-level, column-level, object-level, and folder-/file-level access
controls
Row-level control
• To implement row-level security in a Data Warehouse, you need to have:
o A security function. This needs to be created with RETURNS TABLE and WITH SCHEMABINDING, and return SELECT 1 where the appropriate conditions are true. For example:
CREATE FUNCTION dbo.securityfunction(@UserName AS VARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS securityfunction
WHERE @UserName = LEFT(USER_NAME(), LEN(@UserName))
   OR LEFT(USER_NAME(), 4) = 'Jane'
o A security policy, which references the security function.
CREATE SECURITY POLICY SecurityPolicy
ADD FILTER PREDICATE dbo.securityfunction(UserName)
ON dbo.tblActual
WITH (STATE = ON);
Column-level control
• You can implement column-level control by granting SELECT permissions to only some of
the columns in a table or query:
o GRANT SELECT ON NameOfTable(ColumnNames) TO [UserName]
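• For example (a hypothetical sketch – the table, columns and user are invented), to allow a user to read only the non-sensitive columns of a customer table:
-- Grant SELECT on specific columns only
GRANT SELECT ON dbo.tblCustomer (CustomerID, FirstName, LastName)
TO [jane@contoso.com];
-- SELECT CustomerID, FirstName, LastName FROM dbo.tblCustomer now works for this user,
-- but selecting any other column (such as CreditCardNumber) will fail.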
Object-level control
• There are three main commands to manage permissions:
o GRANT permission ON object TO user/security group – this permits access to an
object, as long as there isn’t a DENY
o DENY permission ON object TO user/security group – this prevents access to an
object. It overrules a GRANT.
▪ DENY does not apply to owners of objects or members of the sysadmin
fixed server role.
o REVOKE permission ON object TO user/security group – removes a GRANT or
DENY
• Examples of objects include Table, View, Function, Stored Procedure.
• Permissions include SELECT, INSERT, UPDATE, DELETE and EXECUTE (run stored
procedures)
• You can also add WITH GRANT OPTION to allow that user to GRANT the permission to
other user/security groups.
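• For example (a hypothetical sketch – the objects, user and security group are invented):
-- Permit reading a view, and allow the user to grant this permission on to others
GRANT SELECT ON OBJECT::dbo.vwSales TO [jane@contoso.com] WITH GRANT OPTION;
-- Permit a security group to run a stored procedure
GRANT EXECUTE ON OBJECT::dbo.uspLoadSales TO [DataEngineers];
-- Block a user from reading a table, overriding any GRANT
DENY SELECT ON OBJECT::dbo.tblSalary TO [temp.analyst@contoso.com];
-- Remove a previously issued GRANT or DENY
REVOKE SELECT ON OBJECT::dbo.tblSalary FROM [temp.analyst@contoso.com];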
• A list of the principals (user/login/group) can be found by running
o SELECT * FROM sys.database_principals
o The type column includes:
▪ E – External user from Microsoft Entra ID,
▪ S – SQL user
▪ X – External group from Microsoft Entra group or applications
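o For example, to list just the Microsoft Entra users, groups and applications:
-- name, type and type_desc are columns of sys.database_principals
SELECT name, type, type_desc
FROM sys.database_principals
WHERE type IN ('E', 'X');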
File-level control
• To configure access to folders in a Lakehouse, click on “Manage OneLake data access”.
o You need to have Contributor, Member or Administrator access for the workspace.
o Once enabled, it cannot be turned off.
• A new role “DefaultReader” will be created, which gives you access to read all files.
o You can delete this default role by checking it and clicking on “Delete”.
o You will also be granted all permissions: Read, Write, ReadAll, Reshare, Execute,
ViewOutput and ViewLogs.
• You can create a role by clicking on “New Role”.
o The role name must start with a letter, contain only letters or numbers (up to 128
characters), and be unique.
• You can then select “All folders” or “Selected folders”.
o Folders also include subfolders within the “Tables” folder. This means that you
can control access to the data within tables.
o Only check a folder if you want the user to have permissions for that folder.
• You can then click “Assign role” to add people, groups, or email addresses to the role.
o You can also access this by clicking on the … next to the role and click on “Assign”.
o You can assign users by clicking:
▪ inside the “Add people or groups” box, or
▪ the “Add users based on Lakehouse permissions” (these are called
“virtual members”).
• The Lakehouse permissions are: Read, Write, Reshare, Execute
and ReadAll.
• These permissions use an AND – if you select “Read, Execute”,
then this role will only apply to users who have both the Read and
Execute permissions (not OR).
11. Implement dynamic data masking
• Dynamic data masking limits how data is shown in data warehouses.
o For example, hiding parts of emails or credit cards to users who do not need this
information.
o Data will not be masked for those with Contributor rights or greater, or with
elevated permissions (such as the UNMASK or CONTROL permission) on the
Warehouse.
o The mask can also be bypassed if the warehouse has been shared with someone using the ReadAll permission, as they can query the underlying OneLake data.
• The permissions which can be used are:
o Standard CREATE TABLE and ALTER permissions to incorporate the dynamic data
masking when creating a table.
o The ALTER ANY MASK and ALTER permissions for adding, replacing or removing
masks for existing columns.
▪ You might want to assign the ALTER ANY MASK for security officers.
▪ The ALTER ANY MASK permission is also granted with the CONTROL
permission.
o The SELECT permission is used to view the data.
o The UNMASK permission allows the SELECT statements to be unmasked.
▪ This is also granted with the CONTROL permission.
• Permissions can be granted using:
o GRANT Name_of_permission ON Table_name TO User_name.
• They can also be REVOKEd.
• To mask, then use the following syntaxes:
o As part of a table definition: Column_Name Column_Type MASKED WITH
(FUNCTION = 'name_of_function') NULL
o As part of an alter column: ALTER TABLE Name_of_table ALTER COLUMN
Column_Name ADD MASKED WITH (FUNCTION = 'name_of_function')
• To unmask, use:
o ALTER TABLE Name_of_table ALTER COLUMN Column_Name DROP MASKED
• The functions are:
o default(). This will show:
▪ xxxx (or fewer x's) for strings, 0 for numbers, 1 January 1900 for date/time,
and ASCII value 0 for binary, varbinary or image data types.
o email(). This will show the first letter of the email address, followed by
XXX@XXX.com.
o random(first_value, last_value), to show a random number between first_value
and last_value.
▪ For example, random(1, 10)
o partial(prefix, [padding], suffix), which shows the first prefix and last suffix characters of the string, with the custom padding string in between.
▪ For example, partial(1, "XXXXXX", 2)
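• For example (a hypothetical sketch – the table, columns and user are invented), masks can be declared in the table definition and then selectively removed for trusted users:
CREATE TABLE dbo.tblContact
(
    ContactID INT NOT NULL,
    Email VARCHAR(100) MASKED WITH (FUNCTION = 'email()') NULL,
    CreditLimit INT MASKED WITH (FUNCTION = 'random(1, 100)') NULL,
    PhoneNumber VARCHAR(20) MASKED WITH (FUNCTION = 'partial(3, "XXXX", 2)') NULL,
    Country VARCHAR(50) MASKED WITH (FUNCTION = 'default()') NULL
);
-- Users with only SELECT see the masked values; this user sees the real data
GRANT UNMASK TO [jane@contoso.com];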
• Note – this technique only masks the data – it does not stop intelligent guesses.
o For example, let's say Country was masked with default(), and I was logged in as a user who sees masked data.
o SELECT * FROM tblTable would return xxxx in the Country column.
o However, while SELECT * FROM tblTable WHERE Country = 'United States' would
still return xxxx in the Country column, it's fairly obvious what the actual Country
is.
• Therefore, this should be used in conjunction with other security, such as object-, row-,
and column-level security.
12. Apply sensitivity labels to items
• You can apply sensitivity labels to Fabric items, if you have a Power BI Pro or Premium Per User license.
o These labels do not affect access to content from Fabric or the Power BI Service.
▪ They can affect access to content in Power BI Desktop.
o They are also applied to any content which is exported:
▪ Export to Excel/PowerPoint/PDF,
▪ Analyze in Excel from the Power BI Service,
▪ Creating a PivotTable from a semantic model, or
▪ Download to .pbix from the Power BI Service.
• This is for items including lakehouses, data warehouses, pipelines, semantic models, notebooks, eventhouses, eventstreams, reports, dashboards, and Dataflow Gen2.
• There are two ways to do this:
o It may be at the top of the screen, next to the item name
▪ This does not apply to Dataflow Gen2.
o It may be in the Settings:
▪ click … next to the item,
▪ go to Settings
▪ go to the Sensitivity label section or tab.
o You may also be able to select “Apply to downstream items”. If so, the label will
be applied to items created downstream.
▪ For example, a semantic model created from a Lakehouse, or a report
created from that semantic model, or a dashboard which has a tile which
is pinned from the report.
▪ If, for some reason, it cannot be so applied on a downstream item, the new
item can still be created.
• The default sensitivity labels are:
o Highly Confidential
o Confidential
o No Sensitivity
o None (this could be any sensitivity, but hasn't been assessed yet).
• Once set, they will appear in a column in the workspace, and at the top of the screen when
the item is opened.
• In Power BI Desktop, you can also add sensitivity labels in Home – Sensitivity.
o It is then visible in the status bar.
• To enable sensitivity labels to be set, they need to be enabled in the Tenant:
o Go to the Fabric admin portal, and then Tenant settings.
o In the Information protection section, enable “Allow users to apply sensitivity
labels for content”.
o To allow updated sensitivity labels to flow downstream, enable “Automatically apply sensitivity labels to downstream content”.
▪ You can also “allow workspace admins to override automatically applied
sensitivity labels”, and “Restrict content with protected labels from being
shared via link with everyone in your organization”.
• This list can be updated in the Microsoft Purview Information Protection portal (purview.microsoft.com).
o You can also create a policy in the Microsoft Purview compliance portal to apply a default label to new Power BI reports, dashboards and semantic models.
• If you create semantic models which are connected to data sources which have sensitivity labels, then the semantic models can inherit those labels. These data sources are:
o Excel files stored on OneDrive or SharePoint Online (not on premises behind a
gateway),
o Azure Synapse Analytics, and
o Azure SQL Database.
• It requires additional settings in Microsoft Purview, and in the Fabric tenant:
o Go to the Fabric admin portal, and then Tenant settings.
o In the Information protection section, enable “Apply sensitivity labels from data
sources to their data in Power BI”.
13. Endorse items
• There are three different types of endorsement: promotion, certification and master data.
o Promotion.
▪ This can be applied to any item apart from Power BI Dashboards.
▪ It advertises the item’s existence, and should only be used for data which
is of sufficient quality to share with others.
▪ Promotion can be done by any user.
o Certification.
▪ This can be applied to any item apart from Power BI Dashboards.
▪ It is used to say this item is quality-controlled by a user who understands
the data, and can be used in the entire tenant.
▪ Certification can only be done by authorized reviewers.
o Master data.
▪ This can only be applied to items which have data, such as lakehouses,
data warehouses, and semantic models.
▪ It is used for “single-source-of-truth” data. It is the central source for that
data.
▪ Setting “Master data” status can only be done by authorized reviewers.
• To endorse an item:
o Click on the … next to the item and go to Settings.
o Change the endorsement.
o For semantic models, you can also check “Make discoverable”. This makes it
shown in the OneLake data hub, and users who haven’t got access to it can
request access.
• Once endorsed, you will see the endorsement:
o in a column in the Workspace, and
o in the drop-down at the top of the screen, when the item is opened.
• In the Admin portal – Tenant settings, you can adjust the following:
o Enable “Certification”
▪ You can also “Specify URL for documentation page”. This is the page that
users will see when clicking “How do I get content certified?”
o Enable the “Master data” badge.
o “Make promoted content discoverable” and “Make certified content
discoverable”.
13a. Implement and use workspace logging
• Workspace monitoring can be used to query the logs stored in a new Eventhouse
database, which are to do with security, data collection and access.
• To enable it, you will need:
o The admin role in a Power BI Premium/Fabric capacity,
o The tenant setting “Workspace admins can turn on monitoring for their workspaces (preview)” enabled, and
o Log Analytics must not already be enabled for the workspace.
• To turn it on:
o Go to a workspace,
o Click on Workspace settings,
o Go to the Monitoring tab, and
o Click on “+ Eventhouse”.
• The eventhouse is read-only, and monitoring data is retained for 30 days.
• Once it has been set up, you can query the following:
o Metrics, stored in the EventhouseMetrics table,
▪ Monitor and troubleshoot ingestion performance and trends
EventhouseMetrics
| order by MetricSumValue
▪ Look at materialized views and exports.
o Command logs, stored in the EventhouseCommandLogs table, and Query logs, stored in the EventhouseQueryLogs table, to:
▪ Look at command/query performance and trends,
▪ Show commands/queries which use a lot of resources
EventhouseCommandLogs
| summarize CPUTimeMs = sum(CpuTimeMs) by
EventhouseCommandType
| order by CPUTimeMs
| limit 10
EventhouseQueryLogs
| order by CpuTimeMs
| limit 10
▪ Show users and applications which run the most commands/queries.
EventhouseCommandLogs // or EventhouseQueryLogs
| summarize CPUTimeMs = sum(CpuTimeMs)
  by UPN = tostring(Identity.claims.upn),
     APPID = tostring(Identity.claims.appid)
o Data operation logs, stored in the EventhouseDataOperationLogs table,
▪ Analyze data operations performance and trends,
▪ Look at data operations using a lot of CPU
EventhouseDataOperationLogs
| order by CpuTimeMs
▪ Identify data operators used on a specific table.
▪ Data operations include:
• batch ingestion,
• update policy,
• materialized view update,
• Sealing an extent of streaming data.
o Ingestion results logs
▪ Monitor the number of successful ingestions, and
▪ Monitor and troubleshoot failed ingestions.
EventhouseIngestionResultsLogs
| project OperationName, WorkspaceName, Status,
ResultCode, ShouldRetry
• You can also use the “Fabric Workspace Monitoring Dashboard.json”. To use it:
o Download it from https://github.com/microsoft/fabric-
toolbox/tree/main/monitoring/workspace-monitoring-dashboards
o Create a new Real-Time Dashboard.
o Go to Manage – Replace from file.
o Select the downloaded file.
o Go to Manage – Data sources.
o Click on “Add+” – “Eventhouse / KQL Database”.
o Select the “Monitoring KQL database” in the OneLake catalog dialog box.
o Click Connect, then Add, then Close.
o Click on the Workspace option (at the top-left hand corner of the Dashboard) and
select the “Monitoring KQL database”.
• You can also use the “Fabric Workspace Monitoring.pbit” Power BI template in Power BI
Desktop. To use it:
o Download it from https://github.com/microsoft/fabric-
toolbox/tree/main/monitoring/workspace-monitoring-dashboards
o Open the “Monitoring KQL database”, and copy the “Query URI”
o Open the template in Power BI Desktop
o Paste the “Query URI”, change any other settings, and click Load.
o Save and Publish the report into the Power BI Service.
o Next to the “Fabric Workspace Monitoring” semantic model, click on the … and
go to Settings.
o In the Data source credentials, click on “Edit credentials” and sign in.
o Refresh the data and open the report.
Orchestrate processes
14. Choose between a pipeline and a notebook
• For copying high quantities of data with a low-code/no-code approach, either as a one-off or on a schedule, use a Pipeline Copy Activity.
• You can use these destinations:
o Various Azure objects,
o Fabric: Data Warehouse, KQL Database, Lakehouse,
o Database: MongoDB, Oracle, SQL database and SQL Server,
o Files: Amazon S3, File system, Google Cloud Storage (GCS),
o Power Platform: Dataverse,
o Services and apps: Dynamics CRM, Salesforce, Snowflake, and
o Generic protocol: ODBC, REST.
• You can also use these sources:
o Database: Amazon RDS for SQL Server, Amazon Redshift, DB2, Google BigQuery,
MariaDB, MySQL, PostgreSQL, SAP HANA, Vertica,
o Services and apps: Dynamics AX, Microsoft 365, SharePoint Online list,
o Files: FTP, Oracle Cloud Storage,
o Generic protocol: HTTP, OData, and
o ServiceNow.
• If you need to transform data using a no-code or low-code interface, with hundreds of
transformation options and sources, then you can use Dataflow Gen 2…
o but only if the destinations are suitable: Lakehouse, Warehouse, SQL database,
Azure SQL database, and Azure Data explorer (Kusto)
• If you need to transform data, then Spark code in notebooks allows you to process large
amounts of data in parallel, and then write into Delta tables in lakehouses, where it can
be used elsewhere.
• If you need to do a combination, then you can create notebooks, and add them into a
pipeline.
• Comparison of the Pipeline copy activity, Dataflow Gen 2 and notebooks:
o Use case, including data ingestion:
▪ Pipeline copy activity – data lake and data warehouse migration, lightweight transformation,
▪ Dataflow Gen 2 – data transformation, data wrangling,
▪ Notebook – data profiling and data processing.
o Primary developer persona:
▪ Pipeline copy activity – data engineer, data integrator,
▪ Dataflow Gen 2 – data integrator, business analyst,
▪ Notebook – data engineer, data scientist, developer.
o Primary developer skill set:
▪ Pipeline copy activity – ETL, SQL, JSON,
▪ Dataflow Gen 2 – ETL, M, SQL,
▪ Notebook – Spark (Scala, Python, Spark SQL, R).
o Development interface:
▪ Pipeline copy activity – graphical (wizard, canvas),
▪ Dataflow Gen 2 – no code/low code (Power Query),
▪ Notebook – code using notebooks or Spark job definitions.
o Sources:
▪ Pipeline copy activity – 30+ connectors,
▪ Dataflow Gen 2 – 150+ connectors,
▪ Notebook – 100s of Spark libraries.
o Destinations:
▪ Pipeline copy activity – 18+ connectors,
▪ Dataflow Gen 2 – few,
▪ Notebook – 100s of Spark libraries.
o Transformation complexity:
▪ Pipeline copy activity – low (lightweight type conversion, column mapping, merge/split files, flatten hierarchy),
▪ Dataflow Gen 2 – low to high (300+ transformation functions),
▪ Notebook – low to high (support for native Spark and open-source libraries).
15. Design and implement schedules and event-based triggers
Data pipeline
• To Schedule a data pipeline, either:
o Go to Home – Schedule or Run – Schedule in the data pipeline, or
o Go to the workspace, click on the … next to the Data Pipeline, and click on
Schedule.
• In the Scheduled run tab:
o Turn "Scheduled run" to On.
o Choose the Repeat schedule:
▪ By the minute, Hourly, Daily or Weekly.
▪ If you choose "Daily" or "Weekly", then you can select Times for it to run.
▪ If you choose "Weekly", then you can select which day(s) of the week it should run.
• So "Weekly" can be used to run each weekday.
▪ If you choose "Hourly" or "By the minute", you can select the number of
hours/minutes between the runs.
▪ You can select a start and end date and a timezone.
o Click Apply.
Dataflow
• To schedule a dataflow:
o In the workspace, go to … next to the dataflow, and go to Schedule.
o For the scheduling options, see above in data pipelines (except that you cannot schedule By the minute or Hourly – just Daily or Weekly, with times at half-hour intervals).
Notebook
• To schedule a notebook:
o In the notebook, go to Run - Schedule.
o For the scheduling options, see above in data pipelines.
Events from Azure Blob storage
• An event-based trigger, part of the Data Factory, runs when something happens to files or
folders in Azure Blob storage.
• To create an event-based trigger in a pipeline:
o click on Home or Run – Add trigger. This opens the “Set alert” panel.
o In “Source”, click on “Select events”. This opens the “Connect data source” dialog
box.
o In the “Configure” section, you can select the Storage account:
▪ You can connect to an existing, or select a connected, Azure Blob Storage
Account.
▪ You then select a subscription and an Azure Blob Storage account.
▪ Then click “Next”.
o In the "Configure alert" section, you can select the Event type from
Microsoft.Storage:
▪ Blob-Created, -Deleted, -Renamed and -TierChanged,
▪ DirectoryCreated, -Deleted and –Renamed,
▪ BlobInventoryPolicyCompleted,
▪ AsyncOperationInitiated, and
▪ LifecyclePolicyCompleted.
o You can also filter the events, so that not all events would run the trigger.
▪ The fields are: Source, subject, type, time, id, data and specversion.
▪ The operators include in, greater/less than and equals to, contains (and
the opposite of these operators).
o You can click “Next” then “Connect”. It then creates an eventstream.
• You can then create an action of “Run a Fabric item” – either a pipeline or a notebook.
• You can then save this Activator item in a workspace and name the item (by clicking on
“Create a new item”).
Events-based trigger using Activator
• Activator is used in Microsoft Fabric for taking an action (such as alerting people or starting Power Automate workflows) when needed due to changing data.
• In a workspace, create a new Activator.
• You can “try sample data”:
o Bicycle rentals,
o Stock market, and
o Yellow taxi
• You can “Get data” from:
o Microsoft sources:
▪ Various Azure sources, including Cosmos DB (CDC – Change Data
Capture), PostgreSQL db (CDC), Event Hubs, IoT Hub, Service Bus, Azure
SQL Database (CDC) and Azure SQL Managed Instance (CDC).
▪ SQL Server on a Virtual Machine (CDC).
o MySQL DB (CDC),
o Azure Blob Storage events,
o Fabric events:
▪ Workspace Items events: Item creation/deletion/update/read was
successful/failed,
▪ Job events: Item Job Created, Status Changed, Succeeded and Failed,
and
▪ OneLake events: file/folder creation/deletion/renamed.
• You can create a new rule by clicking on “+New rule”, and then set the attribute (what you
are monitoring).
• Then you can set the condition:
o Numeric change/state,
▪ Change:
• Increases/decrease above/to or above/by
• Enters/exits range,
• Changes from/to/by,
▪ State: =, <>, >, >=, <, <=, and is within/outside range
o Text changes from/to, or state (=, <>, [does not] begin with, contain, end with, and conditions on the length of the text),
o Logical change (becomes true/false)/state (=, <>),
o Common change, and
o Heartbeat – On every value or no presence of data over a certain period of time
(e.g. from 10 seconds to 24 hours).
• You can also click “Add filter”.
• You may also be able to click “Add summarization” for:
o Average/Minimum/Maximum over time,
o Count/Total.
• You can then choose the action to take:
o Send an email,
o Add a Teams message,
o Run a Fabric notebook or pipeline, or
o New custom action.
• Note: rules are by default Stopped. You will need to “Start” or “Save and start” to activate
it.
o When it is started, you will see “Running” in the title of the rule card. You will also
see a different icon.
o To halt the rule, click “Stop”.
• If data is late-arriving then:
o Power BI rules are evaluated every hour, and evaluate all events that arrive up to
one hour after the event occurs.
o For other items such as eventstreams, the “Advanced settings” may have a “Wait
time for late-arriving events”.
▪ You can set it for 1, 2 or 5 minutes.
▪ This configures the balance between waiting for any potential late-arriving
events, or processing data more swiftly, though it may be incomplete.
• You can also create alerts from:
o a Real-Time Dashboard
o a KQL queryset
▪ Run a query that returns a visualization.
▪ After the query runs, click “Set Alert”.
• You can set the time frequency, from 1 minute to 24 hours. The
default is 5 minutes.
• The Condition is “On each event”.
• Choose an action from: Send me an email, Message me in Teams,
and Run a Fabric item.
• You can then create the Workspace and Item location.
o an Eventstream
▪ In the “Add destination”, select “Activator”.
▪ You can then customize it.
o a Power BI report.
▪ You can also create alerts from Power BI Dashboards, but they do not use
Activator, and so don’t require a Fabric license.
▪ You can click on the … and go to Add alert.
▪ The “Set alert” pane will open, with the Monitor – Visual set.
▪ You can then assign the condition: a measure, an operator, and a value.
▪ You can then set an action: an email or Teams message.
16. Implement orchestration patterns with notebooks and pipelines, including
parameters and dynamic expressions
• You can call on notebooks from pipelines with a parameter.
o This allows you to pass information into the notebook.
o With a different input, you can get a different output, meaning that you don’t have
to recreate the notebook to cope with different inputs.
• In the notebook, click on the … in the first cell and select “Toggle parameter cell”.
o The word “Parameters” now appears in the bottom-right hand corner of the cell.
• You can then add a Parameter and assign it a default value.
• Then (for example) write some code to read a table, filter it, and then write to a new table.
• In the Pipeline, add a Notebook activity.
• In the activity Settings:
o Add the relevant workspace and notebook,
o In “Base parameters”, add the parameter name, type (String, Int, Float or Bool)
and the parameter value.
• Then run the Pipeline, and query the new table using the SQL analytics endpoint.
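• As an illustration, a minimal sketch of what the parameterized notebook might contain (the
table and column names here are hypothetical):
# Parameter cell (marked with "Toggle parameter cell")
filter_year = "2024"   # default value; overridden by the pipeline's Base parameters

# Read a table, filter it using the parameter, and write the result to a new table
from pyspark.sql.functions import col
df = spark.table("FactSales")
df_filtered = df.where(col("OrderYear") == filter_year)
df_filtered.write.mode("overwrite").format("delta").saveAsTable("FactSales_Filtered")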
• You can also change the reference to the notebook from a fixed string literal to a variable
name.
o In the Pipeline – Notebook activity – Settings – Notebook, change the name to
“Add dynamic content”.
o In the “Add dynamic content” pane, go to the Parameters type, click the + button,
and then add a Parameter (Name, Type and Default value).
▪ You cannot use a space in the Name – you can only use letters, numbers,
and the underscore character.
o Click on the parameter name to add it to the “Add dynamic content” – it will be in
the format of: @pipeline().parameters.ParameterName .
o Click OK to confirm the change to the “Notebook ID”.
• When you run this, the “Pipeline Run” pane will ask for the name of the Notebook, and
gives the default name.
• In the dynamic expression, you can also use System Variables.
o @pipeline().Pipeline – the ID of the pipeline
o @pipeline().PipelineName
o @pipeline().TriggerID, .TriggerName and .TriggerTime – the ID and Name of the
trigger which started the pipeline, together with the actual time the trigger was
run. The trigger could be:
▪ a scheduled trigger,
▪ an event-based trigger, or
▪ a manual trigger (will return “Manual”)
o @pipeline()?.TriggeredByPipelineRunId and
@pipeline()?.TriggeredByPipelineName – the pipeline which triggers this pipeline.
• When the pipeline has a storage event trigger attached, then you can use:
o @pipeline()?.TriggerEvent?.FileName and .FolderPath – the name of the file/folder
for the file whose event triggered the pipeline run.
• When the pipeline has an event trigger attached, then you can use:
o @pipeline()?.TriggerEvent?.Source, .Subject and .Type – the trigger event details.
• You can also add variables in the variable tab.
• You can also use functions in dynamic expressions.
o Logical functions: if, equals, and/not/or,
greater/less/greaterOrEquals/lessOrEquals
o String functions: concat, endsWith/startsWith, indexOf/lastIndexOf, replace,
substring, toLower/toUpper, trim
o Conversion functions, including: coalesce, float, int, string
o Mathematical functions, including: add, sub, mul, div, min/max, mod
o Date functions, including:
▪ addDays/Hours/Minutes/Seconds,
▪ addToTime, subtractFromTime,
▪ dayOfMonth/Week/Year,
▪ formatDateTime,
▪ startOfMonth/Day/Hour,
▪ utcNow, getFutureTime/getPastTime,
▪ convertFromUtc/ToUtc, convertTimeZone
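• For example, combining the functions above, a dynamic expression such as
@concat('Incremental/', formatDateTime(utcNow(), 'yyyyMMdd'), '.txt') builds a
date-stamped file path (the folder name and extension here are only illustrative).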
Design and implement loading patterns
17. Design and implement full and incremental data loads
• To create an incremental data load from Data Warehouse to Lakehouse, you need:
o A column (in both databases) which shows when the row was created (a
“watermark”).
o A table in the source database which stores the last watermark that was copied
(a “watermark table”).
▪ It should have a single row which has a date before any of the
watermarked dates.
o A pipeline which contains:
▪ Two lookup activities, one to retrieve the last watermark value from the
“watermark table”, and another to retrieve the current maximum
“watermark” value.
▪ A copy activity which copies the data between the two databases for the
relevant rows (after the last watermark up to and including the current
watermark).
▪ A stored procedure activity which updates the watermark table with the
current watermark.
• Step 1 – Create tables and stored procedure in a Warehouse (called
“IncrementalWarehouse”), and a blank Lakehouse.
• Step 2 – Create a pipeline and add a lookup activity from the “watermark table”.
o In a pipeline, go to:
▪ the Home menu and click Lookup, or
▪ the Activities menu, click on … and select Lookup.
o Name this lookup activity “LookuptblWatermark”.
o In the Settings tab:
▪ In Connection, select the IncrementalWarehouse.
▪ Use the default “Table” for “Use query”.
▪ In Table, select “dbo.tblWaterMark”.
▪ Use the default “First row only”.
• Step 3 – Add a lookup activity to get the latest watermark
o Go to Home – Lookup
o Name this activity “MaximumWatermark”.
o In the Settings tab:
▪ In Connection, select the IncrementalWarehouse.
▪ Change “Use query” to “Query”
▪ Enter: SELECT MAX(OrderDate) AS LatestWatermark FROM FactImport
▪ Use the default “First row only”.
• Step 4 – Add the copy activity
o Go to Home – Copy data – Add to canvas.
o Connect the two lookup activities to the copy table using “On Success”.
o In the Source tab:
▪ In Connection, select IncrementalWarehouse.
▪ Change “Use query” to “Query”.
▪ In Query, write the following query:
• SELECT * FROM FactImport WHERE OrderDate >
'@{activity('LookuptblWatermark').output.firstRow.Watermark}'
AND OrderDate <=
'@{activity('MaximumWatermark').output.firstRow.LatestWatermark}'
• You can click inside the Query and then click on “Add dynamic
content”
• Enter SELECT * FROM FactImport WHERE OrderDate >
• Create an empty first row in the query, and click on
“LookuptblWatermark first row”. This inserts
@activity('LookuptblWatermark').output.firstRow
• Move the insertion to after the rest of the query.
• Enter “.Watermark”,
• Add curly brackets after the @ sign and at the end, and quote
marks before and afterwards.
• Then enter “AND OrderDate <= ”.
• Do the same with “MaximumWatermark first row”.
• Enter “.LatestWatermark”.
o In the Destination tab:
▪ Change the Connection to a Lakehouse.
▪ Change the “Root folder” to “Files”.
▪ Change the File path to Incremental/@concat(pipeline().RunId, '.txt')
▪ The File format should be “DelimitedText”.
• Step 5 – Add the stored procedure to the pipeline
o Go to Activities – Stored procedure.
o Connect it to the Copy data activity using “On Success”.
o In the Stored procedure activity:
▪ Go to Settings.
▪ Change Connection to IncrementalWarehouse.
▪ Next to “Stored procedure name”, click on Refresh, then click on
“[dbo].[updateWatermark]”.
▪ Expand “Stored procedure parameters”
▪ Click on “+ New”.
▪ For this new parameter:
• The Name is Watermark (as per the Stored Procedure).
• The Type is DateTime (as per the Stored Procedure),
• The Value is
@{activity('MaximumWatermark').output.firstRow.LatestWatermark}
• Note – there are no quote marks around this value, as it is going to
be treated as a DateTime.
• Step 6 – query the end result:
o In a Lakehouse, create a dataframe which brings the files together.
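▪ For example, a minimal sketch (assuming the copy activity wrote delimited text
files with headers to the Files/Incremental folder, as configured above):
df = (spark.read
      .format("csv")
      .option("header", "true")    # match the copy activity's delimited text settings
      .load("Files/Incremental/*.txt"))
display(df)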
• You can also do incremental refresh in Dataflow Gen2.
o The data source must allow for query folding.
o There needs to be a Date or DateTime column.
o The data destination must allow incremental refresh, namely:
▪ Fabric Warehouse
▪ Azure SQL Database
▪ Azure Synapse Analytics.
• To implement the incremental refresh:
o In Dataflow Gen2, right-click the query and click on “Incremental refresh”.
o Select the following from the Incremental refresh dialog box:
▪ Check “Enable incremental refresh”.
▪ Select the DateTime column to filter the query by.
▪ Select the time period to filter (x days/weeks/months/quarter or years).
▪ Add a Bucket size. The Bucket holds the data since the last refresh – up to
50 buckets.
• A smaller bucket size means less data per iteration, but more iterations.
• A larger bucket size means more data per iteration, but fewer iterations.
• For example – 4 years and bucket size of month.
▪ Select the column to say that the refresh shouldn’t happen if the
maximum value in this column has not changed.
• While this can be the same as the first DateTime column, it may
be different – for example, a Created Date for filtering, and a
Modified Date to mark changes.
▪ Optionally, you can say to ignore incomplete periods (the period is shown
in the Bucket size).
▪ Optionally, in the Advanced section, you can check “Require incremental
refresh query to fully fold”.
• If checked, this pushes the query down to the source system,
which may improve performance.
• In the Dataflow Gen2, the update method for new data needs to be “Replace”.
18. Prepare data for loading into a dimensional model
• Slowly Changing Dimensions
o SCD Type 0
▪ attributes do not change, or are "Original" values.
o SCD Type 1
▪ attributes always change to the last version.
▪ Historical data is not traced.
o SCD Type 2
▪ changes to the dimension result in:
• the original and latest information being retained in additional
rows.
• Additional Start Date and End Date columns may be added to
show when the data is valid.
• An additional Current column may be added to show the latest
data.
▪ Both the dimension and fact table may have an additional surrogate key,
which links to the correct data in the dimension.
o Ideally, it will be implemented as close to the source as possible.
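o As an illustration only (not a prescribed pattern), an SCD Type 2 update could be
sketched in PySpark with a Delta merge; the table and column names below are
hypothetical:
from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

updates = spark.table("staging_Customer")        # incoming changed/new rows
dim = DeltaTable.forName(spark, "DimCustomer")   # existing SCD Type 2 dimension

# 1. Close off the current row where a tracked attribute has changed
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.CustomerID = s.CustomerID AND t.IsCurrent = true")
    .whenMatchedUpdate(condition="t.City <> s.City",
                       set={"IsCurrent": "false", "EndDate": "current_date()"})
    .execute())

# 2. Append the new versions as the current rows
#    (in practice, filter `updates` down to new or changed rows first,
#     and generate a new surrogate key for each inserted row)
(updates
    .withColumn("StartDate", current_date())
    .withColumn("EndDate", lit(None).cast("date"))
    .withColumn("IsCurrent", lit(True))
    .write.format("delta").mode("append").saveAsTable("DimCustomer"))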
19. Design and implement a loading pattern for streaming data
• See topic 29.
Ingest and transform batch data
20. Choose an appropriate data store
• Import caches the data, so is often used for smaller amounts of data.
o It requires time to import, but is swift once imported to generate results.
• DirectQuery retrieves data as and when needed, and will always have the latest data.
o It requires no time to import, as no importing is required. However, it may take
time to retrieve data.
• Composite mode (using Dual mode) is a bridge between Import and DirectQuery modes.
• Direct Lake uses the advantage of data being stored in the OneLake:
o to give fast performance (similar to import mode),
o but with the freshest data (like DirectQuery mode).
• It needs a lakehouse or warehouse on a Microsoft Fabric capacity.
o This lakehouse/warehouse then uses OneLake to store the data.
o Only tables in the semantic models derived from tables (not views) in the
Lakehouse/Warehouse can use Direct Lake mode.
o You cannot use both Direct Lake tables and other table modes (Import,
DirectQuery, Dual).
o Calculated columns and tables are not supported.
o It supports write operations using the XMLA endpoint in the latest versions of
SSMS, Tabular Editor and DAX Studio.
• Best ways to copy data:
o If you are uploading a small file(s) from a local machine
▪ Use a Local file upload
• You can click on the … next to Files, and go to Upload –
Upload files/folder.
• Table names can contain alphanumeric characters and
underscores up to 256 characters. No dashes or spaces are
allowed.
• Column names allow upper/lower cases, characters in other
languages like Chinese, and underscores up to 32 characters.
▪ You can also use the OneLake file explorer app. It integrates OneLake with
Windows File Explorer.
• You can download it from https://www.microsoft.com/en-
us/download/details.aspx?id=105222
▪ It adds this location into Windows Explorer, and includes a Sync column,
showing the synchronization status, showing:
• Blue cloud icon – online only,
• Green tick – downloaded to your computer,
• Sync pending arrows – in progress.
o If you are uploading a small amount of data, or using a specific connector (from
over 200 connectors), or want to use Power Query transformations
▪ Use a Dataflow
o If you have a large data source without using any data transformations
▪ Use the Copy tool in a pipeline
o If you have got complex data transformations
▪ Use Notebook code
21. Choose between dataflows, notebooks, KQL, and T-SQL for data transformation
• Primary developer skill set:
o Dataflow Gen 2: ETL, M, SQL
o Notebook: Spark (Scala, Python, Spark SQL, R)
o Warehouse (T-SQL): SQL
• Use case, including data ingestion:
o Dataflow Gen 2: Data ingestion, transformation, data profiling, and data wrangling
o Notebook: Data ingestion, transformation, data profiling, and data processing
o Warehouse (T-SQL): Data storage, transformation, reporting and analytics
• Primary developer persona:
o Dataflow Gen 2: Data engineer, data integrator, business analyst
o Notebook: Data engineer, data scientist, developer
o Warehouse (T-SQL): Data warehouse developer, data architect, data engineer, database developer
• Development interface:
o Dataflow Gen 2: Graphical – no code/low code using Power Query
o Notebook: Code using Notebook, Spark job definitions
o Warehouse (T-SQL): Code using SQL scripts
• Sources:
o Dataflow Gen 2: 150+ connectors
o Notebook: 100s of Spark libraries
o Warehouse (T-SQL): Using New Dataflow Gen2 or data pipeline
• Destinations:
o Dataflow Gen 2: Few
o Notebook: 100s of Spark libraries
o Warehouse (T-SQL): Data Warehouse, Power BI Service, other using Pipeline
• Transformation complexity:
o Dataflow Gen 2: Low to high: 300+ transformation functions
o Notebook: Low to high: support for native Spark and open-source libraries
o Warehouse (T-SQL): Low to high: SQL-based transformations for structured data
For KQL, see topic 29.
22. Create and manage shortcuts to data
• Shortcuts are OneLake objects which point to other storage locations.
o They can be other OneLake storage locations, or external to OneLake.
o The shortcut location is called the target path.
o They appear as folders in OneLake.
• To create a shortcut:
o Right-click on a folder (table or file) in the Explorer pane of the lakehouse,
and select "New shortcut".
o Select the source:
▪ Internal sources – Microsoft OneLake,
▪ External sources – Azure Data Lake Storage Gen2, Amazon S3 (or
compatible), Dataverse or Google Cloud Storage.
o Select the data source (and authentication, if using an external source),
o Expand File/Tables, select the subfolder(s) (up to 50 subfolders), and click Next.
o You will see the selected shortcut locations.
▪ You can edit to change the default shortcut name, or delete any selection.
o Click Create.
o You need to have Contributor, Member or Admin role for the workspace,
o You need write permissions for the shortcut location, and read permission in the
target location.
• To use it:
o The calling user must have read permissions for the target location.
• In a Lakehouse, shortcuts are shown at the top level of the Tables folder, or anywhere in
the Files folder.
o Shortcuts are not available in Tables folder subdirectories.
o Ideally, don't use tables with spaces in the file name.
▪ They will not be discovered as a Delta table in the lakehouse.
• To access shortcuts in the Table folder, you can use:
o df = spark.read.format("delta").load("Tables/MyShortcut")
o df = spark.sql("SELECT * FROM MyLakehouse.MyShortcut LIMIT 1000")
o (In SSMS) SELECT TOP (100) *
FROM [MyLakehouse].[dbo].[MyShortcut]
• Note:
o If you delete the shortcut, the target is not affected.
o If the target path moves, is renamed or is deleted, the shortcut can break.
23. Implement mirroring
• You can mirror external databases into Fabric, from:
o Azure Cosmos DB,
o Azure SQL Database,
o Azure SQL Managed Instance,
o Snowflake.
• This allows for near real-time copying of data.
o This includes replication of inserts, updates and deletes.
o It also includes an autogenerated SQL Analytics Endpoint, together with a default
semantic model.
• There is also automatic mirroring of Fabric SQL database.
• You can also have metadata mirroring from Azure Databricks.
o This mirrors names, schemas, and table structures
o This uses shortcuts to access the data from Fabric.
• To create a mirrored database:
o go to your workspace,
o click on “+New Item”,
o enter “mirror”, and
o select the appropriate object.
• For Azure SQL Database:
o To connect to the server, the System Assigned Managed Identity (SAMI) of the
database server must be ON, and must be the primary identity.
o You also need a login and mapped database user to connect to the database.
• For the example of an Azure SQL Database:
o Click on New – Azure SQL Database.
o Enter the server and database name.
o Change the Authentication kind and credentials if necessary (if using Entra ID,
choose “organizational account”, and sign in with the relevant account).
• For the example of the Azure Cosmos DB:
o In Networking, Connectivity method should be All networks.
o In backup, continuous backup needs to be enabled – either 7-day (which is free)
or 30-day.
o You will need the Primary Key of the Cosmos DB, from Settings – Keys.
o To create a sample database, go to Data Explorer – and “Launch quick start”.
o You can then select what data is to be mirrored.
▪ You can also check “Automatically mirror future tables”
o Unsupported column types will not be mirrored.
o Click Connect, then Create mirrored database.
24. Ingest data by using pipelines
• You can use the Copy data assistant:
o In the pipeline, click on "Copy data" or go to Home or Activities – Copy data – Use
copy assistant
o Source
▪ Select a data source, including sample data
▪ Enter your connection settings, either using an "Existing connection" or
"Create new connection".
▪ Choose the specific data to be transferred (for example, file/folder).
o Select a data destination
▪ Select a data source
▪ Enter your connection settings, either using an "Existing connection" or
"Create new connection".
▪ Map your data to the destination.
o Review the details, and click OK to save.
▪ It will then be added to your data pipeline canvas.
▪ Advanced settings will be available in the tabs.
• You can also add a copy activity.
o Go to Home or Activities - Copy activity – Add to canvas
o In the general tab, you can select:
▪ Name and Description, and whether it is enabled (in the Activity state),
▪ Timeout – how long the activity can run. The default is 12 hours. It is shown
in the format D.HH:MM:SS.
▪ Maximum number of retry events,
▪ Number of seconds between each retry attempt,
▪ "Secure output/input". When this is checked, details of the activity is not
logged.
o In the Source tab:
▪ select an existing connection, or click on +New to create a new
connection.
▪ In a dialog box, you can select the data source and connection.
▪ Back in the source tab, you can select more details, depending on the
connection type – for example, the connection type, "Use query"
(table/query/stored procedure) or root folder, and table.
▪ There are more settings in the Advanced section.
o In the Destination tab:
▪ select the connection, and more details.
▪ In the advanced section, you can select more settings, such as:
• Max rows per file,
• Table action – Append or Overwrite, and
• Max concurrent connections.
o In the Mapping tab, you can select the mapping from the source table to the
destination table.
▪ This allows you to map between columns which are differently named in
the two sources.
▪ In the Type conversion settings, you can select:
• Allow data truncation (for example, from decimal to integer, or
DatetimeOffset to Datetime),
• Treat Boolean as number (true = 1),
• Date and DateTime format (for example "yyyy-MM-dd
HH:mm:ss.fff").
• DateTimeOffset format (for example "yyyy-MM-dd HH:mm:ss.fff
zzz").
• TimeSpan format (for example "dd.hh:mm:ss")
• Culture (for example, "en-us", "fr-fr")
o In the Settings tab, you can select:
▪ Intelligent throughput optimization. Choose from Auto (which is dynamic
based on the sources and destinations), Standard, Balanced and
Maximum.
▪ Degree of copy parallelism,
▪ Fault tolerance – what happens if there are errors while copying.
▪ Enable logging – log copied files and skipped files and rows,
▪ Enable staging and Staging account connection (advanced).
• To run the data pipeline, go to Home – Run.
o You can see the results in the Output tab.
▪ You can export the results to CSV.
▪ You can filter for a particular "Activity status" (for example, succeeded),
hide output columns and show columns for Activity type, Run end, Activity
run ID, Source and Destination.
25. Transform data by using PySpark, SQL, and KQL
• See the “DP700CodeUsed” document for examples in SQL and KQL.
• You can load raw data into the lakehouse, then refine it and enhance it with other data,
and then expose it to business users using a Medallion Architecture.
o Raw data is loaded in a bronze layer.
o Refined and enriched data can then be saved in a silver layer.
o Data can then be exposed to business users in a gold layer.
▪ This is usually not the entirety of the data, as business users may not need
all of the columns or tables, and may need summarized data.
▪ You can have multiple gold layers, exposing different datasets for different
user groups.
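• For example, a minimal PySpark sketch of the three layers (the file path, table and column
names are hypothetical):
from pyspark.sql.functions import col

# Bronze: land the raw file as-is
bronze = spark.read.format("csv").option("header", "true").load("Files/raw/sales.csv")
bronze.write.mode("overwrite").format("delta").saveAsTable("bronze_sales")

# Silver: clean, type and de-duplicate
silver = (bronze
    .withColumn("SalesAmount", col("SalesAmount").cast("double"))
    .dropDuplicates()
    .where(col("SalesAmount").isNotNull()))
silver.write.mode("overwrite").format("delta").saveAsTable("silver_sales")

# Gold: summarise for business users
gold = silver.groupBy("ProductKey").sum("SalesAmount")
gold.write.mode("overwrite").format("delta").saveAsTable("gold_sales_by_product")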
PySpark
• Spark syntax - dataframes
• SELECT - df.select("columnName", "columnName2")
o selects only some columns.
o you can end with an unnecessary comma: df.select("columnName",
"columnName2",)
• col("string") or column("string") refers to a column called "string".
• df.show()
o shows dataframe (not list) in a text table.
• display(df)
o shows dataframe (not list) in a graphical table.
o From this display, you can:
▪ Show the information in a table or a chart.
▪ In the table view, you can:
• sort ascending or descending, or copy the column name.
• download the information to a CSV, JSON or XML file.
• click Inspect to show the individual cell, or if you haven't selected
an individual cell, the following for the table:
o Missing, Unique and (for non-strings) Invalid Value,
o A histogram.
• "Search" to filter the table, either on all columns (the default), or
on an individual column.
▪ In the chart view, you can click on "Customize chart" to:
• Change the chart type to:
o Line chart,
o Bar, Column or Area chart,
o Pie chart,
o Scatter chart,
o Box plot,
o Histogram chart,
o Pivot table or
o Word cloud.
• You can customise the data used. For bar charts, you can also
change the Key (the axis), Values, Series Group, Aggregation (Sum,
Avg, Min, Max, Count, First and Last), and whether it is Stacked.
• df.collect()
o shows dataframe in a list.
• df.schema
o this shows the structure using StructType (one for the table) and StructField (one
per column).
• df.summary()
o shows summary statistics (count, mean, stddev, min, percentiles and max) for the
columns. To see columns and data types, use df.dtypes or df.printSchema().
• Spark syntax - dataframes
• df.column_name.alias("newName") and df.column_name.name("newName")
o df.select(df.age.alias("age2"))
• concat(column1, column2, …) – from pyspark.sql.functions – combines the columns into a single
column.
• To add comments, prefix the comment with a #.
• ORDER BY
df.orderBy(desc("columnName"), "columnName2")
df.sort(asc("columnName"))
df.sort("age", ascending=True)
df.sort(df.column_name.desc()) or df.sort(df.column_name.asc())
o orders dataframe by column(s). Default is ascending.
o You can use asc(), asc_nulls_first(), asc_nulls_last() or the desc equivalent.
• df.columns
o shows all the columns as a list
• df.describe(["columnName", "columnName2"]).show()
o show count, mean, stddev, min and max of the columns.
• df.head(n), df.take(n)
o returns the top n rows as a list. If df.head() is used, returns top row. Cannot use
show().
• TOP - df.limit(n)
o returns the top n rows.
o Note: in Spark SQL, TOP(10) or TOP 10 is not used. Instead, you should use LIMIT
10 at the end of the query.
• df.tail(n)
o returns the last n rows as a list.
Enrich data by adding new columns or tables
• You can create new tables in a Dataflow Gen2.
o In the Workspace, go to New – Dataflow Gen2.
o Use the Power Query window to transform the data and add new columns.
• To add a new column in a notebook in pySpark, use the withColumn method
o df = spark.table("datatable")
o df = df.withColumn("hello", col("puLocationID")*0+1)
o or
o from pyspark.sql.functions import *
o df = df.withColumn("hello", lit(1))
▪ or lit('') or lit('NA') or lit(None) for an empty column.
• To use some of these functions, you will need to execute:
• from pyspark.sql.functions import *
• Date functions include:
o dayofmonth(col), dayofweek(col) and dayofyear(col) – day of the
month/week/year.
o weekofyear(col), month(col), quarter(col), year(col)
o hour(col), minute(col), second(col)
o add_months(start_date, number_of_months)
o date_add(start_date, days) and date_sub
o date_trunc(format, timestamp) truncates to the nearest unit in the format.
▪ Format can be: ‘year’, ‘yyyy’, ‘yy’, ‘month’, ‘mon’, ‘mm’, ‘day’, ‘dd’, ‘hour’,
‘minute’, ‘second’, ‘week’, ‘quarter’
o datediff(end, start) – number of days between the dates
o months_between(date1, date2) – number of months between two dates.
o last_day(date)
o next_day(date, dayOfWeek)
▪ dayOfWeek can be “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”
o length(col) – the number of characters in the string
o current_date() and current_timestamp() – time now
o trunc(date, format) truncates the date to the "format" unit, either ‘year’, ‘yyyy’, ‘yy’,
‘month’, ‘mon’, ‘mm’.
• Math functions include:
o abs – the absolute value,
o signum(col) – the sign of the number (-1, 0 or 1),
o trigonometric functions – acos, acosh, asin, asinh, atan, atanh, atan2, cos, cosh,
sin, sinh, tan, tanh
o advanced math functions
▪ exp(col) – exponential
▪ factorial(col)
▪ log10(col)
▪ radians(col)
o Power functions
▪ cbrt(col) – cube-root
▪ pow(col1, col2) – power
▪ sqrt(col) – square root
• Rounding functions include:
o ceil(col) – the ceiling (rounded up)
o floor(col) – the floor (rounded down)
o round(col, scale) – rounds to the nearest "scale" decimal places. "scale" can be
negative.
• String functions include:
o concat(string1, string2...) combines multiple strings together.
▪ Make sure you use lit(string) if you want to use literals.
o instr(column, search_string) looks for strings in a column
▪ For example: instr(string_column, 'D')
▪ 1 is the first character. If not found, it returns 0.
o lower(column) and upper(column)
o lpad(column, len, pad) pads the left of the string to the width "len" with the "pad"
string.
o ltrim(column), rtrim and trim – removes spaces from the left, right or both sides of
the column string.
o repeat(column, number) repeats the "column" string "number" times.
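• For example, combining a few of these in withColumn (the column names are hypothetical):
from pyspark.sql.functions import col, year, datediff, current_date, round, concat, lit, upper

df = spark.table("datatable")
df = (df
    .withColumn("pickupYear", year(col("pickupDateTime")))
    .withColumn("daysSincePickup", datediff(current_date(), col("pickupDateTime")))
    .withColumn("fareRounded", round(col("fareAmount"), 0))
    .withColumn("vendorLabel", concat(lit("Vendor-"), upper(col("vendorName")))))
display(df)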
Merge data
• To merge dataframes in PySpark, you can use
o df.union(df2) OR
o df.unionAll(df2)
o They both do the same thing. If you want to de-duplicate the rows, you also need
to use distinct()
o The column names do not need to be the same – it merges by position.
• If you want to merge by columns, you should use:
• df.unionByName(df2, allowMissingColumns = True).show()
o allowMissingColumns allows for columns not to be present in one of the
dataframes.
• To merge data in SQL, you can use:
• SELECT *
• FROM firstTable
• UNION [ALL]
• SELECT *
• FROM secondTable
o The column names do not need to be the same – it merges by position.
• You can also merge data using a Dataflow Gen2 (in the Power Query environment) by
using Home – Append Queries, just as in Power BI.
Join data
• To join data in PySpark, you can use
o df.join(df2, df.column == df2.column, 'outer') or
o df.join(df2, 'column', 'outer')
o The first argument is the second dataframe to join.
o The second argument is the join column(s).
▪ If using multiple columns, then you can use ['column1', 'column2']
o The third argument is how the join happens:
▪ If not included, it is an 'inner'.
▪ inner = the same value must be there in both columns
▪ left / leftouter / left_outer = all rows from the first dataframe, and all those
in the second which match.
▪ right / rightouter / right_outer = all rows from the second dataframe, and
all those in the first which match.
▪ Can also use cross; outer; full/fullouter/full_outer;
semi/leftsemi/left_semi; anti/leftanti/left_anti
• To join data in SQL, you can use:
• SELECT *
• FROM firstTable
• INNER/LEFT/RIGHT JOIN secondTable
• ON firstTable.column = secondTable.column
• You can also join data using a Dataflow Gen2 (in the Power Query environment) by using
Home – Merge Queries, just as in Power BI.
Convert data types
• In PySpark, you can use cast to change a column to a dataType. For example:
o df.tripDistance.cast("string")
• You can convert dates to strings by using:
o date_format(date (such as column_name), format)
▪ "format" could be "MM/dd/yyyy". It uses:
• yy or yyyy (not capitalized) – year
• Q – quarter of year
• d – day of month
• E – day of week ("Tue" or "Tuesday")
• D – day of year
• M (capital – otherwise, it would be minute) or L – month of year
o M is the "standard" form and L the "stand-alone" form,
which may be different in some languages (for example,
Russian)
o M or L = 1 or 12
o MM or LL = 01 or 12
o MMM or LLL = Jan
o MMMM or LLLL = January
• h – hour of day (1 to 12)
• H – hour of day (0 to 23)
• K – hour of day (0 to 11)
• k – hour of day (1 to 24)
• m – minute of hour
• s – second of minute
• S (1 to 9 characters) – fractional second
• a – am or pm
• VV – time-zone ID (America/Los_Angeles; Z; -08:30)
• z (1 to 3) – time-zone name (Pacific Standard Time; PST)
• O (1, 2 or 4) – offset ("GMT+8" or "GMT+08:00" or "UTC-08:00")
• X (1 to 5) – zone-offset (Z; -08; -0830; -08:30; -083015; -08:30:15)
• x (1 to 5) – zone-offset (+0000; -08; -0830; -08:30; -083015; -
08:30:15;)
• Z (1 to 5) – zone-offset (+0000; -0800; -08:00;)
• ' – escape for text
• " – string literal
• You can convert numbers to strings by using
o format_number(number,decimal_places)
▪ converts to a string using a number of decimal places from 0 upwards
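• Putting these together (tripDistance is from the earlier example; the other column names are
hypothetical):
from pyspark.sql.functions import col, date_format, format_number

df = (df
    .withColumn("tripDistanceText", col("tripDistance").cast("string"))
    .withColumn("pickupDay", date_format(col("pickupDateTime"), "MM/dd/yyyy"))
    .withColumn("fareFormatted", format_number(col("totalAmount"), 2)))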
• The data types are:
o tinyint: PySpark is -128 to 127; SQL is 0 to 255
o smallint (-32,768 to +32,767): PySpark smallint; SQL smallint
o int (-2,147,483,648 to 2,147,483,647): PySpark int; SQL int
o bigint (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807): PySpark bigint; SQL bigint
o decimal: PySpark decimal or decimal(p, s); SQL numeric
o floating numbers: PySpark float and double; SQL float and real
o strings: PySpark string, char(n) and varchar(n); SQL char(n) and varchar(n)
o boolean: PySpark bool; SQL bit
o date and time: PySpark timestamp; SQL datetime2
o date: PySpark date; SQL date
o time: SQL time
Filter data
• To filter data, you can use df.where() and df.filter()
o reduces the number of rows.
o df.filter("age = 2") or df.where("age > 2")
o Conditions are:
▪ > and < are greater than and less than.
▪ >= and <= include equal to.
▪ = or == is equal to
▪ != or <> is not equal to
• You can use the following in the where or filter:
o df.column_name.between(number1, number2)
▪ Where the value is between number1 and number2.
o df.column_name.contains('string'), endswith and startswith
▪ Where the column_name contains, ends with or starts with 'string'. It is
case sensitive.
o df.column_name.like('string')
▪ Where the column_name is like the string.
▪ You can use % for zero, one or many characters.
▪ You can use _ for one character only.
o df.column_name.isin('string', number, 'string')
▪ It evaluates true if the column_name is any of the values in the
brackets/parentheses.
o df.column_name.substr(startPos, length)
▪ This extracts part of a string – the equivalent to MID in Excel. Don't use
substring in PySpark (but you can use it in SQL).
26. Denormalize data
• For KQL and SQL, see DP-600 "20. Denormalize data"
• For PySpark, note that Third Normal Form (3NF) data is efficient for writing data.
• However, for reading data (such as in a data warehouse), denormalized data is better.
o It has fewer joins.
o It has more columns, but these can be compressed.
o It helps create a star schema, as opposed to a snowflake schema.
# Load DimProductCategory
dfCat = spark.read.format("csv").option("header","true").load("Files/DimProductCategory.csv")
display(dfCat)
# Load DimProduct
dfProd = spark.read.format("csv").option("header","true").load("Files/DimProduct.csv")
display(dfProd)
# Load DimProductSubcategory
dfSubc = spark.read.format("csv").option("header","true").load("Files/DimProductSubcategory.csv")
display(dfSubc)
# Join three tables together.
joined_df = dfProd.join(dfSubc, dfProd.ProductSubcategoryKey == dfSubc.ProductSubcategoryKey,
"left")
joined_df = joined_df.join(dfCat, "ProductCategoryKey", "left")
display(joined_df)
# Select final columns.
selected_columns = ["ProductKey", "EnglishProductName", "EnglishProductSubcategoryName",
"EnglishProductCategoryName"]
final_df = joined_df.select(selected_columns)
display(final_df)
27. Group and aggregate data
SQL
• See the “DP700CodeUsed” document for examples.
KQL
• summarize by GroupCol1, Col3 = GroupCol2, …
o This groups by the GroupCols.
• summarize Aggregation[, Col2 = Aggregation, …] by GroupCol1, Col3 = GroupCol2, …
o This groups by the GroupCols, using the Aggregations (which could be renamed).
• avg(expression) – Mean Average of expression.
• avgif(expression, condition) - Mean Average of expression when condition is true.
o Use == to compare in a condition
o = is for assigning a value
• count – Number of rows
o | count
o | summarize CountRows=count() by Category
• countif(condition) – Number of rows when expression is true
• dcount(expression) and count_distinct(expression)
o Counts the number of distinct values.
o dcount gives an approximation, but is quicker than count_distinct.
o dcount can also use an optional second "accuracy" argument from 0 (less
accurate) to 4 (more accurate). The default is 1.
o Recommend using dcount, as count_distinct is not recognized by Azure Data
Explorer.
• dcountif and count_distinctif filter on the second argument.
• max(expression) and min(expression) give the maximum and minimum value.
o maxif and minif filter on the second argument.
• sum calculates the total.
o sumif also filters on the second argument.
• avg, avgif – Mean average
• count, countif – Counting the number of rows
• dcount, dcountif – Distinct counting, eliminating duplications
• max, maxif – The biggest value of numbers, text, datetime and bool
• min, minif – The smallest value of numbers, text, datetime and bool
• sum, sumif – The total
PySpark
• For PySpark, you can use groupBy:
o df.groupBy("passengerCount").avg("tripDistance").show()
o or
o df.groupBy(df.passengerCount).avg("tripDistance").show()
o or
o df.groupBy(["passengerCount", df.vendorID]).avg("tripDistance").show()
o or
o df.select("passengerCount",
"tripDistance").groupBy("passengerCount").sum("tripDistance").show()
• You can also use the aggregate functions:
o avg, count (and countDistinct and approx_count_distinct), first, last, max, mean,
min, stddev, sum, sumDistinct and variance
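o For example, several aggregations at once with agg() (column names are from the
examples above):
from pyspark.sql.functions import avg, max, count

(df.groupBy("passengerCount")
   .agg(avg("tripDistance").alias("avgDistance"),
        max("tripDistance").alias("maxDistance"),
        count("*").alias("trips"))
   .show())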
• You can also group and aggregate data using a Dataflow Gen2 (in the Power Query environment) by
using Home – Group By, just as in Power BI.
28. Handle duplicate, missing, and late-arriving data
28a. Duplicate data
SQL
• To identify duplicate data in SQL, you can GROUP BY the data, then use a HAVING
COUNT(*)>1
KQL
• You can identify duplicate data by using summarize, then a where count() > 1. This is the
equivalent of HAVING.
o MyTable
o | summarize Count = count() by ID, Name, Value
o | where Count > 1
• To remove duplicate data, then either use:
o | summarize, or
o | distinct.
PySpark
• To identify duplicate data:
o In PySpark, you can groupBy the data, count it, then apply a filter where the count
is greater than 1.
o In SQL, you can GROUP BY the data, then use a HAVING COUNT(*)>1
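o For example, in PySpark (using the same hypothetical columns as the KQL example
above):
from pyspark.sql.functions import count

(df.groupBy("ID", "Name", "Value")
   .agg(count("*").alias("cnt"))
   .where("cnt > 1")
   .show())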
• In PySpark, to remove duplicate data from a dataframe, use:
o df.distinct().show()
• Alternatively, you can use dropDuplicates or drop_duplicates, if you only want to consider
certain columns:
o df.dropDuplicates(["vendorID", "passengerCount"]).select("vendorID",
"passengerCount").show()
• In SQL, you can use SELECT DISTINCT, or the GROUP BY clause.
• You can also remove duplicate data using a Dataflow Gen2 (in the Power Query environment) by using
Home – Remove Rows – Remove Duplicates, just as in Power BI.
28b. Identify and resolve missing data
KQL
• To identify null values, you can use:
o | where isnull(column_name).
• To identify missing data, you can join between two (or more) tables, and use:
o | where isnull(column_name) for one of the tables, or
o a join with kind=leftanti or rightanti to find data in one table which is not in the other
table.
PySpark
• To look for missing data in PySpark, you can use:
o column_name.isNull() or column_name.isNotNull()
o For example: df.where(df.pickupLongitude.isNotNull()).show()
• You can then use replace:
o Replacing value in all columns or a specific column.
▪ df.replace(10, 20) or
df.replace('Alice', None, 'Name')
o If you want to replace nulls, then you can use fillna (or na.fill)
▪ df.select("pickupLongitude").fillna({"pickupLongitude": 1}).show()
• If you want to use Spark SQL, then you can use IS NULL or IS NOT NULL in the Where
clause.
• If you want to fill in Nulls, then you can use the ifnull or coalesce function.
o For example: ifnull(field, value)
o coalesce(column1, column2…) returns the first non-null column.
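o The same idea in the dataframe API, as a sketch (dropoffLongitude and the default of
0.0 are just illustrative):
from pyspark.sql.functions import coalesce, col, lit

df = df.withColumn("longitudeFixed",
                   coalesce(col("pickupLongitude"), col("dropoffLongitude"), lit(0.0)))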
28c. Identify and resolve late-arriving data
• For late-arriving data:
o In Event Stream – you can use the Offset in the Group By operator – see topic 33.
o For Activator, see topic 37 (late arrival tolerance)
• If data is late-arriving then:
o Power BI rules are evaluated every hour, and evaluate all events that arrive up to
one hour after the event occurs.
o For other items such as eventstreams, the “Advanced settings” may have a “Wait
time for late-arriving events”.
▪ You can set it for 1, 2 or 5 minutes.
▪ This configures the balance between waiting for any potential late-arriving
events, or processing data more swiftly, though it may be incomplete.
Ingest and transform streaming data
29. Choose an appropriate streaming engine
• Eventstreams are used for capturing, transforming and routing real-time events to a
destination.
o The “transforming” part means that the data can be transformed before it is
written to the destination, allowing for some analysis, joining/unioning of data,
and removal of unnecessary columns and rows (filter).
o It uses a no-code interface, like a pipeline.
o “Real-time” doesn’t necessarily mean huge quantities of data. It allows you to take
event-driven actions rather than schedule-driven ones.
• Eventhouses and KQL databases are good for processing data in motion.
o Ideal for time-based, streaming data.
o It can use structured, semi-structured, or unstructured data.
o Data is automatically indexed, and is partitioned based on when it has been
ingested.
o This results in fast analysis, and allows for complex analysis as well.
o Eventhouse data can be made available in OneLake, so that other Fabric objects
can use the data.
• KQL querysets allow you to query data, and save the queries for later use and sharing.
o You can also create Power BI reports for it.
• Purpose:
o Eventstream: Ingests and processes high-velocity data streams, acting as an entry point for real-time data into Fabric.
o Eventhouses/KQL Database: Stores and queries processed streaming data using Kusto Query Language (KQL) for real-time analytics.
o Spark Structured Streaming: Processes real-time data streams using structured queries within the Spark framework.
• Data Processing:
o Eventstream: Captures and optionally transforms incoming event data before routing to destinations like KQL Databases or Lakehouses.
o Eventhouses/KQL Database: Performs real-time analytics on ingested data, supporting complex queries and aggregations.
o Spark Structured Streaming: Provides scalable and fault-tolerant stream processing, treating live data streams as unbounded tables for continuous query execution.
• Integration:
o Eventstream: Integrates with various data sources and sinks within Fabric, facilitating seamless data flow.
o Eventhouses/KQL Database: Serves as a destination for Eventstream outputs and integrates with other Fabric components for analytics.
o Spark Structured Streaming: Integrates with Fabric's Lakehouse and other storage solutions for both batch and streaming data.
• Use Cases:
o Eventstream: Real-time data ingestion from IoT devices, logs, or applications requiring immediate processing.
o Eventhouses/KQL Database: Interactive querying and analysis of streaming data, such as monitoring dashboards and alerting systems.
o Spark Structured Streaming: Complex event processing, real-time analytics, and machine learning on streaming data.
• Query Language:
o Eventstream: Utilizes KQL for defining transformations and routing rules within the stream.
o Eventhouses/KQL Database: Employs KQL for querying and analyzing stored data.
o Spark Structured Streaming: Uses DataFrame and Dataset APIs with support for SQL-like queries in Python, Scala, and Java.
• Scalability:
o Eventstream: Designed to handle millions of events per second, ensuring low-latency processing.
o Eventhouses/KQL Database: Optimized for high-concurrency queries on large datasets, providing quick responses.
o Spark Structured Streaming: Scales horizontally to process large volumes of streaming data with low latency.
29a. Choose between native storage, followed storage, or shortcuts in Real-Time
Intelligence
• Native storage in Real-Time Intelligence is data which has been ingested.
• Database follower shortcuts and shortcuts create pointers to other storage.
o This means that data is not copied or duplicated, avoiding the potential problems that
can arise when copies of data are updated.
• Database shortcuts (follower)
o A database shortcut in Real-Time Intelligence is a reference in a KQL database
from another KQL Database or from an Azure Data Explorer database.
▪ The KQL Database needs to be available in OneLake.
o This allows you to write KQL code which queries both the current KQL database
and the shortcut.
o It is read-only.
o They must be in the same region, but can be in different tenants.
o Changes in the source are reflected in the shortcut, but it may take a few seconds
to a few minutes.
o The shortcut views the data in the same storage account that the source uses,
and does not ingest it.
o You can create a shortcut by:
▪ In the source:
• if it is a KQL database, click on “Copy URI” next to “Query URI” (not
next to “Ingestion URI”) – next to the database, not next to the
table.
• if it is an Azure Data Explorer database, copy its cluster URI instead.
• If you are creating a shortcut from an Azure Data Explorer
database, you could Share – Invitation token instead.
▪ In the destination KQL database:
• Click on the + next to KQL databases.
• Enter a name for the shortcut, change the type to “New shortcut
database (Follower)”, and click Next.
• Select the Method as Cluster URI, and enter the URI.
o If you are sharing an invitation token, then select Invitation
Token as the Method, and paste the token.
• Change the Database source, and optionally modify the default
cache policy (days).
o You can then query the followed database in KQL by using:
database('KQLDatabaseName').table('TableName') OR
database('KQLDatabaseName').TableName
• Shortcuts (a single table)
o You can create a OneLake shortcut in a KQL database which points to:
▪ Microsoft OneLake,
▪ Amazon S3 or S3 Compatible sources,
▪ Azure Data Lake Storage Gen2,
▪ Dataverse (from the Power Platform), and
▪ Google Cloud Storage.
o It creates it as an external table. You can only connect to one table or subfolder at
a time.
o To create a shortcut:
▪ Go to New – OneLake shortcut,
▪ Select the source type,
▪ If the source type is external, then enter the Connection settings.
▪ Select the Source and click Next,
▪ Check the table and click Next.
▪ You can turn on/off “Query acceleration” for some data sources (for
example, OneLake and Azure Data Lake Storage Gen2).
• It indexes and caches some data into an Eventhouse.
▪ Then Create.
o To edit the “Query acceleration” policy:
▪ click on the … next to the Shortcut, and go to Data policies,
▪ You can turn Query acceleration on/off, and change the Caching period
from 1 to 36,500 days.
o You can query the data using:
external_table("DimProduct")
30. Process data by using eventstreams
• Eventstreams can have the following sources (and you can use multiple data sources):
o Microsoft Azure sources
▪ Including Event Hubs, Service Bus, IoT Hub, Data Explorer.
o Microsoft Azure Database sources using Change Data Capture (CDC)
▪ including Databases for SQL, PostgreSQL, MySQL, Cosmos DB, Managed
Instance, and SQL Server on a Virtual Machine.
o External sources:
▪ Google Cloud Pub/Sub (Publisher/Subscriber), a messaging service for
events
▪ Amazon Kinesis Data Streams,
▪ Confluent Cloud Kafka,
▪ Apache Kafka, and
▪ Amazon Managed Streaming for Apache Kafka.
o Discrete events:
▪ Azure Blob Storage events,
▪ Fabric OneLake, Workspace Item and Job events
o Sample data sources
▪ Bicycles,
▪ Yellow Taxi, and Stock Market (high data-rate)
o Custom endpoint.
• Once a datasource is added to an Eventstream, a new stream is created, which you can
view in the Real-Time hub.
• You can see a preview of the data/transformed data in the “Test result” pane by clicking
on “Refresh”,
• You can also see any “Authoring errors”.
o When transforming data, you may see an “error”. This tab may show that you
haven’t configured it, or that you eventually need a Destination.
• You add transformations by clicking on the “Add transformation” box, or going to Home –
Transform events.
o You can edit edges (the connectors which link, for example, a source to a transform)
by clicking on them.
o You can also have (for example) a stream connect to multiple transform events at
the same time by creating the Transformations, and dragging the Edge from the
stream to the Transformation.
o You can change the name of the Transformation by clicking on it and going to the
pencil icon (for Edit) and entering a new name, and clicking Save.
• You can transform the data using:
o Aggregate (calculations)
▪ Create SUM, AVG, MIN and MAX functions.
▪ You can have multiple aggregations.
▪ You can create a single result, or partition by a field.
▪ You can also aggregate values recently received, within the last X days,
hours, minutes, seconds or milliseconds.
▪ If you need more complex options, see Group By.
o Expand
▪ Create a new row for each value in an array.
▪ You can choose whether or not to create a row for a missing/empty array.
o Filter
▪ For example, IS NULL or IS NOT NULL
o Group by
▪ You can calculate Average, Count, Maximum, Minimum, Percentile,
Standard deviation, Sum and Variance.
▪ You can also group the aggregations by a field, and specify how recently
the values would have arrived: days, hours, minutes, seconds or
milliseconds or microseconds.
▪ You can also specify an offset, to ignore the last X seconds. This could be
useful to cope with late-arriving data.
▪ You can specify what type of time window it is going to be (in the following,
I will use “X seconds” to denote a time period, though it could be
milliseconds, for example). See topic 33 for details.
o Join
▪ Bring together two streams based on a condition.
▪ You can use an inner join or left outer join.
▪ You may want to use a “Manage fields” transformation to rename the
columns or remove some columns.
o Union
▪ Bring together two streams, combining all rows.
▪ Only includes columns with the same names and data type. Other
columns are dropped (you could Manage fields before the Union, if you
want to change their name).
o Manage fields. This allows you to select what fields to output from the operation.
▪ Any fields you do not add will not be included in its output.
▪ You can add existing fields by clicking on “+ Add field” and expanding
“Imported Schema”.
• You can also add on “Add all fields”.
• After they have been added, you can also change their Name and
Type.
o All Data Types can be converted to Strings.
o Int64 and Double can also be converted to the other data
type.
o Strings can be converted to Int64, Double or Datetime, but only if the
value is appropriate (e.g. you cannot convert “one” into an Int64).
▪ You can also add additional column, using the following functions:
• Date/Time functions: SYSTEM.Timestamp() – the current date and
time, Year, Month, Day, DateAdd, DateDiff, DateName and
DatePart
• String functions: Upper/Lower, Len, LTrim/RTrim/Trim,
Nchar/Unicode, Reverse, CharIndex, Left/Right/Substring,
RegexMatch, Replace and Replicate
• Mathematical functions: Abs, Ceiling, Exp, Floor, Sign, Square,
Sqrt, Round and Power
• Once you have added a transform, you can edit it by clicking on the pencil icon.
o You can also click the “Delete” icon.
• You can insert a node by hovering over a connector and clicking the “Insert a node” icon.
o You can also click the “Delete” icon.
• Eventstreams can have the following destinations:
o Eventhouse,
▪ This is used for high quantities of data.
▪ You can create a KQL database within an eventhouse.
o Lakehouse,
o Activator (monitor), and
o Custom endpoint.
o If you have added transformations, then you can also add a derived stream, which
can be routed to multiple Fabric destinations, including the Real-Time hub.
• To start the Eventstream, click on “Publish”.
o Enter the details of the Destination, together with the Input data format.
o After an eventstream has been published, after a minute or so you may also have
to click on “Activate” on the destination.
• In the KQL_queryset, use:
• Stream2
• | summarize NoBikes = avg(left_No_Bikes) by left_BikepointID, left_Neighbourhood,
right_Window_End_Time
• | order by left_Neighbourhood asc, right_Window_End_Time asc
31. Process data by using Spark structured streaming
• Spark structured streaming allows you to incorporate real-time data streams into a
lakehouse using PySpark.
• First, get the schema for the new lakehouse table.
o This can easily be done by loading a file into the lakehouse, loading it into a dataframe, and retrieving its schema so it can be reused in the streaming query.
o Then you can delete the table.
• Next, load the streaming data with the specified schema.
• Then write the stream to a Delta Lake table.
o The checkpointLocation is needed to manage the writing of the data from the
stream.
o .start() is needed to begin writing the stream.
• Finally, stop the streaming query to finish.
• You can then run the code in your notebook (a minimal end-to-end sketch is shown at the end of this topic).
o Once the code has initially run, you can query the table in (for example) the SQL
analytics endpoint.
• However, instead of requiring a notebook to be continuously running, you can use a Spark Job Definition.
o Copy the code into a .py text file on your machine.
o Add the following code to the beginning:
▪ from pyspark.sql import SparkSession
▪ spark = SparkSession.builder.getOrCreate()
o This code is implicitly run in a Notebook, but not in a Spark Job Definition.
o Go to the Workspace, and click on “+New” and Spark Job Definition.
o In the “Main definition file”, click on “Upload a local file” and Import the .py file
from your computer.
o In the “Lakehouse reference”, select the appropriate lakehouse.
o Save the Spark Job Definition and then click on Run.
• You can see the job starting in the Message tab, and the runs in the Run tab.
o You can click on the … next to the Run to stop the job.
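• Putting the steps above together, the following is a minimal sketch of the kind of code involved – the schema, column names, paths and table name are illustrative assumptions, not taken from the course walkthrough:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Implicit in a notebook, but required at the top of a Spark Job Definition .py file
spark = SparkSession.builder.getOrCreate()

# Explicit schema for the incoming files (illustrative)
schema = StructType([
    StructField("BikepointID", StringType()),
    StructField("No_Bikes", IntegerType()),
    StructField("Timestamp", TimestampType())
])

# Load the streaming data with the specified schema from a lakehouse Files folder
stream_df = spark.readStream.format("json").schema(schema).load("Files/bike_data/")

# Write the stream to a Delta Lake table; checkpointLocation manages the state of the write
query = (stream_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "Files/checkpoints/bike_data")
    .start("Tables/bike_data"))

# Finally, stop the streaming query when you want it to finish
# query.stop()

• Saved as a .py file with the SparkSession lines at the top, the same sketch can be uploaded as the main definition file of a Spark Job Definition, as described above.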
32. Process data by using KQL
• See the “DP700CodeUsed” document for examples.
33. Create windowing functions
• Tumbling window
o Windows are of a fixed length, last the same number of seconds, and one starts
when another ends.
o Windows cannot repeat or overlap, and an event only belongs to a maximum of
one tumbling window.
• Hopping window
o Create a window every time period (say, every X seconds).
o The window could be bigger than the hop, and so windows can overlap, and
events can be in multiple windows.
o A tumbling window is a Hopping window, where the hop size is equal to the
window size.
• Sliding window
o Create a new window when an event enters or leaves a window (as long as there
is one event left in the window).
o Events can belong to more than one window.
o Every window has at least one event.
• Session window
o A session window groups events that arrive close together in time.
o It starts when an event happens.
o It ends X seconds after the last event has happened, or when the window becomes Y seconds long in total.
o No window is created when no events happen.
• Snapshot window
o Groups events that have the same timestamp.
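• For comparison, similar windows can be expressed in Spark Structured Streaming (topic 31) using the window() and session_window() functions. A minimal sketch with illustrative table and column names – note that Spark's own documentation calls the hopping style a "sliding" window, which is not the same as the eventstream sliding window above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, session_window, avg

spark = SparkSession.builder.getOrCreate()

# Illustrative streaming source with Timestamp and No_Bikes columns
events = spark.readStream.table("bike_events")

# Tumbling window: fixed 10-second windows that do not overlap
tumbling = events.groupBy(window("Timestamp", "10 seconds")).agg(avg("No_Bikes"))

# Hopping window: 10-second windows starting every 5 seconds, so they overlap
hopping = events.groupBy(window("Timestamp", "10 seconds", "5 seconds")).agg(avg("No_Bikes"))

# Session window: closes 20 seconds after the last event arrives
sessions = events.groupBy(session_window("Timestamp", "20 seconds")).agg(avg("No_Bikes"))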
Monitor Fabric items
34-35. Microsoft Fabric Capacity Metrics app
• You can manage Fabric Capacity by installing the Microsoft Fabric Capacity Metrics app.
• To install it:
o Go to Apps – Get apps,
o Search for Microsoft Fabric,
o Click the “Microsoft Fabric Capacity Metrics” app.
o Click “Get it now”.
• To run it for the first time:
o Go to Apps – click the “Microsoft Fabric Capacity Metrics” app,
o You will see the message “You have to connect to your own data to view this
report”. Click Connect.
o Enter:
▪ CapacityID – you will see this in Settings – Admin portal – Capacity
settings, and select a capacity. The CapacityID is a series of hexadecimal
characters and dashes.
▪ UTC_offset – the number of hours before/after UTC (GMT).
▪ Timepoint/Timepoint2 – this is an internal value, which you should not fill
in.
▪ Advanced – whether the app automatically refreshes your data at
midnight.
o Click Next.
o In the “Connect to Microsoft Fabric Capacity Metrics”, fill in:
▪ Authentication method – by default, use OAuth2,
▪ Privacy level setting for this data source – use Organizational to access all the organization's data sources.
o Click on “Sign in and connect”.
o Select a capacity from the “Capacity Name” dropdown.
o It may take a few minutes for the app to get your data.
• The Compute page contains:
o A ribbon chart containing an hourly view of:
▪ Capacity Units (CU) (in seconds),
▪ Duration (processing time in seconds),
▪ Operations (count),
▪ Users (who have performed Operations),
o Capacity utilization
▪ It shows:
• Background % (billable) and non-billable: % of CU consumption
used in a 30-second period. These are operations not triggered by
users – for example, data refreshes.
• Interactive % (billable) and non-billable: Resources triggered by
users, associated with interactive page loads.
• Autoscale % – shows timepoints where the capacity is
overloaded.
• CU % Limit – the threshold of the allowed CU %.
o Throttling. This is a limit on your CU usage. It happens if the CUs for interactive and background operations exceed the allowance in a 30 second timepoint. Either:
▪ You have Autoscale enabled. If so, a new CU will be added for the next 24
hours, up to the maximum number of CUs allowed. If it goes above that
threshold, throttling will happen.
▪ You do not have Autoscale enabled. Then throttling will be applied.
o You can use a linear or logarithmic scale, and use filters.
o It shows:
▪ Interactive delay
• Where capacity went over by between 10 and 60 minutes.
• User interactive jobs are throttled.
▪ Interactive rejection
• Where capacity went over by between 60 minutes and 24 hours.
• User interactive jobs are rejected.
▪ Background rejection
• Where capacity went over after 24 hours.
• User scheduled background jobs are rejected and not executed.
o Overages:
▪ Add % - the carry-forward % during the current period,
▪ Burndown % - the carry-forward % burned down during that period,
▪ Cumulative % - the cumulative carry-forward %.
o System events:
▪ Displays pause/resume capacity events, with
• Time,
• State (suspended and active), and
• State Change Reason
o Matrix by item and operation
▪ Shows the "performance delta", which looks at fast operations (those completing in under 200 milliseconds) and compares the current value with the value 7 days ago.
▪ This can be used to see if your average performance improved/worsened over the past week.
• The higher the value, the better the performance.
▪ You can sort the matrix by the "performance delta" to find the biggest change in performance.
▪ A high CU utilization means that the item is heavily used or runs many operations.
▪ Items with a low CU utilization might show a volatile performance delta.
▪ It shows:
• Workspace,
• Item kind (type),
• Item name,
• CU in seconds over the last 2 weeks,
• Duration (Processing time) over the last 2 weeks,
• Users (count), and
• Billing type (Billable, non-billable, and both).
▪ Using the “Select optional column(s)”, you can also add:
• Rejected/failed/invalid/inProgress/Successful count (number of
operations),
• Virtualized item/workspace,
• Item Size (Gb),
• Overloaded minutes (number of 30 second increments where
overloading occurred at least once),
• Performance delta.
• Storage page:
o You can use the following filters:
▪ Capacity Name,
▪ Date Range,
▪ Experience, and
▪ Storage type.
o You can see in cards:
▪ Number of workspaces,
▪ Current/billable storage (in Gb)
o There is a “Top workspace by billable storage %”, which includes:
▪ Workspace name/ID,
▪ Operation name,
▪ Deletion status (whether it is active),
▪ Billing type (whether it is billable),
▪ Current/billable storage in Gb,
▪ Billable storage % (of the capacity)
o Column charts showing Storage (Gb) and Cumulative Billable Storage (Gb) by
date/hour.
o You can also Export the Data.
• You can also monitor a paused capacity. It shows all the paused capacity events.
34. Monitor data ingestion
Spark Job Definition
• To monitor Spark Job Definition runs, click on the Runs tab inside the item.
KQL databases
• To monitor streams going into KQL databases, open the database to see the ingestion over
the last 7 days.
o If you click on the table view, you can also see:
▪ Row count,
▪ Original/compressed size,
▪ Last ingestion,
▪ Caching,
▪ Retention,
▪ Whether it is available in OneLake, and
▪ Created on.
Data pipelines
• To monitor data pipeline runs:
o Go to Home – "View run history" or Run – "View run history" in the data pipeline,
or
o Go to the workspace, click on the … next to the Data Pipeline, and click on Recent
runs.
• You can then see the recent runs.
Monitoring hub
• You can click on "Go to monitoring hub" to view more details.
o You can also filter the runs.
o If you click on a run, you will see more information.
o If you click on a specific pipeline run, you can view more information, and click on the Input and Output icons.
o To see performance details about an activity, click on it.
▪ You can see more information in the Duration breakdown and advanced
section.
▪ You can view the activities as a Gantt chart, showing the length of bars as
the duration of the activity, by clicking on "Gantt".
▪ To find more details about the Input and Output in JSON format, click on
the icon in those columns.
• You can copy the details to the clipboard.
o To run the pipeline again, click on Rerun.
▪ You can rerun the entire pipeline, selected activities, or only the failed
activity.
o To make changes to your pipeline, click on "Update pipeline".
• A more central location is the Monitor hub – click on Monitor on the left-hand side. It
displays activities for:
o Data pipelines,
o Dataflow Gen 2,
o Datamarts,
o Lakehouses,
o Notebooks,
o Semantic models and
o Spark job definitions.
• Activities are initially sorted by start time, but can be re-sorted.
• You can also filter for a keyword (in the text box), or click the Filter button to filter for:
o Status (cancelled, succeeded, failed, in progress, not started or unknown)
▪ More details can be found by clicking on the “View detail” for that activity.
o Item type,
o Start time,
o Submitted by, and
o Location (Workspace).
• Once a filter is applied, you can click on “Clear all” to remove the filters.
• The initial columns are the above together with “Activity Name”. You can click on “Column
Options” to add/remove/rearrange columns.
• You can click on the item to go either to that item or the workspace which contains it.
• You can click on the i symbol next to the activity name to open the Details side-panel.
• You can also click on the … next to the item to:
o Open the item
▪ For Spark Job Definition, this allows you to look at Jobs, Resources Used,
Logs, Data and Item snapshots.
o View detail, or
o See Historical runs.
▪ This displays up to 30 days for that item.
▪ You can export or refresh that data.
35. Monitor data transformation
• To monitor your Dataflow Gen2, click on the … next to it, and go to “Refresh history”. This
shows:
o Start time,
o Status,
o Duration and
o Type.
• Initially, it shows the first 50 refresh histories up to 6 months back.
o This can be expanded to up to 250 refresh histories up to 6 months back.
• The refresh history can be downloaded to a CSV file by clicking on “Download as CSV”.
• You can drill down into a particular refresh by clicking on it. It shows:
o Status of the dataflow,
o Type of refresh,
o Start/End time,
o Duration,
o Request ID, Session ID and Dataflow ID.
• There is a section for Tables and Activities. For more information, you can click on a
particular table/activity.
• For data pipelines, see topic 34. Click on it and go to “Recent runs”.
o You can then click on “Go to monitoring run” for more details.
o To make changes to your pipeline, click on “Update pipeline”.
o To see information about a specific activity, you can click on it.
• Also, see topic 34.
36. Monitor semantic model refresh
• To monitor a semantic model refresh:
o Click on the … next to the semantic model in the Workspace, and
o Go to Refresh history.
• There you will see tabs for:
o Scheduled (upcoming refreshes)
o OneDrive
▪ This is if your semantic models/reports are based on a Power BI Desktop
file, Excel workbook, or .csv file on OneDrive or SharePoint Online.
▪ By default, this checks whether a refresh is needed every hour.
▪ You can switch it off by clicking on the … next to the semantic model, going
to Settings, then Semantic models, and expanding the “OneDrive refresh”
node.
▪ Note: it doesn’t refresh data in the Power BI Desktop source file – it just
refreshes the existing data. If you want to refresh the Power BI Desktop
source file as well, you need to do an on-demand refresh.
o Direct Lake and
o OneLake integration.
• You can also go to the Refresh summary History:
o Go to Settings – Admin portal,
o Then go to Capacity settings – Refresh summary.
• This shows:
o Schedule
▪ This tab is useful for working out if the refresh schedules are separated
from each other, thus reducing a spike in workload.
▪ It is separated into 30 minute slots.
▪ The “Refresh time booked (minutes)” shows the total number of minutes
due to happen within that slot (either starting in or going into that slot).
• If this is over 30 minutes for a 30 minute slot, you may have too
many things happening then.
▪ The “Refresh time available (minutes)” shows the remaining time.
o Refresh history
▪ You can refresh or export the History to a csv file.
• By clicking on the columns, you can sort ascending/descending, add a text filter, or clear
all filters.
37. Configure alerts
• See topic 15.
Identify and resolve errors
38. Identify and resolve pipeline errors
• To identify a pipeline error, you need to monitor it. See topic 34 for details.
o You can also manually monitor it, by:
▪ going to Run – “View run history” inside the Pipeline, or
▪ clicking on the … next to the Pipeline and going to “Recent runs”.
▪ You can then click on “Go to monitor”, or click on a run for more details.
▪ In a particular run, you can click on an Activity name for more details,
including the Activity status.
• You can also filter the table, add/remove columns, and export
either the current page or the top 1,000 activities to a .csv file.
o You should note the status to see whether it succeeded, failed, in progress,
cancelled or queued.
▪ If it failed, and had multiple activities, then you can look through the list of activities to see where the problem was.
▪ If an activity was not completed (failed, timed out, or was cancelled), then
you can click on “Rerun from failed activity” to continue it.
▪ You can also click on “Update pipeline” to edit it.
• If the problem is Fabric capacity performance, then you can check it by going to the
Compute tab in the Microsoft Fabric Capacity Metrics app.
• To resolve a pipeline error, you can add an activity “On Fail”. To do this:
o Add an activity, such as an Office 365 activity, to send an email on failure, notifying
somebody of the failure.
▪ You could also add a Teams activity.
o Connect it by dragging from the preceding activity's "X – On fail" handle to the new activity.
• You can also add a link between activities:
o Upon Skip – if the activity did not run,
o Upon Completion – after the activity ran, either successfully or resulting in failure.
▪ This should be used if the success or otherwise of the preceding activity
was not critical.
o Upon Success – the preceding activity was successful.
39. Identify and resolve dataflow errors
• When dataflows load data, they can use a staging table.
o This allows it to be loaded before transformations are done. It may or may not
improve overall performance.
o You can toggle it on and off by right-clicking on the query and selecting "Enable staging".
o You can also create separate dataflows for loading and then for transformation.
• To check the performance of a dataflow:
o Go to the Workspace,
o Next to the dataflow, click on … and select Refresh History.
o The runs are shown, together with their duration.
o Click on a run. This shows the activities, with:
▪ The start and end date/time, and
▪ The duration,
▪ Request, Session and Dataflow IDs.
o Click on an activity, and you can see:
▪ The start and end date/time,
▪ The duration, and
▪ The Volume processed (bytes/rows read/written).
• You can run the dataflow multiple times to compare timings.
• You can check the Fabric capacity performance by going to the Compute tab in the
Microsoft Fabric Capacity Metrics app.
• You can break complicated dataflows into multiple dataflows.
o It can make it easier to understand, and to reuse.
▪ It can also reduce timeout errors.
o You can have separate dataflows work on different tables, and sequence them
using a pipeline.
o Or you can have sequential dataflows work on the same table.
o You can split the ingesting of data (staging dataflows) from those which transform
data.
▪ It can reduce the number of read operations from the source, and reduce
the requirements for the data gateway.
▪ It can also be useful to have a copy of the data that was on the source, in
case something changes on the source.
▪ It also allows the transformation dataflow to be completely independent of the source.
o You should separate dataflows which have different refresh schedules.
• If you have more complicated transformations that are used in more than one data
source, you may be able to use a "Power Query"-type function.
40. Identify and resolve notebook errors
• To monitor a notebook run, see topic 34.
• You can view runs which failed from an error.
• You can click on those runs, view the problem, and correct as necessary.
• You can also manually run a notebook, and see and resolve any errors which arise from
this manual run.
• You can store your data in partitions (see topic 44).
• Additionally, to speed up loading in notebooks, you can use a high concurrency mode.
o You can attach a notebook to an existing Spark session.
o There is no need for a notebook to start its own Spark session.
o High Concurrency sessions can be shared by notebooks that:
▪ are run by the same user,
▪ have the same default lakehouse,
▪ have the same Spark compute configurations, and
▪ have the same library packages.
o If you need more dedicated compute, you can use a standard session.
• To allow high concurrency mode to be used in any notebook:
o Go to the workspace,
o Click on "Workspace settings",
o Expand the Data Engineering/Science section and click on "Spark settings",
o In the High concurrency tab, switch "For notebooks" to On.
• You can do this for an individual notebook by going to Home – Connect – New high
concurrency session.
41. Identify and resolve eventhouse errors
• See topic 47.
42. Identify and resolve eventstream errors
• See topics 30 and 47.
43. Identify and resolve T-SQL errors
• General T-SQL knowledge.
Optimize performance
44. Optimize a lakehouse table
• Loading data using partitions allows you to separate data.
o Each subset is called a partition or shard.
o These partitions can be processed separately.
• When copying data in a data pipeline, go to the Destination tab.
• In the Advanced section:
o Check "Enable partition".
o In "Partition columns", select the relevant partition column(s).
▪ The column(s) should be of string, integer, Boolean or datetime type.
▪ If you are using multiple columns, then it is partitioned by the first column, then by the second.
• To reorder them, you can drag the columns.
• When you run the pipeline, the files will be stored as [ColumnName]=[ColumnValue] (a notebook equivalent using partitionBy is sketched at the end of this topic).
o You can view the files by right-clicking on the Lakehouse table (or left-clicking on the table and going to …) and selecting "View files".
• To read multiple files into a dataframe, you can use:
o spark.read.option("recursiveFileLookup", "true").parquet("*.parquet")
o Using recursiveFileLookup = True will search through subfolders.
• V-Order
o The Delta Lake table format can be optimized using V-Order. This enables fast
reads for Power BI, SQL and Spark.
o Microsoft says that read times can be between 10% and 50% faster.
o It applies sorting, row group distribution, dictionary encoding and compression
on Parquet files.
▪ This reduces disk space. Therefore, it needs less network and CPU
resources to read it.
▪ It also decreases write speed. Microsoft says by around 15%.
o To check the status of V-Order in Apache Spark, or to enable it, use:
▪ spark.conf.get('spark.sql.parquet.vorder.enabled')
▪ spark.conf.set('spark.sql.parquet.vorder.enabled', 'true')
o To check the status of V-Order in SQL
▪ SET spark.sql.parquet.vorder.enabled
o To enable it in SQL, use
▪ SET spark.sql.parquet.vorder.enabled=TRUE
▪ or
▪ CREATE TABLE … USING parquet
TBLPROPERTIES("delta.parquet.vorder.enabled" = "true");
• Optimize Write
o This aims to increase individual file size to between 128 Mb and 1Gb, and is
enabled by default in Microsoft Fabric.
o To set it in Apache Spark, use:
▪ spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
o In Spark SQL, use:
▪ SET spark.microsoft.delta.optimizeWrite.enabled = true (omit "= true" to check the current value)
• Delta table has higher performance when there are a small number of large files, not a
large number of small files.
• For better query performance, data files should be approximately 128 Mb-1 Gb in size.
• To perform optimization, click on the … next to a table in a lakehouse, and click on Maintenance.
o Delta Lake identifies tables which should be optimized, and queues them to be
optimized.
o It combines multiple smaller files into larger files.
o It does not impact on data readers and writers.
o It can perform:
▪ OPTIMIZE – optimizes file sizes.
• You can also apply V-Order to maximize reading speeds in Fabric (but reduce writing speeds).
▪ VACUUM – Delta Lake keeps a history of all changes made over time. VACUUM deletes data files that are no longer referenced by the Delta table and are older than the retention threshold.
• By default, the retention threshold is 7 days.
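• The partitioning and the table maintenance described above can also be carried out from a notebook. A minimal sketch, assuming a lakehouse table called sales with a Year column (the table and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a Delta table partitioned by Year, producing Year=<value> folders
df = spark.read.table("sales")
df.write.format("delta").mode("overwrite").partitionBy("Year").saveAsTable("sales_partitioned")

# OPTIMIZE combines smaller files into larger ones; VORDER applies V-Order while doing so
spark.sql("OPTIMIZE sales_partitioned VORDER")

# VACUUM removes unreferenced files older than the retention threshold (7 days by default)
spark.sql("VACUUM sales_partitioned")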
45. Optimize a pipeline
• In a data pipeline, you can add a copy activity.
• In the Settings tab, you can select:
• Intelligent throughput optimization.
o This is a combination of CPU, memory, network resource allocation, and
expected cost
o Choose from Auto (which is dynamic based on the sources and destinations),
Standard, Balanced and Maximum.
o You can also check "Use custom value" and specify a value between 4 and 256.
• Degree of copy parallelism
o This is similar to the maximum number of threads, operating in parallel.
o You can override the default value, and give a whole number between 1 and 32.
o At run time, the copy activity uses this as the maximum number – it may use fewer.
• Fault tolerance – what happens if there are errors while copying. You can skip
o Incompatible rows,
o Missing files,
o Forbidden files, and
o Files with invalid names.
• Enable logging – log copied files and skipped files and rows,
• Enable staging and Staging account connection (advanced).
o Staging is useful if you want to copy data:
▪ to/from Azure Synapse Analytics via PolyBase,
▪ to/from Snowflake, or
▪ From Amazon Redshift/Hadoop Distributed File System (HDFS).
o And
▪ if, for security reasons, you want to use only ports 80 and 443 (as opposed
to port 1433, which is the normal port for Azure SQL Database/Synapse
Analytics), or
▪ If it takes a while due to a slow network connection.
o However, as it costs more money, only use staging in a pipeline copy activity when
you need to do so.
46. Optimize a data warehouse
• To copy data into a warehouse from an Azure storage account with the fastest throughput, you can use the COPY INTO command.
o This starts with COPY INTO name_of_table. It could be followed by the columns
in brackets/parentheses.
o It then continues with FROM and the Azure storage account in single quotation
marks.
o It is often then followed by lots of options in a WITH ( ).
o In Fabric, you can copy Parquet and CSV files from an Azure Data Lake Storage
Gen2 account into a Fabric Warehouse.
o See https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-
sql for more details.
• Other possibilities are:
o Data pipelines for a no-code/low-code way to import large amounts of data, or data on a schedule.
o Data flows for data which can be transformed before being stored in the data
warehouse, and
o cross-warehouse ingestion, using a second data warehouse in the same
Microsoft Fabric workspace.
▪ You would then specify the second warehouse as follows:
• CREATE TABLE FirstWarehouse.Schema.Table AS
• SELECT * FROM SecondWarehouse.Schema.Table
• It is advisable to avoid INSERT statements which insert a single row if possible.
• A data warehouse caches data using a Solid State Disk (SSD) and memory.
o Additionally, the internal statistics may be out-of-date, and Fabric would need to
update them.
o For these reasons, the first few executions of a query will be slower than
subsequent queries.
o If the first run’s performance is crucial, you can manually create statistics
▪ CREATE STATISTICS Stat_Name ON Table_Name (Columns) WITH
FULLSCAN;
o And then update them with:
▪ UPDATE STATISTICS Table_Name (Stat_Name) WITH FULLSCAN;
o However, if the first run’s performance is not crucial, you should keep the
automatic statistics.
• INSERT, UPDATE and DELETE statements run as a single transaction.
o If they fail, they will be rolled back, which could take a long time.
o If possible, divide the statements into several statements with smaller amounts
of data.
• Use a star schema (instead of a snowflake schema) with fact tables and dimension
tables.
• Reduce the query size where possible.
o The SQL Query editor has a maximum of 10,000 rows.
o If you need more, use an external program like SQL Server Management Studio
(SSMS) or Azure Data Studio.
• Use the smallest data type for your columns. For example:
o smallint instead of bigint, if the values will fit.
o VARCHAR instead of CHAR, unless you have a column with a specific length.
o Numbers instead of strings, if possible.
o Integers instead of floating-point numbers if possible, as SORT, JOIN and GROUP BY work more quickly on integers compared with alternatives.
▪ For example, if a column contains the number of kilograms with a decimal
place (1.234 Kg), could it instead contain grams without a decimal place
(1234 g)?
• You should use Direct Lake mode where available.
• To check how the Warehouse is being used, use the Microsoft Fabric Capacity Metrics
app.
47. Optimize eventstreams and eventhouses
• To optimize an eventstream:
o Either:
▪ Click on the … next to the eventstream in the Workspace and go to
Settings, or
▪ In the Eventstream, click on the wheel icon in the Home tab.
o In the “Event throughput” tab, you can change the estimated amount of data
coming in and out.
▪ The eventstream will then be optimized for that level.
o You have these choices:
▪ Low (less than 10 Mb per second),
▪ Medium (between 10 and 100 Mb per second), and
▪ High (over 100 Mb per second).
▪ Note: Event throughput can only be increased, not decreased.
o In the “Retention” tab, you can change the retention period from 1 day (exactly 24
hours) up to 90 days.
• To view an Eventhouse dashboard:
o Click on the eventhouse in your workspace.
• You can see in a green bubble next to “System overview” the status:
o “Running” means running optimally.
o “Maintenance” means temporarily unavailable.
o “Suspended capacity” (this requires your capacity admin to reverse this),
o “Unknown” (unavailable for unknown reasons), and
o “Missing capacity” – because the Fabric compute capacity has been exceeded.
• You can also see the Eventhouse storage:
o Standard (cold) storage – less used data.
o Premium (hot) storage
o Eventhouse size (compressed) – the cold and hot storage combined.
• The default consumption is "On demand". However, you can change that by clicking on "Minimum consumption", and choose from a range from:
o "extra extra small" to "extra extra large", or
o custom (which is larger than Extra extra large).
o These CUs will always be used (and therefore charged for).
o In addition to the minimum CU level, you will also receive a Solid State Drive (SSD) capacity. The free storage level is:
▪ 20 Gb for extra extra small,
▪ 200 Gb for extra small,
▪ 800 Gb for small,
▪ Between 3,500 and 4,000 Gb for medium,
▪ Between 5,250 and 6,000 Gb for large,
▪ Between 7,000 and 8,000 Gb for extra large, and
▪ Between 10,500 and 12,000 Gb for Extra extra large.
o You can compare this with the “Eventhouse size”.
o You can Refresh the view or click on a KQL database to view it.
o You can also see:
▪ Activity in minutes – the compute operations (not identical to CUs) over
the last hour/day/week/month,
• If two queries are running at the same time taking 6 and 8 minutes,
the activity would be 14 minutes, but the CUs would be running for
8 minutes.
▪ Most queried databases over the last hour/day/week/month,
• Useful for seeing which databases are taking the most resources.
▪ Eventhouse details,
▪ Activity in minutes – Top 5 users over the last hour/day/week/month,
▪ What’s new – recent eventhouse events in the last 7 days:
• Creating/delete a database/external table,
• Creating/altering/deleting a table/materialized view/function,
• Altering a caching/retention/table update policy.
48. Optimize Spark performance
• See topic 44 for partitions, V-Order, Optimize Write, OPTIMIZE and VACUUM in a
lakehouse.
• When importing data, define a schema explicitly if possible.
o This means that Spark doesn’t have to work out what the schema could be based
on the data.
o Additionally, you can specify the data types, which may be better than the inferred
version. The types are:
▪ BinaryType, BooleanType,
▪ Numbers without decimal places: ByteType, ShortType, IntegerType,
LongType
▪ Numbers with decimal places: NumericType, DecimalType, DoubleType,
FloatType, FractionalType
▪ Strings: StringType
▪ Date and Time: DateType, TimestampType
• Create a pool where appropriate (see topic 1).
• Use a filter to reduce the number of rows as early as possible in the query (see the sketch at the end of this topic).
• Select only the columns that are needed for the query.
• You can use df.explain(True) for more details about the query execution.
• You can cache frequently used tables using:
o spark.sql("CACHE TABLE Name_Of_Table")
• You can also enable autotune, which adjusts the Spark configuration by monitoring
performance automatically.
o It is currently in Preview, and is disabled by default.
o It uses historic workloads to refine its configuration.
o It will not be used for an unusually large amount of data.
o To get the current status, use:
▪ spark.conf.get('spark.ms.autotune.enabled')
o To set it, use:
▪ %%pyspark
▪ spark.conf.set('spark.ms.autotune.enabled', 'true')
o It tunes:
▪ spark.sql.shuffle.partitions – the number of partitions for data shuffling during joins/aggregations. The default is 200 partitions.
▪ spark.sql.autoBroadcastJoinThreshold – the maximum size (in bytes) of a table that is broadcast to all worker nodes when a join is performed. The default is said to be 10 Mb (but in my version, it is 25 Mb).
▪ spark.sql.files.maxPartitionBytes – the maximum number of bytes for a partition when reading files. The default is 128 Mb.
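• As a short illustration of the points above (explicit schema, filtering early, selecting only the needed columns, explain() and caching) – the column names and path are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.getOrCreate()

# Define the schema explicitly so Spark does not have to infer it from the data
schema = StructType([
    StructField("OrderID", IntegerType()),
    StructField("Region", StringType()),
    StructField("Amount", IntegerType()),
    StructField("OrderDate", DateType())
])

df = spark.read.format("csv").option("header", "true").schema(schema).load("Files/orders/")

# Filter early and keep only the columns that the query needs
uk_orders = df.filter(df.Region == "UK").select("OrderID", "Amount", "OrderDate")

# Show the query execution details
uk_orders.explain(True)

# Cache a frequently used table
uk_orders.createOrReplaceTempView("uk_orders")
spark.sql("CACHE TABLE uk_orders")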
49. Optimize query performance
• You can implement Query folding in Dataflows.
• Avoid SELECT * and get only those columns you need.
• Use LIMIT (or TOP) to return only a few rows.
• Reduce data types.
o It takes storage to store longer data types than are needed, and network and compute resources to process them.
o Don't use CHAR(20) when VARCHAR(20) would work.
▪ Both allow for 20 characters.
▪ However, CHAR always takes 20 characters. If the majority of the data is less than 20 characters, then VARCHAR reduces the amount of storage that is needed.
o Use the smallest number type.
▪ Don't use bigint where smallint would store the data.
• Separate dates and times, and split any strings that can be separated.
o This would reduce cardinality – the number of different variations.
o Lower cardinality allows for better compression and storage, once dictionary encoding and compression have been employed.
• You can ask your database administrator whether your queries can be sped up – for
example, with indexes.
o Indexes should be used on columns which are used:
▪ In the WHERE clause, for searching,
▪ In the JOIN clause, where matches need to be made between tables,
▪ In the GROUP BY clause, for aggregating, and
▪ In the ORDER BY clause, so SQL has an index of the values for the
appropriate fields.
o Use SARGable conditions (SARG meaning Search ARGument ABLE). For example,
use:
▪ =, >, <, >=, <=, BETWEEN, LIKE, IS NULL, IS NOT NULL, IN. These are able
to make use of indexes.
▪ For LIKE, LIKE 'Hello%' can make use of indexes. LIKE '%Hello' cannot.
▪ For dates, BETWEEN '2026-01-01' and '2026-12-31 23:59:59' can make
use of indexes. Year(myField) cannot.
▪ Basically, avoid functions on columns if you can write the expression in a different way that is SARGable and can use an index.